What are Metadata and Hidden Data?
Metadata
Metadata refers to information “about” a record. For a Microsoft Word document, this could include information such as the author of the document, the date and time the document was created, the time spent editing the document, its length, its description, comments about the document, etc. For a digital photograph, the metadata might include the type of camera or other device used to create the image, the location where it was taken, and the date and time it was taken, as well as other information.
Electronic records, whether they are Word documents, PDF files, image files, or otherwise, almost always contain metadata which is embedded inside the record itself. Although this information can be retrieved using appropriate software, it is mostly invisible to anyone who isn’t looking for it – i.e., anyone who merely opens, views, edits, or prints the file.
Hidden Data
The term “hidden data” includes the metadata embedded in a file, but also has a broader meaning. Hidden data can refer to any extra information in a document or image which is not immediately visible to the user. For example, text can be hidden in a Word document by setting its colour to white, or by setting its background colour to black. While such text will generally not be visible when a document is viewed or printed, the text still exists in the Word document and can be retrieved (such as by changing the text style or through other means). Microsoft Word documents may also contain tracked changes and thereby retain deleted text, earlier versions of the text, and comments from reviewers. These changes, comments and deleted text are sometimes hidden from view when the document is opened, and may not appear in the printed version of the document either, but they can be easily retrieved by anyone who has the electronic version of the document.
Image files can contain “hidden data” too, such as a watermark that is not visible to human eyes but whose data can be retrieved using an appropriate software tool, or “artifacts” that can leave traces of how an image has been edited or modified.
Does your institution know what it is disclosing?
Does your institution regularly disclose electronic versions of documents to requestors when asked to do so? Have you ever given thought to exactly what information may be hidden in the electronic version of a document?
In today’s article, we’ll describe some of the hidden information that may be exposed through the disclosure of electronic records, and consider whether routinely disclosing electronic versions of documents is a safe practice from a privacy perspective.
Plain Text vs. Microsoft Word
Let’s demonstrate how much “extra” information can be hidden inside a basic Word file.
We’ll start by creating a small “plain text” file.
1. In Windows Notepad, type the text “This is a test.” and then save the file.
2. After saving this file as plain text, we can right-click on the saved file and then click on “Properties” to see the file’s size.
The size of the file is 15 bytes. This makes sense, as the text “This is a test.” is itself 15 characters long (including letters, spaces and the period at the end). For plain English text like this, each character takes one byte of disk space to store on the computer.
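The byte count can be checked with a few lines of Python. This is only a sketch: it assumes the file was saved in an encoding that uses one byte per plain English character (Windows “ANSI”, taken here to be code page 1252, or UTF-8 without a byte-order mark). The UTF-16 figure is included purely for contrast.

```python
# Count the bytes needed to store the sample sentence in a few encodings.
text = "This is a test."

sizes = {
    "characters": len(text),
    "cp1252 (Windows 'ANSI', assumed)": len(text.encode("cp1252")),
    "utf-8": len(text.encode("utf-8")),
    "utf-16-le": len(text.encode("utf-16-le")),
}

for label, size in sizes.items():
    print(f"{label}: {size}")
```

The first three figures are all 15, matching what the “Properties” dialog reports; only the two-bytes-per-character UTF-16 encoding gives 30.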
3. Next, let’s open this same file in Microsoft Word.
4. And now we’ll save it as a Microsoft Word file (with a .docx extension).
5. We can then right-click on the new file and then click on “Properties” to see the Word file’s size.
The Microsoft Word file containing just the text “This is a test.” is 12,164 bytes, or roughly 800 times the size of the original plain text file we created in Windows Notepad (12,164 ÷ 15 ≈ 811). That is enough space to store over 12,000 characters of plain text. That is a lot of data! A standard 8.5 x 11 inch page holds about 4,000 characters of text (single-spaced), so 12,164 characters is about three single-spaced pages’ worth of information.
Besides the 15 bytes required to store the text “This is a test.”, what other information is being stored in this Microsoft Word file? We can’t simply open the file in Windows Notepad to examine it directly, because not all of the information is stored in a human-readable format. Helpfully, Microsoft has published an article which describes the information stored in various Microsoft Office file formats, including Word, Excel, PowerPoint, and Visio.
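Although Notepad is of little help, a .docx file is a ZIP package containing a set of XML “parts”, so its contents can at least be listed with general-purpose tools. Here is a minimal Python sketch using only the standard library. The function name `list_container_parts` is our own, and the in-memory sample is a stand-in with two illustrative parts, not a real Word document.

```python
import io
import zipfile


def list_container_parts(docx_file):
    """Return (part name, uncompressed size in bytes) for each part in the ZIP package."""
    with zipfile.ZipFile(docx_file) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Stand-in for a real .docx: a tiny in-memory ZIP with two illustrative parts.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", "<document>This is a test.</document>")
    archive.writestr("docProps/core.xml", "<coreProperties/>")

for name, size in list_container_parts(sample):
    print(f"{size:>6}  {name}")
```

The same function accepts a path to a file on disk, so pointing it at the Word file saved in step 4 (whatever name it was given) would show the parts that account for the extra bytes.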
Types of Hidden Data in Microsoft Word Files
Let’s see what kinds of data may be contained in a Word document, according to Microsoft:
Word documents can contain the following types of hidden data and personal information:
- Comments, revision marks from tracked changes, versions, and ink annotations: If you’ve collaborated with other people to create your document, your document might contain items such as revision marks from tracked changes, comments, ink annotations, or versions. This information can enable other people to see the names of people who worked on your document, comments from reviewers, and changes that were made to your document, things that you might not want to share outside of your team.
- Document properties and personal information: Document properties, or metadata, include details about your document such as author, subject, and title. Document properties also include information that is automatically maintained by Office programs, such as the name of the person who most recently saved a document and the date when a document was created. If you used specific features, your document might also contain additional kinds of personally identifiable information (PII), such as e-mail headers, send-for-review information, routing slips, and template names.
- Headers, footers, and watermarks: Word documents can contain information in headers and footers. Additionally, you might have added a watermark to your Word document.
- Hidden text: Word documents can contain text that is formatted as hidden text. If you do not know whether your document contains hidden text, you can use the Document Inspector to search for it.
- Document server properties: If your document was saved to a location on a document management server, such as a Document Workspace site or a library based on Windows SharePoint Services, the document might contain additional document properties or information related to this server location.
- Custom XML data: Documents can contain custom XML data that is not visible in the document itself. The Document Inspector can find and remove this XML data.
From a privacy perspective, the following items from the excerpt above may raise concerns:
Comments and revision marks
A document may contain comments and revisions made by various contributors that the institution would never intentionally disclose. Such comments may be subject to a number of exemptions under the Freedom of Information and Protection of Privacy Act (FIPPA) and the Municipal Freedom of Information and Protection of Privacy Act (MFIPPA), with some of the most relevant being Advice or recommendations (FIPPA s.13 / MFIPPA s.7); Relations with other governments (FIPPA s.15); Third party information (FIPPA s.17 / MFIPPA s.10); Economic and other interests of Ontario (FIPPA s.18 / MFIPPA s.11); and Solicitor-client privilege (FIPPA s.19 / MFIPPA s.12). These comments and revision marks may not be immediately visible when a Microsoft Word document is opened and viewed or printed, but they will be included any time the electronic version is disclosed. They may also include text that the institution intended to delete from the document.
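To illustrate how easily this material can be pulled out, here is a rough Python sketch that looks inside a .docx package for tracked insertions, tracked deletions, and reviewer comments. It assumes the usual WordprocessingML layout (a `word/document.xml` part, plus `word/comments.xml` when comments exist). The function name and the sample content are hypothetical, and a real document may hold revisions in places this sketch does not examine (headers, footnotes, and so on).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + WORD_NS + "}"


def _joined_text(element, tag):
    """Concatenate the text of every <w:tag> element beneath the given element."""
    return "".join(node.text or "" for node in element.iter(W + tag))


def find_revisions(docx_file):
    """Return (kind, author, text) for tracked changes and comments in the main document."""
    found = []
    with zipfile.ZipFile(docx_file) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
        for ins in body.iter(W + "ins"):
            found.append(("inserted", ins.get(W + "author"), _joined_text(ins, "t")))
        for deletion in body.iter(W + "del"):
            found.append(("deleted", deletion.get(W + "author"), _joined_text(deletion, "delText")))
        if "word/comments.xml" in archive.namelist():
            comments = ET.fromstring(archive.read("word/comments.xml"))
            for comment in comments.iter(W + "comment"):
                found.append(("comment", comment.get(W + "author"), _joined_text(comment, "t")))
    return found


# Hypothetical sample: one visible sentence and one tracked deletion.
sample_xml = (
    f'<w:document xmlns:w="{WORD_NS}"><w:body><w:p>'
    '<w:r><w:t>The program was approved.</w:t></w:r>'
    '<w:del w:id="1" w:author="Reviewer A">'
    '<w:r><w:delText> Legal has concerns.</w:delText></w:r></w:del>'
    '</w:p></w:body></w:document>'
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)

print(find_revisions(sample))
```

In this sample the “deleted” sentence, along with the name of the reviewer who removed it, is still sitting in the file.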
Details about your document such as author, subject, and title
Disclosing the “author” of a document is not necessarily problematic – however, the “author” information may not be accurate. For example, a document created from a precedent may include the name of the author of the precedent rather than the author of the version of the document disclosed to the requestor.
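The author and “last modified by” fields are normally stored in a small XML part named `docProps/core.xml` inside the .docx package. The sketch below reads that part and returns its fields as a dictionary. The function name and the sample values are hypothetical, and the sketch assumes the standard core-properties layout.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_core_properties(docx_file):
    """Return the core document properties as {field name: value}."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    # Tags look like "{namespace}creator"; keep only the part after the brace.
    return {child.tag.split("}")[-1]: (child.text or "") for child in root}


# Hypothetical sample: a document created from someone else's precedent.
core_xml = (
    '<cp:coreProperties'
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    ' xmlns:dcterms="http://purl.org/dc/terms/">'
    '<dc:creator>Precedent Author</dc:creator>'
    '<cp:lastModifiedBy>Current Editor</cp:lastModifiedBy>'
    '<dcterms:created>2015-03-02T14:05:00Z</dcterms:created>'
    '</cp:coreProperties>'
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("docProps/core.xml", core_xml)

for field, value in read_core_properties(sample).items():
    print(f"{field}: {value}")
```

Here the “creator” is the author of the precedent, not the person who prepared the version being disclosed, which is exactly the kind of inaccuracy described above.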
Additional kinds of personally identifiable information (PII), such as e-mail headers, send-for-review information, routing slips, and template names
Notably, Microsoft states here that Word documents may contain personally identifiable information (PII) without providing an exhaustive list of what kinds of PII may be included, raising the spectre of an unauthorized disclosure of personal information in breach of FIPPA s.21 and MFIPPA s.14.
Hidden text
Documents can contain text that is not immediately visible when the document is opened or printed. For example, text may be “redacted” by changing its background colour to black. This will give the appearance of redacted text when the document is viewed, but the “redaction” can be easily reversed by anyone who receives the electronic version. Likewise, any white text on a white background will not immediately be visible, but can be easily viewed by anyone who has the original file.
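The following Python sketch shows why such “redactions” do not hold up. It scans the runs of text in a .docx file and flags any that are formatted as hidden, set in a white font, or covered by a black highlight. These three checks are our own simplification: Word offers other ways to obscure text (shading, text boxes, tiny fonts) that this sketch does not look for, and the function name and sample are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + WORD_NS + "}"


def find_concealed_runs(docx_file):
    """Return (reason, text) for runs whose formatting would conceal them on screen."""
    with zipfile.ZipFile(docx_file) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
    flagged = []
    for run in body.iter(W + "r"):
        props = run.find(W + "rPr")
        if props is None:
            continue  # run has no special formatting
        text = "".join(node.text or "" for node in run.iter(W + "t"))
        color = props.find(W + "color")
        highlight = props.find(W + "highlight")
        if props.find(W + "vanish") is not None:
            flagged.append(("hidden formatting", text))
        elif color is not None and (color.get(W + "val") or "").upper() == "FFFFFF":
            flagged.append(("white font", text))
        elif highlight is not None and highlight.get(W + "val") == "black":
            flagged.append(("black highlight", text))
    return flagged


# Hypothetical sample: one ordinary run and three concealed ones.
sample_xml = (
    f'<w:document xmlns:w="{WORD_NS}"><w:body><w:p>'
    '<w:r><w:t>Visible sentence. </w:t></w:r>'
    '<w:r><w:rPr><w:color w:val="FFFFFF"/></w:rPr><w:t>White-on-white note. </w:t></w:r>'
    '<w:r><w:rPr><w:highlight w:val="black"/></w:rPr><w:t>Redacted name</w:t></w:r>'
    '<w:r><w:rPr><w:vanish/></w:rPr><w:t>Hidden-format text</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)

for reason, text in find_concealed_runs(sample):
    print(f"{reason}: {text!r}")
```

Each flagged run comes back with its full text, which is the point: the formatting hides the words from a reader, not from the file.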
An additional concern: “Fast Saving”
Additionally, older versions of Microsoft Word include a “fast save” feature (the option no longer appears in current versions), which stores changes incrementally as they are made instead of generating a new file from scratch each time a document is saved. The advantage of “fast saving” is that it makes saving a document much faster; the downside is that potentially every note and change that was saved in the document at any point while it was being drafted may be preserved in the final electronic record.
So What’s the Problem?
Any data that is included in a Microsoft Word file is accessible to anyone who has a copy of the file and knows how to get to that information, potentially using software tools that are not available or not known to the disclosing institution.
Microsoft Word files are merely one example; the point of the deep-dive above was not to single out Microsoft Word files as particularly problematic. Rather, I am attempting to illustrate how nearly any electronic file format may contain various types of information that an institution would never intentionally disclose. For example, Excel files are similarly problematic:
These are some of the items that can be the source of hidden data and personal information in your Excel workbooks:
- Comments and ink annotations
- Document properties and personal information
- Metadata or document properties
- Headers and footers
- Hidden rows, columns, and worksheets
- Document server properties
- Custom XML data
- Invisible content
- External links
- Embedded files or objects
- Macros of VBA code
- Items that may have cached data
- Excel Surveys
- Scenario Manager scenarios
- Filters
- Hidden names
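As one concrete example from this list, hidden worksheets can be found by reading the `xl/workbook.xml` part of an .xlsx package, where each sheet may carry a `state` attribute of “hidden” or “veryHidden”. The sketch below is a simplified illustration: the function name and sample sheet names are hypothetical, and it does not look for hidden rows, columns, or any of the other items listed.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

SHEET_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def find_hidden_sheets(xlsx_file):
    """Return (sheet name, state) for every worksheet that is not visible."""
    with zipfile.ZipFile(xlsx_file) as archive:
        workbook = ET.fromstring(archive.read("xl/workbook.xml"))
    return [
        (sheet.get("name"), sheet.get("state"))
        for sheet in workbook.iter(SHEET_NS + "sheet")
        if sheet.get("state", "visible") != "visible"
    ]


# Hypothetical sample workbook with one visible and two hidden sheets.
workbook_xml = (
    '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<sheets>'
    '<sheet name="Summary" sheetId="1"/>'
    '<sheet name="Salaries" sheetId="2" state="hidden"/>'
    '<sheet name="Notes" sheetId="3" state="veryHidden"/>'
    '</sheets></workbook>'
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("xl/workbook.xml", workbook_xml)

print(find_hidden_sheets(sample))
```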
Microsoft advises using the “Document Inspector” to remove hidden data in Word and Excel documents:
It is a good idea to use the Document Inspector before you share an electronic copy of your [document], such as in an e-mail attachment.
Institutions may well wish to consider whether Microsoft’s Document Inspector can be employed to solve the conundrum of hidden data in Word, Excel and PowerPoint documents. But what about all of the other types of files an institution may use?
Other File Types That May Contain Hidden Data
Here are a number of other commonly used file types that may contain hidden information or metadata:
Image Files (e.g., JPG, PNG, TIFF)
JPG files contain “EXIF” data which is metadata about the image inside the file. Wikipedia notes:
Since the Exif tag contains metadata about the photo, it can pose a privacy problem. For example, a photo taken with a GPS-enabled camera can reveal the exact location and time it was taken, and the unique ID number of the device – this is all done by default – often without the user’s knowledge. Many users may be unaware that their photos are tagged by default in this manner, or that specialist software may be required to remove the Exif tag before publishing. For example, a whistleblower, journalist or political dissident relying on the protection of anonymity to allow them to report malfeasance by a corporate entity, criminal, or government may therefore find their safety compromised by this default data collection.
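Whether a given JPG file carries an Exif block at all can be checked without any special software. A JPG file is a sequence of marked segments, and Exif data is typically stored in an “APP1” segment whose contents begin with the bytes `Exif`. The Python sketch below walks the segments and reports the size of the Exif payload if one is found. It does not decode the individual tags (GPS position, device details, and so on), and the sample bytes are a hand-built stand-in, not a real photograph.

```python
import struct


def find_exif_payload(jpeg_bytes):
    """Return the size in bytes of the Exif payload, or None if no Exif segment is found."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file (missing start-of-image marker)")
    pos = 2
    while pos + 4 <= len(jpeg_bytes) and jpeg_bytes[pos] == 0xFF:
        marker = jpeg_bytes[pos + 1]
        if marker in (0xDA, 0xD9):  # start of scan / end of image: no more metadata
            break
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return len(payload)
        pos += 2 + length  # the length field counts itself but not the marker
    return None


def _segment(marker, payload):
    """Build one JPEG segment (used only for the stand-in samples below)."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload


# Hand-built stand-ins: segment headers only, no image data.
with_exif = (
    b"\xff\xd8" + _segment(0xE0, b"JFIF\x00")
    + _segment(0xE1, b"Exif\x00\x00" + b"\x00" * 8) + b"\xff\xd9"
)
without_exif = b"\xff\xd8" + _segment(0xE0, b"JFIF\x00") + b"\xff\xd9"

print(find_exif_payload(with_exif), find_exif_payload(without_exif))
```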
PNG files also contain metadata, including arbitrary text which may be embedded in the file as the “description” of the image. (For a lighter take, see “Hidden data in your image files”.)
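The text embedded in a PNG file is just as easy to reach. A PNG file is a signature followed by a series of “chunks”, and plain-text metadata is stored in chunks of type `tEXt` as a keyword, a zero byte, and then the text. Here is a minimal Python sketch that lists those entries. It ignores the compressed (`zTXt`) and international (`iTXt`) text chunks, does not verify checksums, and uses a hand-built stand-in with hypothetical content, not a viewable image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_text_chunks(png_bytes):
    """Return (keyword, text) for every tEXt chunk in the file."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    entries = []
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png_bytes):
        length, chunk_type = struct.unpack(">I4s", png_bytes[pos:pos + 8])
        data = png_bytes[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            entries.append((keyword.decode("latin-1"), text.decode("latin-1")))
        if chunk_type == b"IEND":
            break
        pos += 12 + length  # 4 length bytes + 4 type bytes + data + 4 checksum bytes
    return entries


def _chunk(chunk_type, data):
    """Build one PNG chunk (used only for the stand-in sample below)."""
    checksum = zlib.crc32(chunk_type + data)
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", checksum)


# Hand-built stand-in: header, one text chunk, end marker (no pixel data).
sample_png = (
    PNG_SIGNATURE
    + _chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + _chunk(b"tEXt", b"Description\x00Site visit, 14 Elm St.")
    + _chunk(b"IEND", b"")
)

print(read_png_text_chunks(sample_png))
```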
TIFF files can contain hundreds of different kinds of tags, and may even contain other embedded image files, such as JPG files, which may contain their own, separate metadata.
Adobe PDF files
PDF files may contain the author’s name, keywords, and copyright information, and may also contain arbitrary metadata. PDF files may also contain hidden text. In one notorious case, a court error led to “redacted” information in a case between Apple Inc and Samsung Electronics being viewable to the public. The court had blacked out certain text in the PDF version of its judgment which was released to the public. However, simply cutting-and-pasting the document revealed the hidden text that the court had intended to redact. “Properly” redacting a PDF file (so that the redacted text cannot be restored by the recipient of the file) requires additional software.
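To see why drawing a black box over text does not remove it, consider how a PDF page is stored: as a list of drawing instructions, in which the text and the box are separate steps. The Python sketch below pulls text out of those instructions regardless of what is drawn on top. It is deliberately crude and rests on several assumptions: it handles only streams that are uncompressed or use the common Flate compression, and only text written as a simple literal string followed by the `Tj` operator. Real PDF files often encode text in ways this will miss, and the sample is a hypothetical fragment, not a complete PDF file.

```python
import re
import zlib

STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Literal string operands of the Tj ("show text") operator, e.g. (Hello) Tj
SHOWN_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def text_in_content_streams(pdf_bytes):
    """Return literal strings drawn with Tj, whether or not anything covers them."""
    found = []
    for match in STREAM.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # decompressobj tolerates the line break that precedes "endstream"
            data = zlib.decompressobj().decompress(data)
        except zlib.error:
            pass  # not Flate-compressed; scan the raw bytes instead
        found.extend(m.group(1).decode("latin-1") for m in SHOWN_TEXT.finditer(data))
    return found


# Hypothetical page content: draw a line of text, then paint a black box over it.
content = (
    b"BT /F1 12 Tf 72 720 Td (Settlement amount: 250000) Tj ET\n"
    b"0 0 0 rg 70 716 220 14 re f\n"
)
compressed = zlib.compress(content)
sample_pdf = (
    b"%PDF-1.4\n4 0 obj\n<< /Length " + str(len(compressed)).encode()
    + b" /Filter /FlateDecode >>\nstream\n" + compressed + b"\nendstream\nendobj\n"
)

print(text_in_content_streams(sample_pdf))
```

The black rectangle changes what a reader sees on the page, but the sentence underneath is still returned in full.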
Email Files (e.g., Outlook .msg files and .pst files)
All email files contain metadata. Some is related to the message: “To”, “From”, “CC”, “BCC”, “Subject”, “Date sent”, “Time sent”, etc. The exact metadata stored in an email message may depend on the email client used and the server used to store and transmit the message.
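Outlook’s .msg and .pst files are proprietary formats that need dedicated tools, but the same point can be illustrated with a message in the standard internet format (often saved with an .eml extension), which Python’s standard library can read. The sketch below simply lists every header in a message. The function name is our own, and the message, addresses, and software name are invented for illustration.

```python
from email import message_from_string, policy


def list_headers(raw_message):
    """Return (header name, value) for every header in a standard-format email."""
    message = message_from_string(raw_message, policy=policy.default)
    return [(name, str(value)) for name, value in message.items()]


# Invented sample message using reserved example domains and addresses.
raw = (
    "Received: from mail.example.org (mail.example.org [192.0.2.10])\n"
    "\tby mx.example.net; Mon, 6 Jan 2020 09:15:02 -0500\n"
    "Message-ID: <abc123@example.org>\n"
    "X-Mailer: Example Mail Client 1.0\n"
    "From: Analyst <analyst@example.org>\n"
    "To: Director <director@example.org>\n"
    "Subject: Draft briefing note\n"
    "Date: Mon, 6 Jan 2020 09:15:00 -0500\n"
    "\n"
    "Body text.\n"
)

for name, value in list_headers(raw):
    print(f"{name}: {value}")
```

Besides the familiar “From”, “To”, and “Subject” lines, the output includes routing and software details (“Received”, “Message-ID”, “X-Mailer”) that most email programs never display.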
Other File Types
There’s no such thing as an exhaustive list of file types. One website, “fileinfo.com”, offers information on “over 10,000 file extensions and software programs,” which shows just how many file types are out there.
What steps can an institution take to ensure “hidden data” is not unintentionally disclosed?
Institutions should employ a strategy that can handle both familiar file types (such as Word documents and PDFs) and unusual file types that may not have much associated documentation or any dedicated tools for removing hidden data.
There are two general approaches institutions can take to ensure they do not disclose “hidden data” unintentionally:
- Use software tools to eliminate metadata and hidden data from electronic files
For many commonly-used file formats, there are software tools that are intended to help organizations prepare such files for disclosure. Sometimes, these tools are designed for litigants and lawyers’ offices, but they can also be used by provincial and municipal institutions as part of their FOI programs.
The problem with using software tools to eliminate metadata and hidden data is that different file types will often require different software tools, and the tools themselves may be expensive and/or hard to learn. For Adobe PDF documents, for example, Adobe offers “Adobe Acrobat Pro DC”, which allows organizations to make permanent irreversible redactions, and to remove metadata, hidden text, and other kinds of hidden data from PDF files. Unfortunately, there can be a significant time investment involved in learning how to use this tool properly, and the software is available only via a paid subscription from Adobe. And even if an institution chooses to adopt this tool for PDF documents, other tools may still be needed for other file types that the institution may use (e.g., Word documents, Excel documents, PowerPoint, various image types, and email files), either to remove hidden data from the original files themselves, or to convert from various file formats into another format (perhaps TIFF or PDF) that the institution is able to process reliably.
An institution may wish to consider employing a strategy of first converting all of the files that it is intending to disclose into a single format, such as TIFF or PDF, and then using a specialized tool on the resulting files (which will now all be in the same format) to ensure that no hidden data remains in the documents. For example, an institution may wish to convert all of its responsive Word documents, Excel documents, image files, and email files into PDF files, and then use a tool such as Adobe Acrobat Pro DC to reliably redact and remove any hidden data from the resulting PDF files. This multi-step process may be somewhat burdensome for the institution but should result in “clean” electronic files that are ready to be disclosed.
- Use paper disclosure rather than electronic disclosure
The main advantage of the “paper method” is that it works for nearly all document and image file formats. This method generally doesn’t require consultations with information technology staff or the adoption of new software – it’s a one-size-fits-all solution that works for just about any electronic record that a requestor might ask for. Some institutions may not have easy access to software tools or to information technology staff who can advise them on how to process various file formats to effectively limit the risk of hidden data being disclosed. For such institutions, the “paper method” may be the most reasonable option.
By printing a document, an FOI professional can ensure they are seeing all of the content that will eventually be disclosed to the requestor. Paper copies don’t come with metadata and generally don’t include “hidden data” (at least not as that term is used with respect to electronic documents). Paper can be an extremely convenient format to work with when categorizing and redacting documents. Further, no lengthy learning or training period is required when working with paper, unlike some of the software tools that an institution may consider using as an alternative. The main downside of paper is that requestors will occasionally request that documents be provided in an electronic format, and (as will be discussed in a future article), there are Information and Privacy Commissioner of Ontario (IPCO) decisions that indicate that a requestor generally has the right to receive documents in the format they request. That said, printouts can always be scanned back into PDF, TIFF or another digital format as a straightforward “final step” if the institution wishes to comply with a request to produce records electronically; the scanning process does not restore any of the hidden data that may have been in the digital original.
Note that there is the potential for a fee dispute with the requestor if an institution attempts to charge for copies made in order to prepare the documents for disclosure but which the requestor never actually receives. Whether an institution would be able to successfully defend the costs of printing documents and then scanning them to remove metadata may depend in part on whether IPCO can be convinced during the appeal process that printing the documents was a necessary part of fulfilling the access request. (Further guidance on the fees that an institution may be permitted to charge when providing a requestor with digital disclosure will be set out in a future article.)
A warning about “print-to-PDF”
Be warned: A PDF generated by simply using software to convert a document to PDF, or by using the “print to PDF” option, may still include hidden information. Specifically, “invisible text”, such as white text on a white background, or regular text which has been obscured by another shape (perhaps in a naïve attempt to redact the text), may be preserved in the resulting PDF. Simply converting a document to PDF, without taking the additional step of using a tool such as Adobe Acrobat Pro DC to remove hidden data from the resulting PDF, may result in a document whose hidden or “redacted” text is easily retrieved by anyone who knows how to find it, similar to how the redacted text was retrieved from the published judgment in the Apple v. Samsung case mentioned earlier.
Additionally, depending on which tool is being used to convert the original document to PDF and how it is configured, some of the metadata embedded in the original document may also be preserved in the resulting PDF document.
Would IPCO order an institution to disclose the original electronic versions despite these concerns?
Requestors often ask for documents to be provided electronically. A requestor might appeal an institution’s decision to only provide printed copies, or even to print-and-scan electronic documents before disclosing them to the requestor.
There are three main objections a requestor might raise if an institution insists on producing paper copies, or even “printed-then-scanned” electronic copies:
- The resulting records are missing metadata and hidden data that may have been contained in the original;
- The resulting records are more difficult to search and navigate;
- The requestor may object to being charged for “unnecessary” printouts (whether or not such printouts are provided to the requestor or are merely created as an intermediate step).
As a rebuttal, an institution might respond that any metadata and hidden data in the documents were created involuntarily and unintentionally and the institution does not consider such data to be part of the record. The institution might take the position that it is unable to know what hidden data is stored inside each record and that it lacks the technical ability to ensure that personal information, confidential information, and other excludable and exempt information is not disclosed unless it simply removes all hidden data prior to disclosing the records to the requestor. And if the institution is able to convince IPCO that the “print-then-scan” method is the most reasonable way for the institution to remove hidden data, they may potentially succeed on the costs point as well. (Certainly the institution could offer to send the requestor both the printed and the scanned versions of the documents, at the requestor’s option.)
If an institution raised concerns about metadata and other hidden data on appeal, would IPCO ever order an institution to release the original electronic versions of documents, despite the institution’s concerns? There does not yet appear to be any IPCO decision directly on point. Of course, any particular decision may turn on the specific facts, e.g., had the institution genuinely considered the hidden data issue for the particular files subject to disclosure? What were the results of its analysis? (For example, a few file types, such as “plain text” (.txt) files and “comma separated values” (.csv) files, generally do not contain metadata or hidden data.) But I would note that IPCO has two competing concerns here: IPCO’s role is to “provide the public with a right of access to government-held information and access to their own personal information” while also “ensuring that any personal information held by public institutions and health care providers will remain private and secure.” An IPCO adjudicator is not in a good position to assess what metadata and other hidden data may be incorporated in the original electronic versions of documents, nor are they in a position to mandate exactly what steps an institution must take to ensure hidden data is not disclosed in error. In a hypothetical conflict between a requestor’s desire to navigate through disclosure documents more easily versus the institution’s obligation to keep personal, confidential and other exempt information secure, I suspect that IPCO may hesitate before “ordering out” original electronic documents, as doing so might well result in IPCO itself being directly responsible for a breach of privacy or confidentiality.
Links to Resources:
Freedom of Information and Protection of Privacy Act (FIPPA)
https://www.ontario.ca/laws/statute/90f31
Municipal Freedom of Information and Protection of Privacy Act (MFIPPA)
https://www.ontario.ca/laws/statute/90m56
Microsoft: Remove hidden data and personal information by inspecting documents, presentations, or workbooks
https://support.office.com/en-gb/article/remove-hidden-data-and-personal-information-by-inspecting-documents-presentations-or-workbooks-356b7b5d-77af-44fe-a07f-9aa4d085966f
James Marshall: How to Disable the Microsoft Word Fast Save Feature (Lifewire)
https://www.lifewire.com/disable-the-fast-save-feature-word-3539746
Wikipedia: Exif
https://en.wikipedia.org/wiki/Exif
PNG (Portable Network Graphics) Specification, Version 1.2, s.4. Chunk Specifications
http://www.libpng.org/pub/png/spec/1.2/PNG-Chunks.html#C.Anc-text
Colt McAnlis: Hidden data in your image files
https://medium.com/@duhroach/hidden-data-in-your-image-files-a68ad61081b8
Sustainability of Digital Formats: Planning for Library of Congress Collections
Still Images: Tags for TIFF, DNG, and Related Specifications
https://www.loc.gov/preservation/digital/formats/content/tiff_tags.shtml
Adobe: PDF properties and metadata
https://helpx.adobe.com/acrobat/using/pdf-properties-metadata.html
Debra Cassens Weiss: Cut-and-Paste Reveals Redacted Info on Apple Smartphone Market in Federal Judge’s Opinion (ABA Journal)
http://www.abajournal.com/news/article/cut-and-paste_reveals_redacted_info_on_apple_smartphone_market_in_federal_j/
Lisa Needham: How To (Properly) Redact a PDF (Lawyerist)
https://lawyerist.com/how-to-redact-a-pdf/
Peter Coons: eDiscovery Update Email file types and how to handle them (Legalnews.com)
http://legalnews.com/washtenaw/1427534
FileInfo – The File Extensions Database
https://fileinfo.com
Adobe Acrobat Pro DC
https://acrobat.adobe.com/ca/en/acrobat/pricing.html
Information and Privacy Commissioner of Ontario: Role and Mandate
https://www.ipc.on.ca/about-us/role-and-mandate/
Information and Privacy Commissioner of Ontario: Decisions
https://decisions.ipc.on.ca/ipc-cipvp/en/nav.do