File Forensics: Unzipping Word Documents to See the XML Source

Have you ever tried to open a Word .docx file in Notepad? If so, then you know that you get a screen full of random mess that looks something like this:

If the document is written in XML, you would expect to see formatted, readable text. So why don’t you? The clue is the first two readable characters in that jumble: “PK”.

The answer is that Word data files are zipped! Since the DOS days, zip files viewed as text have started with the characters PK, the initials of Phil Katz, who created the format. All you need to do is run the .docx file through an unzip program and you can see several files and folders full of XML data:
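If you would rather script this than use an unzip program, here is a minimal Python sketch of the same idea. It uses only the standard `zipfile` module. The function names are my own, made up for illustration, and you would pass in the path to whatever .docx file you are examining.

```python
import zipfile

# First two bytes of a zip file, as described above.
ZIP_MAGIC = b"PK"


def looks_like_zip(path):
    """Return True if the file starts with the two-byte 'PK' signature."""
    with open(path, "rb") as handle:
        return handle.read(2) == ZIP_MAGIC


def list_docx_parts(path):
    """Return the names of the files stored inside the .docx container."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def extract_docx(path, target_dir):
    """Unpack every file in the .docx into target_dir for browsing."""
    with zipfile.ZipFile(path) as archive:
        archive.extractall(target_dir)
```

Calling `looks_like_zip` first is a cheap sanity check before handing the file to `zipfile`. Only extract documents you trust, or extract them into a scratch folder, since you are unpacking someone else's archive.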

The files can now be opened in Notepad, but if you just double-click them, they will usually open in your web browser and be a bit more readable. Browse through the newly created folders and you will find a ton of formatting information and the complete text of the document.

But you will also find information that could be very useful for forensics, including the file revision number, creation and modification dates, the document creator, and who last modified the document:
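The metadata listed above can also be pulled out with a short script. This is a rough Python sketch that assumes the fields live in `docProps/core.xml` inside the archive, which is where Word normally stores them. The function name and the dictionary keys are my own choices, and any field missing from a given document comes back as `None`.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Friendly label -> element name inside docProps/core.xml.
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "revision": "cp:revision",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(path):
    """Return a dict of basic authorship metadata from a .docx file."""
    with zipfile.ZipFile(path) as archive:
        # Assumption: the document has a docProps/core.xml part.
        # A KeyError is raised here if it does not.
        root = ET.fromstring(archive.read("docProps/core.xml"))
    result = {}
    for label, tag in FIELDS.items():
        element = root.find(tag, NS)
        result[label] = element.text if element is not None else None
    return result
```

Keep in mind that these values are only what the authoring software wrote into the file. They can be edited or stripped, so treat them as leads rather than proof.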

Apparently, this type of forensics was used to catch the man who put a collar bomb on a high school student in Australia. Forensic examiners found the bomber’s name hidden in documents on a USB drive draped around the victim’s neck.

For more information, including a forensics recreation, check out “Forensic Examinations 5 – File Signatures, Metadata And The Collar Bomber – Part 2”.
