File Types | João Paulo Barraca

Lecture Notes

Introduction to how files are structured, MIME types and some simple obfuscation.

Download here

Practical Tasks

An important part of reverse engineering (RE) is related to file formats. We will explore some subtleties of this in another class, but for now we will focus on the basic identification of files and their contents. This is mandatory for any RE task and is usually the first step: take a look at the binary and determine what it is. If some parts are missing, they may need to be reconstructed.

Most file formats begin with a header, a few bytes that describe the file type and version. Because there are several incompatible file formats with the same extension (for example, .doc and .cod), the header gives a program enough additional information to see if this file is one of the formats that program can handle. Many programmers package their data in some sort of “container format” before writing it out to disk. If they use the standard gzip format (which the zlib library can produce) to hold their data in compressed form, the file will begin with the 2 bytes 0x1f 0x8b (in decimal, 31 139). A raw zlib stream instead usually begins with the byte 0x78.
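As a minimal sketch of header-based identification, the snippet below compares the first bytes of a file against a small table of well-known magic numbers, including the 0x1f 0x8b pair mentioned above. The table and the function names (identify, identify_file) are illustrative assumptions; real tools such as file rely on far larger signature databases.

```python
# Illustrative signature table (an assumption for this sketch, not exhaustive).
SIGNATURES = [
    (b"\x1f\x8b", "gzip"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"\x7fELF", "ELF executable"),
]


def identify(header: bytes) -> str:
    """Guess a file type from the bytes at the very start of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first bytes of a file and identify them."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))


print(identify(b"\x1f\x8b\x08\x00"))  # header of a gzip stream
print(identify(b"%PDF-1.7"))
```

Matching only at offset 0 is deliberate here: it mirrors the simplest case, where the header is intact and in its expected place.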

Some files are made up largely of blank space, for example the .DS_Store files generated by macOS (formerly OS X). Blank space appears as a series of 0’s in a hex editor. The creators of a file format may add blank space for a variety of reasons. For example, the author of a study on .DS_Store files speculated that the blank space exists to speed up writing, as other data does not need to be pushed around to make room. It could also serve to prevent fragmentation.

For most purposes, blank space can be ignored.
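To know which regions can be skipped, it may help to list where the blank space is. The following sketch reports every run of zero bytes above a chosen length; the function name zero_runs and the default threshold of 16 bytes are arbitrary choices for illustration.

```python
import re


def zero_runs(data: bytes, min_len: int = 16):
    """Return (offset, length) for each run of at least min_len zero bytes."""
    pattern = re.compile(b"\\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


sample = b"AB" + bytes(20) + b"CD" + bytes(4) + b"E"
print(zero_runs(sample))             # only the 20-byte run passes the default threshold
print(zero_runs(sample, min_len=4))  # a lower threshold also reports the 4-byte run
```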

File format reverse engineering is the domain of hex editors, signature matching, pattern matching and even visual analysis. Hex editors are used more often to display file contents than to edit them. They also allow you to superimpose a data structure on top of the data (sometimes called custom views or similar), which is very helpful. Once a particular structure has been discovered in a file, these mechanisms can be used to document the structure, as well as to provide a more meaningful display of the information than just hex code.

Also useful are Unix/Linux tools such as strings and file, as well as TrID.
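To make clear what a tool like strings does, here is a rough Python approximation: it finds runs of printable ASCII characters of a minimum length and reports their offsets. This is a simplified sketch, not the actual implementation of strings, and the name extract_strings is hypothetical.

```python
import re


def extract_strings(data: bytes, min_len: int = 4):
    """Return (offset, text) for each run of printable ASCII of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]


blob = b"\x00\x01hello\x00ab\x00world!\xff"
for offset, text in extract_strings(blob):
    print(f"{offset:08x}  {text}")
```

Knowing the offset of each string is often as useful as the string itself, because it hints at where text sections or metadata blocks start.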

Another approach is to use visual identification in order to understand the pattern presented by the bytes. The idea is not new, and you can find an academic reference in “Visual Reverse Engineering of Binary and Data Files” by Gregory Conti et al. Tools such as ImHex, binvis and veles may help you, especially if you have another file of the same type to compare against.

Strategies

Look for the obvious first, e.g. magic numbers, a block structure, or ASCII text in the file. Anything that can be identified more or less clearly can be the entry ticket to more. Once a particular structure has been identified, look for in-file pointers to that data, e.g. places where the data is referenced from some other part of the file with an absolute or relative address. It is also very important to find out the byte order (little endian or big endian).
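One simple way to reason about byte order is to decode the same bytes both ways and see which value makes sense. In the sketch below, a 4-byte field suspected to be an offset or length is read as little endian and as big endian, and only the interpretations that fit inside the file are kept. The helper names and the "must not exceed the file size" heuristic are assumptions made for illustration.

```python
import struct


def read_u32_both(raw: bytes, offset: int = 0):
    """Decode an unsigned 32-bit field as (little endian, big endian)."""
    little = struct.unpack_from("<I", raw, offset)[0]
    big = struct.unpack_from(">I", raw, offset)[0]
    return little, big


def plausible_orders(raw: bytes, offset: int, file_size: int):
    """Keep the byte orders whose decoded value could be an in-file offset or length."""
    little, big = read_u32_both(raw, offset)
    candidates = (("little", little), ("big", big))
    return [name for name, value in candidates if value <= file_size]


field = bytes.fromhex("00000100")
print(read_u32_both(field))             # (65536, 256)
print(plausible_orders(field, 0, 300))  # ['big'] for a 300-byte file
```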

Choosing the target

If you have access to the software that created the file, you can always create files with the contents of your choosing. This makes reverse engineering substantially easier. In cryptography terms, you are engaging in a chosen-plaintext attack.
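A practical way to exploit this is to save two files that differ in a single, known piece of content and then compare them byte by byte. The sketch below lists the byte ranges that differ between two buffers; the name diff_ranges is hypothetical, and the comparison assumes the change did not shift the rest of the data (only the common prefix length is compared).

```python
def diff_ranges(a: bytes, b: bytes):
    """Return half-open (start, end) ranges where the two buffers differ."""
    ranges = []
    start = None
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = index          # a differing run begins here
        elif start is not None:
            ranges.append((start, index))
            start = None
    if start is not None:              # difference runs to the end of the shorter buffer
        ranges.append((start, min(len(a), len(b))))
    return ranges


first = b"HDR\x05hello"
second = b"HDR\x05jello"
print(diff_ranges(first, second))  # [(4, 5)]
```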

Probing

Once you formulate a theory as to what some data in the file might mean, you can verify that theory by creating a manipulated file. Replace the data with some other data using a hex editor or a custom tool, then load the manipulated file into the original application. If the application loads the file and displays the intended change, the theory is probably correct. Sometimes it is not trivial to make a change and reload it because of defense mechanisms that may be present. Some applications check the hash and signature of the code before running it.
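A custom tool for probing can be very small. The sketch below overwrites bytes at a given offset without changing the file length and writes the result to a new file, so the original stays untouched. The function names, and the choice to refuse patches that would grow the file, are assumptions made for this example.

```python
from pathlib import Path


def patch_bytes(data: bytes, offset: int, new: bytes) -> bytes:
    """Overwrite len(new) bytes at offset, keeping the total length unchanged."""
    if offset < 0 or offset + len(new) > len(data):
        raise ValueError("patch does not fit inside the data")
    return data[:offset] + new + data[offset + len(new):]


def patch_file(src: str, dst: str, offset: int, new: bytes) -> None:
    """Write a patched copy of src to dst; the original file is never modified."""
    Path(dst).write_bytes(patch_bytes(Path(src).read_bytes(), offset, new))


print(patch_bytes(b"HELLO WORLD", 6, b"THERE"))  # b'HELLO THERE'
```

Keeping the length constant avoids invalidating any offsets or size fields stored elsewhere in the file, which would otherwise make the experiment harder to interpret.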

Compression and scrambling

File formats which are either in part or completely compressed, encrypted or scrambled are among the toughest nuts to crack. Of course, compression is different from encryption, and typically done for a different purpose. However, the resulting file formats often look similar: A bunch of gibberish. This is the intended result when file format designers go for encryption, but it is also often a desired side effect when compression is applied.

If inspecting a file with a hex editor or similar tool reveals only gibberish, with no easily identifiable text strings or patterns, this may indicate that the file is compressed, encrypted or scrambled. The methods for reverse engineering these files are similar. There might, however, be a big difference from a legal point of view. Many countries have laws against circumventing copy protection, and encryption can be seen as some kind of copy protection. See Reverse Engineering/Legal Aspects for some more hints regarding this, and seek qualified legal advice before attempting to reverse engineer an encrypted or otherwise protected file format. Similar issues might arise when a file format just uses scrambling. The format “owner” might argue that the scrambling is used as some kind of copy protection or encryption, and that circumventing it might break some law. Again, seek qualified legal advice.
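A common way to quantify "gibberish" is the Shannon entropy of the byte values: values close to 8 bits per byte suggest compressed or encrypted data, while text and structured data usually score lower. The sketch below computes it for a buffer; any threshold used to decide what counts as "high" is a judgment call, not a fixed rule.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


print(shannon_entropy(b"AAAAAAAA"))        # a single repeated byte carries no information
print(shannon_entropy(bytes(range(256))))  # every byte value equally likely: the maximum
```

Computing the entropy over a sliding window instead of the whole file gives the kind of entropy plot that many hex editors display, and can reveal where a compressed or encrypted region starts and ends.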

Tasks

For this task, determine the filetype of all files in this package, and access their content. Take notes of your attempts and record what steps you took to find the file type. Some files are corrupted and may require some additional bytes. Some files are correlated. Some files are strange.

The files are split into four folders, according to the technique used.

1 - extension: The file extension is correct and matches the content.
2 - magic: These are the same files, but the file extension is removed. Find potential mismatches, and see how the different files appear under a hex editor.
3 - obfuscated: These are not the same files; the magic header is corrupted or shifted and the extension is removed. Using what you observed from the previous files, what can you find here? New formats are also added! Solutions will be presented in the next class.
4 - polyglots: These are examples of polyglots, to illustrate the concept. For information regarding the file formats, check https://github.com/corkami/pics
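When a magic header has been shifted away from the start of the file, or when one file hides inside another as in a polyglot, searching for known signatures at every offset can help. The sketch below does this for a handful of signatures; it is a much simplified take on what tools like binwalk do, the table is illustrative, and short signatures will produce false positives on real data.

```python
# Illustrative signatures only (an assumption for this sketch).
SIGNATURES = {
    "PNG": b"\x89PNG\r\n\x1a\n",
    "ZIP": b"PK\x03\x04",
    "PDF": b"%PDF-",
    "gzip": b"\x1f\x8b\x08",
}


def scan_signatures(data: bytes):
    """Return a sorted list of (offset, name) for every signature occurrence."""
    hits = []
    for name, magic in SIGNATURES.items():
        position = data.find(magic)
        while position != -1:
            hits.append((position, name))
            position = data.find(magic, position + 1)
    return sorted(hits)


blob = b"junk" + b"\x89PNG\r\n\x1a\n" + b"..." + b"PK\x03\x04"
print(scan_signatures(blob))  # [(4, 'PNG'), (15, 'ZIP')]
```

A hit at a nonzero offset is only a hint: the bytes before it may be padding added on purpose, or the match may be a coincidence inside unrelated data.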

For each file:

Determine the file type
Take notes about the file content, and your doubts and strategies
If a file has sub-files or sub-elements, extract them

Assume that files may be malicious! Do not try to execute them outside a secure environment.

As tools, we recommend using:

A hex editor such as ImHex, Hex Workshop, HxD or Okteta
Command line tools such as file, strings, binwalk or foremost
A visualizer such as Binocle, Veles or Binvis.io, or the entropy visualization commonly present in hex editors

References

With a broader scope, but as mandatory reading for the curious mind, also check the great works at PoC||GTFO: https://pocorgtfo.hacke.rs/
Some contents originated from https://en.wikibooks.org/wiki/Reverse_Engineering/File_Formats and are made available under the CC-BY-SA license.