Intro To Reverse Engineering
An introduction to Reverse Engineering (RE): its objectives, scope and relevant concepts, followed by a brief discussion of the types of tools available, with examples.
An important part of RE is related to file formats. We will explore some subtleties of this in another class, but for now we will focus on the basic identification of files and their contents. This is mandatory for any RE task and is usually the first step: take a look at the binary and determine what it is. If some parts are missing, they may need to be reconstructed.
Most file formats begin with a “header,” a few bytes that describe the file type and version. Because some extensions (for example, “.doc” and “.cod”) are shared by several incompatible file formats, the header gives a program enough additional information to decide whether the file is in one of the formats it can handle. Many programmers package their data in some sort of “container format” before writing it out to disk. If they use the standard gzip format (commonly written with zlib) to hold their data in compressed form, the file will begin with the 2 bytes 0x1f 0x8b (in decimal, 31 139). A raw zlib stream has a different header, most commonly starting with the byte 0x78.
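As a minimal sketch of header-based identification, the following compares the first bytes of a file against a small signature table. The table, the function names and the labels are illustrative assumptions; real tools such as file(1) use far larger databases.

```python
import gzip

# Illustrative signature table (an assumption for this sketch, far from complete).
MAGIC_SIGNATURES = [
    (b"\x1f\x8b", "gzip"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"\x7fELF", "ELF executable"),
]


def identify(header: bytes) -> str:
    """Return the label of the first signature that matches, else 'unknown'."""
    for magic, label in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first 16 bytes of a file and identify them."""
    with open(path, "rb") as f:
        return identify(f.read(16))


if __name__ == "__main__":
    # gzip output produced in memory starts with 0x1f 0x8b.
    print(identify(gzip.compress(b"hello")))
    print(identify(b"no known header here"))
```

Note that a ZIP signature only identifies the container: many document and package formats are ZIP archives underneath, so the contents still need to be inspected.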
Some files are made up largely of blank space, for example, the .DS_Store files generated by OS X. Blank space will appear as a series of 0’s in a hex editor. The creators of a file format may add blank space for a variety of reasons. For example, the author of one study on .DS_Store files speculated that the blank space exists to speed up writing data, as other data would not need to be pushed around to make room. It could also serve to prevent fragmentation.
For most purposes, blank space can be ignored.
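To see where the blank regions are before deciding to ignore them, a short script can list runs of zero bytes. This is only a sketch; the function name and the 16-byte threshold are arbitrary choices made for illustration.

```python
import re


def zero_runs(data: bytes, min_len: int = 16):
    """Return (offset, length) for every run of at least min_len zero bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


if __name__ == "__main__":
    sample = b"HEAD" + b"\x00" * 32 + b"DATA" + b"\x00" * 4 + b"TAIL"
    for offset, length in zero_runs(sample):
        print(f"zero run at offset {offset:#x}, {length} bytes")
```

The offsets where blank regions start and end often coincide with block or record boundaries, which is useful when looking for structure later.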
File format reverse engineering is the domain of hex editors. They are typically used to display file contents rather than to edit them. Many hex editors allow you to superimpose a data structure on top of the data (a feature sometimes called custom views or similar), which is very helpful. Once a particular structure has been discovered in a file, these mechanisms can be used to document the structure, as well as to provide a more meaningful display of the information than just hex code.
Also useful are Unix/Linux tools like strings(1) and file(1).
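When strings(1) is not at hand, its basic behavior is easy to approximate. The sketch below extracts runs of printable ASCII together with their offsets; the function name is made up for this example, and the minimum length of 4 mirrors the usual default of strings(1).

```python
import re


def ascii_strings(data: bytes, min_len: int = 4):
    """Return (offset, text) for each run of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]


if __name__ == "__main__":
    sample = b"\x00\x01RIFF\x10\x00WAVEfmt \x00ab"
    for offset, text in ascii_strings(sample):
        print(f"{offset:#06x}  {text!r}")
```

Keeping the offsets matters: a string that always appears at the same position across several files is a good candidate for a header field or magic number.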
Another approach is to use visual identification in order to understand the pattern presented by the bytes. The idea is not new and you can find an academic reference in “Visual Reverse Engineering of Binary and Data Files” by Gregory Conti et al. Tools such as HexWorkshop, binvis and veles may help you, especially if you have another file of the same type to compare against.
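A very simple form of visual inspection can be sketched without any of these tools: write each byte as one grayscale pixel of a binary PGM image and open the result in an image viewer. The function name and the default width of 256 pixels are assumptions of this sketch, not how the tools above work internally.

```python
def byte_plot_pgm(data: bytes, width: int = 256) -> bytes:
    """Encode data as a binary PGM (P5) image, one grayscale pixel per byte."""
    height = max(1, -(-len(data) // width))  # ceiling division, at least one row
    padded = data.ljust(width * height, b"\x00")  # pad the last row with black
    header = b"P5\n%d %d\n255\n" % (width, height)
    return header + padded


def write_byte_plot(src_path: str, dst_path: str, width: int = 256) -> None:
    """Read src_path and write its byte plot to dst_path (for example 'out.pgm')."""
    with open(src_path, "rb") as f:
        data = f.read()
    with open(dst_path, "wb") as f:
        f.write(byte_plot_pgm(data, width))
```

In such a plot, text regions, zero padding and compressed data tend to have visibly different textures, and two files of the same type can be compared side by side.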
Look for the obvious first, e.g. magic numbers, a block structure, or ASCII text in the file. Anything that can be more or less clearly identified can be the entry ticket to more. Once a particular structure has been identified, look for in-file pointers to that data, e.g. places where the data is referenced from some other part of the file by an absolute or relative address. It is also very important to find out the byte order (little endian or big endian).
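One practical way to guess the byte order is to decode a candidate field both ways and see which value makes sense, for instance as an offset inside the file. The helper names below are hypothetical, and the sketch assumes the field is an unsigned 32-bit integer.

```python
import struct


def read_u32_both(data: bytes, offset: int):
    """Decode 4 bytes at offset as (little_endian, big_endian) unsigned integers."""
    (little,) = struct.unpack_from("<I", data, offset)
    (big,) = struct.unpack_from(">I", data, offset)
    return little, big


def plausible_pointer_orders(data: bytes, offset: int):
    """Return the byte orders whose decoded value is a valid offset into data."""
    little, big = read_u32_both(data, offset)
    orders = []
    if little < len(data):
        orders.append("little")
    if big < len(data):
        orders.append("big")
    return orders


if __name__ == "__main__":
    # A 512-byte sample whose first field is 00 00 01 00.
    sample = b"\x00\x00\x01\x00" + b"\xaa" * 508
    print(read_u32_both(sample, 0))
    print(plausible_pointer_orders(sample, 0))
```

A single field rarely settles the question; checking several suspected pointers or length fields and looking for a consistent answer is more reliable.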
Choosing the target
If you have access to the software that created the file, you can always create files with the contents of your choosing. This makes reverse engineering substantially easier. In cryptography terms, you are engaging in a chosen-plaintext attack.
Once you formulate a theory as to what some data in the file might mean, you can verify that theory by creating a manipulated file. Replace the data with some other data using a hex editor or a custom tool, then load the manipulated file into the original application. If the application loads the file and displays the intended change, the theory is probably correct. Sometimes it is not trivial to change the application and reload it because of the defense mechanisms that may be present. Some applications check the hash and signature of the code before running it.
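A custom tool for this kind of experiment can be very small. The sketch below overwrites bytes at a given offset in a copy of the file, leaving the original untouched; the function name and its parameters are illustrative assumptions.

```python
def patch_copy(src_path: str, dst_path: str, offset: int, new_bytes: bytes) -> bytes:
    """Write a copy of src_path to dst_path with new_bytes placed at offset.

    Returns the bytes that were overwritten, so the change can be documented.
    """
    with open(src_path, "rb") as f:
        data = bytearray(f.read())
    end = offset + len(new_bytes)
    if offset < 0 or end > len(data):
        raise ValueError("patch would fall outside the file")
    old = bytes(data[offset:end])
    data[offset:end] = new_bytes  # same length, so the file size is unchanged
    with open(dst_path, "wb") as f:
        f.write(data)
    return old
```

Overwriting in place keeps all other offsets stable, which matters if the format contains in-file pointers. If the application rejects the patched file, a checksum or length field covering the modified region is a likely explanation.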
Compression and scrambling
File formats which are either in part or completely compressed, encrypted or scrambled are among the toughest nuts to crack. Of course, compression is different from encryption, and typically done for a different purpose. However, the resulting file formats often look similar: A bunch of gibberish. This is the intended result when file format designers go for encryption, but it is also often a desired side effect when compression is applied.
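A rough way to quantify "gibberish" is the Shannon entropy of the byte values, measured in bits per byte. The sketch below is one common way to compute it; the function name is an assumption for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())


if __name__ == "__main__":
    print(shannon_entropy(b"\x00" * 1024))         # constant data
    print(shannon_entropy(bytes(range(256)) * 4))  # uniformly distributed bytes
```

Values close to 8 bits per byte suggest compressed, encrypted or otherwise random-looking data, while plain text and structured records usually score noticeably lower. Entropy alone cannot tell compression from encryption, and computing it per block rather than over the whole file helps locate the regions of interest.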
If checking a file with a hex editor or similar tool reveals only gibberish, with no easily identifiable text strings or patterns, the file may be compressed, encrypted or scrambled. The methods for reverse engineering these files are similar. There might, however, be a big difference from a legal point of view. Many countries have laws against circumventing copy protection, and encryption can be seen as a form of copy protection. See Reverse Engineering/Legal Aspects for some more hints regarding this, and seek qualified legal advice before attempting to reverse engineer an encrypted or otherwise protected file format. Similar issues might arise when a file format just uses scrambling. The format “owner” might argue that the scrambling serves as copy protection or encryption, and that circumventing it breaks some law. Again, seek qualified legal advice.
For this task, determine the filetype of all files in this package, and access their content. Take notes of your attempts and record what steps you took to find the file type. Some files are corrupted and may require some additional bytes. Some files are correlated. Some files are strange.
For information regarding the file formats, check https://github.com/corkami/pics
The following papers provide thorough insight into reverse engineering tasks, activities and scope, as well as some interesting examples.
- E. J. Chikofsky and J. H. Cross, “Reverse engineering and design recovery: a taxonomy,” in IEEE Software, vol. 7, no. 1, pp. 13-17, Jan. 1990, doi: 10.1109/52.43044.
- Jens de Hoog, Toon Bogaerts, Wim Casteels, Siegfried Mercelis, Peter Hellinckx, Online reverse engineering of CAN data, Internet of Things, Volume 11, 2020, 100232, ISSN 2542-6605, https://doi.org/10.1016/j.iot.2020.100232.
- Ziadia, M.; Fattahi, J.; Mejri, M.; Pricop, E. Smali+: An Operational Semantics for Low-Level Code Generated from Reverse Engineering Android Applications. Information 2020, 11, 130. https://doi.org/10.3390/info11030130
- G. Kim, M. Ma and I. Park, “A fast and flexible software for IC reverse engineering,” 2018 International Conference on Electronics, Information, and Communication (ICEIC), Honolulu, HI, USA, 2018, pp. 1-4, doi: 10.23919/ELINFOCOM.2018.8330639.
With a broader scope, but as mandatory reading for the curious mind, also check the great works at PoC||GTFO: https://pocorgtfo.hacke.rs/
Some content originated from https://en.wikibooks.org/wiki/Reverse_Engineering/File_Formats and is made available under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.