Reverse Engineering Proprietary Data Formats
Binary analysis, pattern recognition, and building parsers from scratch
You inherit a legacy system with no documentation. The data is locked in proprietary binary files. The vendor went out of business five years ago. The engineer who built it left no notes. You need to extract millions of records, and the only thing you have is a directory full of .dat files and a hex editor.
This is reverse engineering. Not the glamorous kind with disassemblers and debuggers, but the tedious, systematic work of staring at binary dumps until patterns emerge. It is detective work: forming hypotheses, testing them, discarding what does not fit, and building up a mental model of how the format works.
I have done this for ultrasonic inspection systems, industrial control software, and scientific instrumentation where the hardware outlived the company. The process is always the same: start with nothing, end with a working parser.
When You Need to Reverse Engineer
There are a few scenarios where reverse engineering becomes unavoidable. The most common is legacy migration. You have a system that has been running for twenty years, storing data in a format nobody remembers. The vendor is gone, the documentation is lost, and you need to migrate the data to a modern platform. You cannot afford to lose two decades of operational history.
Another scenario is integration. You need to read data from a third-party system that does not provide an API or export functionality. The files are there, the format is undocumented, and your only option is to figure it out yourself. This happens more often than you would think, especially in industrial and scientific contexts where software is built once and then forgotten. The hardware keeps running, the software keeps logging, and nobody thinks about interoperability until it is too late.
Sometimes it is a matter of recovery. The system crashed, backups are incomplete, and you have raw data files that need to be salvaged. The export tool is broken or unavailable. The only way forward is through the binary.
In all these cases, reverse engineering is not optional. It is the only path to getting the data out.
Tools for Binary Analysis
The most important tool is a hex editor. I use hexdump or xxd on the command line, and ImHex or 010 Editor when I need a GUI. A good hex editor lets you view data in multiple formats: hex, ASCII, decimal, binary. You can search for patterns, annotate sections, and export subsets of the file.
File comparison tools are essential. diff works for text, but for binary files you need something like vbindiff or Beyond Compare. Generate two files with slightly different inputs and compare them. The differences tell you which bytes correspond to which fields.
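You do not even need a dedicated tool to get started. A minimal sketch of that comparison in Python (the function name and the 32-difference cap are my own choices) reports where two dumps diverge, and the changed offsets map back to whichever input you varied between the two runs:

    def diff_files(path_a: str, path_b: str, limit: int = 32) -> None:
        """Print the first offsets at which two binary files differ."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            a, b = fa.read(), fb.read()
        shown = 0
        for offset, (x, y) in enumerate(zip(a, b)):
            if x != y:
                print(f"offset 0x{offset:06x}: {x:02x} -> {y:02x}")
                shown += 1
                if shown >= limit:
                    break
        if len(a) != len(b):
            print(f"lengths differ: {len(a)} vs {len(b)} bytes")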
Scripting is critical. Python with struct for parsing binary data, binascii for hex conversions, and numpy if you are dealing with arrays of numeric data. Write small scripts to test hypotheses. Can you parse the first 100 bytes? Good. Now try the first 1000.
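A typical throwaway probe looks something like this. It is a sketch, and the specific interpretations (four little-endian uint32 values, little-endian float32 values) are just first guesses to accept or reject:

    import struct
    import numpy as np

    def probe(path: str, n: int = 100) -> None:
        """Read the first n bytes and print a few candidate interpretations."""
        with open(path, "rb") as f:
            head = f.read(n)
        print("hex:    ", head[:16].hex(" "))
        print("uint32: ", struct.unpack_from("<4I", head, 0))
        print("float32:", np.frombuffer(head[:16], dtype="<f4"))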
Sometimes you need a disassembler or debugger. If you have access to the original software, run it under gdb or x64dbg and watch what it does when it reads or writes files. Set breakpoints on file I/O functions. This gives you ground truth about what the format is supposed to be.
Documentation of related formats helps. If you are reverse engineering a custom format, look for similar formats in the same domain. Scientific data formats like HDF5, NetCDF, and FITS have well-documented structures. Industrial formats like Modbus, Profibus, and CANbus have specifications. Even if your format is different, understanding how others solve similar problems gives you ideas about what to look for.
Pattern Recognition in Binary Files
The first thing you look for is a file signature, also called a magic number. Most binary formats start with a fixed sequence of bytes that identifies the file type. PNG files start with 89 50 4E 47. ZIP files start with 50 4B 03 04. PDF files start with 25 50 44 46, which is %PDF in ASCII.
If your format has a magic number, it is usually in the first 4-16 bytes. Open a few files and check if they all start the same way. If they do, you have your signature. If they do not, the format might not have one, or it might be variable.
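Checking this across a whole directory is quick to script. A minimal sketch, assuming the files share an extension you can glob for:

    import glob

    def check_signatures(pattern: str, length: int = 8) -> None:
        """Print the leading bytes of every matching file so a shared signature stands out."""
        for path in sorted(glob.glob(pattern)):
            with open(path, "rb") as f:
                print(f"{path}: {f.read(length).hex(' ')}")

    check_signatures("*.dat")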
Next, look for headers. Many formats have a header section that describes the rest of the file: version numbers, record counts, offsets to data sections, metadata. Headers are often at the beginning, but not always. Some formats put headers at the end or scatter them throughout.
Repeating patterns are a strong signal. If you see the same sequence of bytes every N bytes, you probably have an array of fixed-size records. Count the spacing. That is your record size. Look at what varies within each record. Those are your fields.
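One way to turn that observation into a number is to pick a byte sequence that repeats (a suspected field marker), collect its offsets, and take the most common gap between occurrences. This is a heuristic sketch, not a guaranteed method:

    import collections

    def guess_record_size(data: bytes, marker: bytes):
        """Guess a fixed record size from the spacing between occurrences of a repeating marker."""
        offsets, start = [], 0
        while (i := data.find(marker, start)) != -1:
            offsets.append(i)
            start = i + 1
        gaps = [b - a for a, b in zip(offsets, offsets[1:])]
        return collections.Counter(gaps).most_common(1)[0][0] if gaps else None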
Alignment matters. Many formats align data to 4-byte or 8-byte boundaries for performance. If you see padding bytes (usually 00 or FF), that is a clue about alignment. Structures are often padded so that the next field starts on an aligned boundary.
ASCII strings stand out in hex dumps. If you see readable text, it is probably metadata: filenames, timestamps, user comments. Strings are often null-terminated (00) or length-prefixed (first byte is the length). Finding strings tells you where human-readable data lives, which helps map out the structure.
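Pulling those strings out programmatically, together with their offsets, gives you landmarks to anchor the rest of the layout. A minimal sketch (the four-character minimum is arbitrary):

    import re

    def find_strings(data: bytes, min_len: int = 4):
        """Yield (offset, text) for every run of printable ASCII at least min_len bytes long."""
        for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
            yield m.start(), m.group().decode("ascii")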
Identifying Data Structures
Once you have a rough idea of the layout, start identifying specific data types. Integers are common. Look for sequences that change in predictable ways. If you have a series of files numbered 1, 2, 3, look for bytes that increment correspondingly. That is your file index or record ID.
Timestamps are often stored as Unix epoch seconds (32-bit or 64-bit integers) or as Windows FILETIME (64-bit, counting 100-nanosecond intervals since 1601). Epoch seconds are around 1.7 billion as of 2024, so if you see a 4-byte or 8-byte integer in that range, try interpreting it as a timestamp. Convert it and check whether the date makes sense: if it is close to the file's creation date, or plausible in the context of the data, you have found your timestamp field.
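A quick check, assuming a 4-byte little-endian value (swap the format string if the data turns out to be big-endian):

    import struct
    from datetime import datetime, timezone

    def try_timestamp(raw: bytes) -> None:
        """Interpret 4 bytes as little-endian Unix epoch seconds and print the result."""
        (value,) = struct.unpack("<I", raw)
        # Roughly 2000-01-01 through 2035-01-01 in epoch seconds.
        if 946_684_800 <= value <= 2_051_222_400:
            print(datetime.fromtimestamp(value, tz=timezone.utc).isoformat())
        else:
            print(f"{value} does not look like a recent epoch timestamp")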
Floating-point numbers are trickier. IEEE 754 single-precision (4 bytes) and double-precision (8 bytes) are standard, but some systems use custom formats. If you expect numeric data, try parsing bytes as floats. If you get reasonable values, you are on the right track. If you get nonsense, try different byte orders (little-endian vs big-endian) or different precisions.
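In practice that means brute-forcing the combinations on a chunk you suspect is numeric. A sketch for an 8-byte slice:

    import struct

    def try_float_interpretations(raw: bytes) -> None:
        """Print an 8-byte chunk as float32 pairs and as float64, in both byte orders."""
        print("float32 LE:", struct.unpack("<2f", raw))
        print("float32 BE:", struct.unpack(">2f", raw))
        print("float64 LE:", struct.unpack("<d", raw))
        print("float64 BE:", struct.unpack(">d", raw))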
Arrays are sequences of the same type. If you have sensor data, you might see arrays of floats representing measurements over time. Count the elements. Multiply by the size of each element. Does it match the chunk size you identified earlier? If so, you have found your data payload.
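Once the element type is pinned down, parsing the payload is a single call. Here I assume little-endian float32, which you would swap for whatever your format actually uses:

    import numpy as np

    def parse_payload(chunk: bytes):
        """Interpret a data chunk as an array of little-endian float32 measurements."""
        assert len(chunk) % 4 == 0, "chunk is not a whole number of 4-byte elements"
        return np.frombuffer(chunk, dtype="<f4")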
Nested structures are harder. A record might contain a fixed header followed by a variable-length payload. Look for length fields: a 2-byte or 4-byte integer that tells you how many bytes follow. Parse that many bytes, then move to the next record. If the math works out, you have the right structure.
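A walking loop like the following is the usual test of that hypothesis. The 4-byte little-endian length prefix and the header size here are assumptions you would adjust to your format:

    import struct

    def iter_records(data: bytes, header_size: int = 16):
        """Yield (offset, payload) for records stored as a 4-byte length followed by that many bytes."""
        pos = header_size
        while pos + 4 <= len(data):
            (length,) = struct.unpack_from("<I", data, pos)
            payload = data[pos + 4 : pos + 4 + length]
            if len(payload) < length:
                raise ValueError(f"truncated record at offset {pos}")
            yield pos, payload
            pos += 4 + length
        # If the loop ends exactly at len(data), the structure hypothesis holds.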
Building Parsers Incrementally
Do not try to parse the entire file at once. Start small. Write a script that reads the first 16 bytes and interprets them as header fields. Print the results. Does it make sense? Good. Now read the next 16 bytes.
Validate as you go. If you think a field is a record count, parse that many records and see if you hit the end of the file. If you overshoot or undershoot, your hypothesis is wrong. Adjust and try again.
Use assertions. If a field should always be a specific value, assert it. If you are wrong, the script will fail loudly, and you will know exactly where your model broke.
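Put together, the first iteration of a parser is often nothing more than this. The 16-byte header layout, the magic value, and the 256-byte record size below are placeholders for whatever your own hypotheses are:

    import struct

    MAGIC = b"DAT1"      # hypothetical signature
    RECORD_SIZE = 256    # hypothetical fixed record size

    def parse_header(data: bytes):
        """Interpret the first 16 bytes as magic, version, record count, and data offset."""
        magic, version, record_count, data_offset = struct.unpack_from("<4sIII", data, 0)
        assert magic == MAGIC, f"unexpected signature {magic!r}"
        assert version == 1, f"unhandled version {version}"
        # If the record count is right, fixed-size records should end exactly at end-of-file.
        assert data_offset + record_count * RECORD_SIZE == len(data), "count or size hypothesis is wrong"
        return record_count, data_offset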
Test on multiple files. A parser that works on one file might fail on another. Edge cases reveal bugs in your understanding. Files with zero records, files with maximum records, files with unusual metadata. Test them all.
Write documentation as you go. Not for others, but for yourself. When you come back to this in six months, you will not remember why you decided byte 12 was a checksum. Write it down. Annotate your code. Keep a notebook with hex dumps and observations.
Testing Hypotheses
Reverse engineering is hypothesis-driven. You observe a pattern, form a theory about what it means, and test it. If the test fails, revise the theory.
Generate test data if you can. If you still have access to the software that creates these files, use it to generate controlled inputs. Create a file with one record, then two, then ten. Compare them. What changes? That tells you where the record count is stored and how records are laid out.
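Even the file sizes alone tell you something. With two controlled files you can back out the record size arithmetically (the numbers below are invented for illustration):

    def infer_record_size(size_a: int, count_a: int, size_b: int, count_b: int) -> float:
        """Infer a fixed record size from two files with known record counts,
        assuming a constant-size header plus fixed-size records."""
        return (size_b - size_a) / (count_b - count_a)

    # A hypothetical 1-record file of 268 bytes and a 3-record file of 780 bytes
    # give (780 - 268) / (3 - 1) = 256 bytes per record, leaving a 12-byte header.
    print(infer_record_size(268, 1, 780, 3))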
Create extreme cases. Empty files. Files with maximum values. Files with special characters in strings. See what breaks. See what stays consistent.
Cross-validate with external knowledge. If the file is supposed to contain temperature readings from a sensor, and you are getting values like 50000, you are probably parsing it wrong. Physical constraints help you sanity-check your interpretations.
Legal Considerations
Reverse engineering is legal in many jurisdictions under fair use or interoperability exceptions, but it depends on the context. In the United States, the DMCA allows reverse engineering for interoperability. In the EU, the Software Directive has similar provisions.
That said, check your contracts. If you signed an NDA or a license agreement that prohibits reverse engineering, you are bound by that. Even if the law allows it, the contract might not.
If the format is protected by patents or trade secrets, you are in murkier territory. Extracting data from your own files for your own use is generally safe. Distributing a tool that reads a proprietary format might not be.
When in doubt, consult a lawyer. The last thing you want is a lawsuit over a data migration project.
A Real-World Example
A few years ago, I worked with a client who had a legacy industrial inspection system. The hardware was still running, but the vendor had gone out of business. They needed to extract historical inspection data to migrate to a new system. The data was in .insp files, binary format, no documentation.
I started with a hex dump of a few files. The first 8 bytes were always 49 4E 53 50 00 01 00 00. That is INSP in ASCII followed by what looked like a version number. Good. I had a file signature and a version field.
The next 4 bytes varied between files. I compared files with different numbers of inspections and noticed the value matched the inspection count. That was my record count field, stored as a 32-bit little-endian integer.
After the header, I saw repeating blocks of 256 bytes. Each block started with a recognizable pattern: a timestamp (Unix epoch, 4 bytes), followed by ASCII strings (inspector name, part ID), followed by numeric data (measurements). The last 4 bytes of each block were a checksum, which I verified by summing the preceding bytes modulo 2^32.
I wrote a Python script to parse the format: read the header, extract the record count, iterate over records, unpack each field using struct.unpack, validate checksums, and export to CSV. The first version failed on files with non-ASCII characters in the inspector names. I switched to UTF-8 decoding with error handling. The second version worked on all test files.
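For illustration, the shape of that script was roughly the following. The signature, record count, record size, and checksum handling match what I described above, but the field offsets inside each record (string lengths, the number of float measurements) are stand-ins rather than the client's actual layout:

    import csv
    import struct

    RECORD_SIZE = 256
    HEADER_SIZE = 12  # 8-byte signature + 4-byte record count

    def parse_insp(path: str, out_csv: str) -> None:
        with open(path, "rb") as f:
            data = f.read()
        assert data[:8] == b"INSP\x00\x01\x00\x00", "not a version-1 INSP file"
        (count,) = struct.unpack_from("<I", data, 8)
        rows = []
        for i in range(count):
            rec = data[HEADER_SIZE + i * RECORD_SIZE : HEADER_SIZE + (i + 1) * RECORD_SIZE]
            # Last 4 bytes are a checksum: sum of the preceding bytes modulo 2**32.
            (stored,) = struct.unpack_from("<I", rec, RECORD_SIZE - 4)
            assert sum(rec[:-4]) % 2**32 == stored, f"bad checksum in record {i}"
            (timestamp,) = struct.unpack_from("<I", rec, 0)
            # Illustrative layout: 32-byte inspector name, 16-byte part ID, 50 float32 measurements.
            inspector = rec[4:36].split(b"\x00")[0].decode("utf-8", errors="replace")
            part_id = rec[36:52].split(b"\x00")[0].decode("utf-8", errors="replace")
            measurements = struct.unpack_from("<50f", rec, 52)
            rows.append([timestamp, inspector, part_id, *measurements])
        with open(out_csv, "w", newline="") as f:
            csv.writer(f).writerows(rows)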
Total time: about two days. The client had ten years of data across thousands of files. The parser processed all of them without errors. Data migration complete.
Final Thoughts
Reverse engineering proprietary formats is not glamorous, but it is essential when documentation does not exist. The process is systematic: identify patterns, form hypotheses, test them, build incrementally, validate constantly. It requires patience, attention to detail, and a willingness to be wrong repeatedly until you are right.
The tools are simple: hex editors, scripting languages, and a methodical approach. The skills transfer across domains. Once you have reversed one binary format, the next one is easier. Patterns repeat. Structures recur. You start to recognize common design choices.
If you find yourself in this situation, do not panic. Start with one file. Look for the signature. Find the header. Identify one field. Then another. Build up slowly. Document everything. Test constantly. You will get there. And when you do, write down what you learned. The next person to inherit this mess will thank you. Future you will thank you. Documentation is the difference between archaeology and engineering.