ExpressEdU

Reputation: 11

How to find out and undo encoding/data structure of log files?

I have to analyze the log files of a piece of software that is no longer supported (the software provider no longer exists) in order to track down an issue. Unfortunately, the log data is not stored as plain (human-readable) text, and there is no documentation on how to interpret the log files. That is why I would like to develop an algorithm to make the data human-readable - preferably in Java or Python - but as a first step I need to understand the data structure.

Please find below the information I have collected so far and what I have already tried:

Characteristics observed by browsing the log files in Notepad++ and HxD:

  1. The application is running in a Windows environment.
  2. The files are stored in a directory LogFiles and have the file extension .log. At each application start, a new log file is created, which is used until the application terminates. The logging process is not the same as the application process but runs in parallel. It is a native process (not managed).
  3. There is a 16 byte header at the beginning of each log file: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 (hex extracted with HxD)
  4. The header is followed by a string consisting solely of characters from the Base64 alphabet (a-z, A-Z, 0-9, +, /, =)
  5. While the application is running, new data is appended to the existing string from time to time, many characters at once. In several cases the data appeared to be written in chunks of exactly 4096 characters each, but I could not reproduce this behavior reliably and also have examples to which this rule does not seem to apply.
  6. The first chunk of data starts with encoded y+O+srwgzR / decoded hex CB E3 BE B2 BC 20 CD 12.
  7. In many of the files, the character = is used. For all these files, = only occurs right before a string of 32 characters at the end of the file. These last 16 characters are also from the Base64 alphabet.
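The observations above can be turned into a small parsing sketch. This is only an assumption-laden starting point: `split_log` and `decode_body` are hypothetical helper names, the 16-byte header length comes from observation 3, and the split at the last `=` relies on observation 7. For files without any `=`, this sketch cannot tell where the body ends and the trailing characters begin.

```python
import base64

HEADER_LEN = 16  # header size observed in HxD (assumption: same for all files)

def split_log(raw: bytes):
    """Split a raw log file into (header, base64_body, trailer).

    Assumes '=' only appears as Base64 padding right before the trailer.
    """
    header, text = raw[:HEADER_LEN], raw[HEADER_LEN:]
    last_pad = text.rfind(b"=")
    if last_pad == -1:
        # No padding found: body and trailer cannot be told apart here.
        return header, text, b""
    return header, text[:last_pad + 1], text[last_pad + 1:]

def decode_body(body: bytes) -> bytes:
    """Decode the Base64 body, dropping an incomplete final 4-character group."""
    usable = len(body) - len(body) % 4
    # validate=True raises binascii.Error on characters outside the alphabet.
    return base64.b64decode(body[:usable], validate=True)
```

As a side note on observation 5: 4096 Base64 characters correspond to 3072 decoded bytes, so if the chunk size is meaningful, the writer may be buffering on the encoded side rather than on the raw data.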

What I have tried so far:

  1. My first guess was to decode the Base64 string, which I tried in the following ways:

After decoding, I inspected the result, but in every case it was a non-human-readable string of seemingly random characters.
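One way to put a number on "seemingly random" is to measure the byte entropy and the share of printable characters in the decoded data. The sketch below uses two hypothetical helpers, `shannon_entropy` and `printable_fraction`. As a rough rule of thumb, plain-text logs usually score well below 8 bits per byte, while compressed or encrypted data sits close to 8. Note that this cannot tell compression and encryption apart; it only shows whether the bytes are close to uniformly distributed.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def printable_fraction(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII or common whitespace."""
    if not data:
        return 0.0
    ok = sum(1 for b in data if 32 <= b < 127 or b in (9, 10, 13))
    return ok / len(data)
```

Running both functions on the decoded data in windows (for example per 3072 decoded bytes) might also show whether the structure changes at chunk boundaries.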

  1. My second idea was that the encoded data could have been compressed. Therefore, after decoding with the Base64 decoder, I tried the following decompression methods:

Both approaches failed with an exception (e.g., for zlib: Incorrect header check), which suggests that the decoded data is not compressed, at least not in these formats. Even though I would not expect log file data to be encrypted, I have no idea how to rule that out.
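A caveat on the zlib error: "Incorrect header check" only says that the data does not start with a zlib header. A headerless (raw) deflate stream, or a stream that starts a few bytes into the data, would fail in the same way. The following sketch, with a hypothetical helper `try_inflate`, tries the three container types that Python's `zlib` module understands at several start offsets. The `max_offset` and `min_output` values are arbitrary assumptions, and raw deflate can occasionally "succeed" on random bytes, so any hit still needs to be checked by eye.

```python
import zlib

# wbits values accepted by zlib.decompressobj:
#   15 -> zlib stream (2-byte header), 31 -> gzip, -15 -> raw deflate (no header)
CONTAINERS = {"zlib": 15, "gzip": 31, "raw deflate": -15}

def try_inflate(data: bytes, max_offset: int = 64, min_output: int = 32):
    """Return (offset, container, output) for every attempt that inflates."""
    hits = []
    for offset in range(min(max_offset, len(data))):
        for name, wbits in CONTAINERS.items():
            try:
                out = zlib.decompressobj(wbits).decompress(data[offset:])
            except zlib.error:
                continue  # not a valid stream of this type at this offset
            if len(out) >= min_output:
                hits.append((offset, name, out))
    return hits
```

If nothing inflates at any offset and the entropy of the decoded data is close to 8 bits per byte, encryption or a less common compression format becomes the more likely explanation.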

  1. Another idea was to inspect the inter-process communication between the application process and the logging process, hoping to find an indication of what happens to the log messages before they are persisted in the log file. My steps were based on these threads: 1, 2. Netstat did not reveal any ports for either process. For STraceNT and accesschk, I did not know how to handle all the data retrieved in order to draw conclusions about shared memory use. I have to admit that I am not used to these tools and would need more guidance to use them, if they make sense for my problem at all.
  2. As suggested in the first comment, I compared the hex representation of the start of the decoded string with the list of known file signatures/magic bytes but found no match.
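The signature comparison can also be automated and extended beyond offset 0. This is a minimal sketch with a hypothetical helper `find_signatures` and a deliberately short table of well-known magic bytes; it is not a complete signature list. Short signatures such as the two-byte gzip and zlib markers will match by chance in random-looking data, so a hit is only a hint.

```python
# A few well-known magic byte sequences (far from exhaustive).
SIGNATURES = {
    "gzip": bytes.fromhex("1f8b"),
    "zlib (default level)": bytes.fromhex("789c"),
    "zip": b"PK\x03\x04",
    "bzip2": b"BZh",
    "xz": bytes.fromhex("fd377a585a00"),
    "7z": bytes.fromhex("377abcaf271c"),
    "zstd": bytes.fromhex("28b52ffd"),
    "lz4 frame": bytes.fromhex("04224d18"),
}

def find_signatures(data: bytes, window: int = 4096):
    """Return sorted (offset, name) pairs found in the first `window` bytes."""
    head = data[:window]
    hits = []
    for name, magic in SIGNATURES.items():
        pos = head.find(magic)
        while pos != -1:
            hits.append((pos, name))
            pos = head.find(magic, pos + 1)
    return sorted(hits)
```

For the decoded start bytes quoted above (CB E3 BE B2 BC 20 CD 12), this table produces no match, which is consistent with the manual comparison.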

I'm thankful for any ideas on how to develop my assumptions further and on how to make the log data human-readable. If you need any more details, let me know!

Upvotes: 0

Views: 430

Answers (0)
