Nini Michaels

Reputation: 340

How to identify text file format by its structure?

I have a few text file types with data such as product info, stock, supplier info etc. and they are all structured differently. There is no other identifier for the type except the structure itself (there are no headers, no filename convention etc.)

Some examples of these files:

(products and stocks)

2326 | 542212 | Bananas    | 00023 | 1 | pack
2326 | 297875 | Apples     | 00085 | 1 | bag
2326 | 028371 | Pineapple  | 00007 | 1 | can
...

(products and prices)

12556  Meat, pork        0098.57  
58521  Potatoes, mashed  0005.20     
43663  Chicken wings     0009.99  
...

(products and suppliers - here N is the separator)

03038N92388N9883929
28338N82367N2837912
23002N23829N9339211
...

(product information - multiple types of rows)

VIN|Mom & Pops|78 Haley str. 
PIN|BLT Bagel|5.79|FRESH
LID|0239382|283746
... (repeats this type of info for different products)

And several others. I want to make a function that identifies which of these types a given file is, using nothing but the content. Google has been no help, in part because I don't know what search term to use. Needless to say, "identify file type by content/structure" is of no help, it just gives me results on how to find jpgs, pdfs etc. It would be helpful if I saw some code that others wrote to deal with a similar problem.

What I have thought so far is to make a FileIdentifier class for each type, then when given a file try to parse it and if it doesn't work move on to the next type. But that seems error prone to me, and I would have to hardcode a lot of information. Also, what happens if another format comes along and is very similar to any of the existing ones, but has different information in the columns?

Upvotes: 1

Views: 1324

Answers (1)

uliwitness

Reputation: 8803

There really is no one-size-fits-all answer unless you can limit the set of file formats that can occur. Detection will only ever be a heuristic, unless you can get whoever designs these formats to give each one a unique identifier, or you ask the user what format the file is.

That said, there are things you can do to improve your results, such as trying every candidate format and picking the best fit rather than stopping at the first match.

The general approach will always be the same: make each decode attempt as strict as possible, using as much knowledge as you have about not just the syntax but also the semantics. For example, if you know a field can only contain one of five values, or numbers in a certain range, use that knowledge for detection. Also, don't just call strtol() on a component and accept the result; check that it parsed the entire string. If it didn't, either fail right there, or maintain a "confidence" value and lower it whenever a file has any possibly invalid parts.
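To make the strtol() point concrete, here is a minimal sketch in C. The helper name parse_long_strict and its min/max parameters are my own invention for illustration, not anything from a library, and it assumes a field counts as valid when it is a base-10 integer with optional surrounding whitespace (as in the padded columns of the question's samples).

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse `text` as a base-10 integer in [min, max].
 * Returns true only if the whole string (apart from surrounding
 * whitespace) is a number and that number lies within the range. */
bool parse_long_strict(const char *text, long min, long max, long *out)
{
    char *end = NULL;

    errno = 0;
    long value = strtol(text, &end, 10);

    if (end == text)        /* no digits were consumed at all */
        return false;
    if (errno == ERANGE)    /* value does not fit in a long */
        return false;

    /* Allow trailing padding, as in "00023 ", but nothing else. */
    while (*end != '\0' && isspace((unsigned char)*end))
        end++;
    if (*end != '\0')       /* leftover characters, e.g. "12abc" */
        return false;

    if (value < min || value > max)   /* semantic check on the range */
        return false;

    *out = value;
    return true;
}
```

A plain strtol("12abc", NULL, 10) would happily return 12; the end-pointer check is what turns that into a rejection. Whether a rejection should fail the whole file or just lower its confidence is up to you.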

Then, at the end, go through all parse results and pick the one with the highest confidence percentage. If no format clearly wins, you can ask the user to pick between the most likely formats.
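A rough sketch of that selection step in C might look like the following. Everything named here (the format enum, the two line matchers, identify_format) is hypothetical and only modeled on two of the samples in the question. It assumes the file has already been split into lines with the newline stripped, and it uses the share of matching lines as the confidence, which is just one possible scoring rule.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical format names, modeled on two samples from the question. */
enum format { FORMAT_UNKNOWN, FORMAT_STOCK, FORMAT_SUPPLIER };

typedef int (*line_matcher)(const char *line);

/* "2326 | 542212 | Bananas | 00023 | 1 | pack":
 * a leading digit and exactly five '|' separators. */
static int is_stock_line(const char *line)
{
    int pipes = 0;
    for (const char *p = line; *p != '\0'; p++)
        if (*p == '|')
            pipes++;
    return pipes == 5 && isdigit((unsigned char)line[0]);
}

/* "03038N92388N9883929": digits with exactly two single 'N'
 * separators, each of them between digits. */
static int is_supplier_line(const char *line)
{
    size_t len = strlen(line);
    int separators = 0;

    if (len == 0 || !isdigit((unsigned char)line[0])
                 || !isdigit((unsigned char)line[len - 1]))
        return 0;
    for (size_t i = 0; i < len; i++) {
        if (line[i] == 'N') {
            if (line[i + 1] == 'N')   /* empty field between separators */
                return 0;
            separators++;
        } else if (!isdigit((unsigned char)line[i])) {
            return 0;
        }
    }
    return separators == 2;
}

/* Confidence = percentage (0..100) of lines accepted by the matcher. */
static int confidence(const char *const lines[], size_t count,
                      line_matcher matches)
{
    size_t hits = 0;
    if (count == 0)
        return 0;
    for (size_t i = 0; i < count; i++)
        if (matches(lines[i]))
            hits++;
    return (int)(hits * 100 / count);
}

/* Score every candidate (not just the first that matches) and return
 * the best one; the winning score is stored in *best_confidence. */
enum format identify_format(const char *const lines[], size_t count,
                            int *best_confidence)
{
    static const struct { enum format format; line_matcher matches; }
    candidates[] = {
        { FORMAT_STOCK,    is_stock_line },
        { FORMAT_SUPPLIER, is_supplier_line },
    };
    enum format best = FORMAT_UNKNOWN;
    int best_score = 0;

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        int score = confidence(lines, count, candidates[i].matches);
        if (score > best_score) {
            best_score = score;
            best = candidates[i].format;
        }
    }
    if (best_confidence != NULL)
        *best_confidence = best_score;
    return best;
}
```

The caller gets the score back as well as the format, so it can decide on its own threshold below which it would rather ask the user than trust the guess. Adding a new format means adding one matcher and one table entry.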

PS - The file command-line tool on Unix systems does something similar: it looks at the start of a file and identifies common byte sequences that indicate certain file formats.

Upvotes: 1
