a.atlam

Reputation: 742

What is the fastest file format / way to parse a large data file?

So I am working on a GAE project. I need to look up cities, country names, and country codes for sign-ups, LBS, etc.

Now I figured that putting all the information in the Datastore is rather wasteful, as it will be queried quite frequently and would eat into my Datastore quota for no reason, especially since these lists aren't going to change, so it's pointless to put them in the Datastore.

Now that leaves me with a few options:

API - No budget for paid services, and the free ones are not exactly reliable.

Upload a parseable file - The favorable option, as I like the certainty that the data will always be there. So I got the files I need from GeoNames (the link has source files for all countries in case someone needs them). The file for each country is a regular UTF-8 tab-delimited file, which is great.

However, now that I have the option to choose how to format and access the data, the question is:

What is the best way to format and systematically retrieve data from a static file in a Java servlet container?

By best I mean the fastest and least resource-hungry method.

Valid options:

  1. TXT file, tab-delimited
  2. Static XML file
  3. Java class with tons of enums

I know that importing the country files as Java enums and looping through their values will be very fast, but do you think this will push memory use beyond reasonable limits? On the other hand, if I read a text file line by line, every lookup has to scan a few thousand lines until it finds the required record: no memory issues, but incredibly slow. I have had some experience parsing an Excel file in a Java servlet, and it took something like 20 seconds just to parse 250 records; at scale the response WILL time out (no doubt about it). Is XML anything like Excel in that respect?

Thank you very much, guys! Please share your opinions; anything and everything is appreciated!

Upvotes: 1

Views: 2790

Answers (2)

Michael Kay

Reputation: 163322

Hold the data in source form as XML. At start of day, or when it changes, read it into memory: that's the only time you incur the parsing cost. There are then two main options:

(a) your in-memory form is still an XML tree, and you use XPath/XQuery to query it.

(b) your in-memory form is something like a Java HashMap.

If the data is very simple then (b) is probably best, but it only allows you to do one kind of query, which is hard-coded. If the data is more complex or you have a variety of possible queries, then (a) is more flexible.
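Below is a minimal sketch of both options, assuming a hypothetical countries.xml whose root <countries> element holds <country code="..." name="..."/> entries (the file name and structure are illustrative, not taken from the question):

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class CountryLookup {

        // Parse the XML once, at startup; keep the tree for ad-hoc XPath queries.
        private final Document doc;
        // Option (b): a hard-coded index for the one query you run most.
        private final Map<String, String> nameByCode = new HashMap<>();

        public CountryLookup(File xmlFile) throws Exception {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xmlFile);
            NodeList countries = doc.getElementsByTagName("country");
            for (int i = 0; i < countries.getLength(); i++) {
                Element c = (Element) countries.item(i);
                nameByCode.put(c.getAttribute("code"), c.getAttribute("name"));
            }
        }

        // Option (a): flexible - evaluate any XPath against the in-memory tree.
        public String nameByCodeXPath(String code) throws Exception {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (String) xpath.evaluate(
                    "/countries/country[@code='" + code + "']/@name",
                    doc, XPathConstants.STRING);
        }

        // Option (b): constant-time lookup from the pre-built map.
        public String nameByCodeMap(String code) {
            return nameByCode.get(code);
        }
    }

Option (b) answers only the code-to-name question, but in constant time; option (a) can answer any query you can phrase in XPath.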

Upvotes: 1

icza

Reputation: 417682

The easiest and fastest way would be to have the file as a static web resource under the WEB-INF folder, and on application startup have a context listener load the file into memory.

In memory, it should be a Map, keyed by whatever you want to search by. This gives you essentially constant access time.
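A minimal sketch of such a startup listener, assuming a Servlet 3.0+ container and a hypothetical tab-delimited file /WEB-INF/countries.txt with one "code<TAB>name" record per line (the file name and layout are assumptions; on Servlet 2.5 you would register the listener in web.xml instead of using @WebListener):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;
    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import javax.servlet.annotation.WebListener;

    @WebListener
    public class CountryDataLoader implements ServletContextListener {

        // Key the map is stored under in the ServletContext.
        public static final String ATTR = "countriesByCode";

        @Override
        public void contextInitialized(ServletContextEvent sce) {
            Map<String, String> byCode = new HashMap<>();
            // The file lives under WEB-INF, so it is not directly downloadable.
            try (InputStream in = sce.getServletContext()
                    .getResourceAsStream("/WEB-INF/countries.txt");
                 BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split("\t");  // tab-delimited
                    byCode.put(cols[0], cols[1]);      // code -> name
                }
            } catch (Exception e) {
                throw new RuntimeException("Failed to load country data", e);
            }
            sce.getServletContext().setAttribute(ATTR, byCode);
        }

        @Override
        public void contextDestroyed(ServletContextEvent sce) { }
    }

Any servlet can then fetch the map from the context and look records up in constant time:

    Map<String, String> byCode = (Map<String, String>)
            getServletContext().getAttribute(CountryDataLoader.ATTR);
    String name = byCode.get("EG");  // constant-time lookup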

Memory consumption only matters if the data set is really big. A hundred thousand records, for example, are not worth optimizing away if you will access them many times.

The static file should be in plain text or CSV format; these are read and parsed most efficiently. There is no need for XML formatting, as parsing it would be slower.

If the list is really big, you can break it up into multiple smaller files and parse each one only when it is required. A reasonable, easy partitioning would be by country, but any other partitioning would work (for example, by the first few characters of the name).
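A sketch of such lazy, per-partition loading, assuming one hypothetical file per country under /WEB-INF/cities/ (e.g. /WEB-INF/cities/EG.txt) where the second tab-delimited column holds the city name; the paths and column layout are illustrative:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import javax.servlet.ServletContext;

    public class CityRepository {

        private final ServletContext ctx;
        // Cache of already-parsed partitions: country code -> city names.
        private final ConcurrentMap<String, List<String>> cache =
                new ConcurrentHashMap<>();

        public CityRepository(ServletContext ctx) {
            this.ctx = ctx;
        }

        public List<String> citiesOf(String countryCode) {
            // computeIfAbsent (Java 8+) parses each partition at most once.
            return cache.computeIfAbsent(countryCode, this::loadPartition);
        }

        private List<String> loadPartition(String countryCode) {
            List<String> cities = new ArrayList<>();
            String path = "/WEB-INF/cities/" + countryCode + ".txt";
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    ctx.getResourceAsStream(path), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    cities.add(line.split("\t")[1]);  // assumed: column 1 is the city name
                }
            } catch (Exception e) {
                throw new RuntimeException("Failed to load " + path, e);
            }
            return cities;
        }
    }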

You could also consider building this Map in memory once, serializing it to a binary file, and including that binary file as a static resource. That way you would only have to deserialize the Map, with no need to parse a text file and build the objects yourself.
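A sketch of that approach using plain Java serialization (HashMap implements Serializable, so no custom code is needed; the save step would run once, offline):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.HashMap;

    public class MapSerialization {

        // Run once, offline: persist the finished map after parsing the text file.
        static void save(HashMap<String, String> map, String path) throws Exception {
            try (ObjectOutputStream out =
                    new ObjectOutputStream(new FileOutputStream(path))) {
                out.writeObject(map);  // HashMap is Serializable out of the box
            }
        }

        // At application startup: restore the map without any text parsing.
        @SuppressWarnings("unchecked")
        static HashMap<String, String> load(String path) throws Exception {
            try (ObjectInputStream in =
                    new ObjectInputStream(new FileInputStream(path))) {
                return (HashMap<String, String>) in.readObject();
            }
        }
    }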

Improvements on the data file

An alternative to having the static resource as a text/CSV file or a serialized Map data file would be a binary data file in a custom format of your own design.

Using DataOutputStream you can write data to a binary file in a very compact and efficient way. Then you can use DataInputStream to load the data back from this custom file.

This solution has the advantage that the file can be much smaller (compared to plain text / CSV / serialized Map), and loading it will be much faster, because DataInputStream does not parse numbers from text; it reads the bytes of a number directly.
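A sketch of such a custom format, assuming a simple layout of a record count followed by code/name string pairs (the layout is an illustration, not a prescribed format):

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class BinaryCountryFile {

        // Write: record count first, then each record as two length-prefixed strings.
        static void write(Map<String, String> nameByCode, String path) throws Exception {
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(path)))) {
                out.writeInt(nameByCode.size());
                for (Map.Entry<String, String> e : nameByCode.entrySet()) {
                    out.writeUTF(e.getKey());    // country code
                    out.writeUTF(e.getValue());  // country name
                }
            }
        }

        // Read: no text parsing - counts and strings come straight off the stream.
        static Map<String, String> read(String path) throws Exception {
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(path)))) {
                int n = in.readInt();
                Map<String, String> nameByCode = new LinkedHashMap<>(n * 2);
                for (int i = 0; i < n; i++) {
                    nameByCode.put(in.readUTF(), in.readUTF());
                }
                return nameByCode;
            }
        }
    }

Since writeUTF length-prefixes each string, no delimiters or escaping are needed in the file.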

Upvotes: 5
