krisGoks

Reputation: 23

Need Java advice for handling billions of records in unindexed files

I have 4 big .tab files, one of which is 6GB and the others 10GB each. The 6GB file contains information about animals of a certain region, and the other 3 files contain other vital information related to each animal present in the 6GB file.

I am required to write a program that produces small data sets from these big files based on some user inputs.

I read the animal data from the 6GB file line by line, and if a record passes certain criteria it is stored in an ArrayList; otherwise it is omitted.

Then, for each animal in the ArrayList, I need to go through the other 3 files over and over again to filter it further and finally produce the small data set the user needs. As of now it takes about 7 hours of run time to fetch a small data set of 1500 animal records. The main culprit is that for each animal I select into the ArrayList, I have to look up the other 3 files multiple times at different steps of the data extraction process.

I have already written code in Java for this, but the program is incredibly slow. I use buffered readers to access these files, but I am looking for other tools and techniques I can use in Java to turn this into an efficient, usable system.
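In simplified form, the filtering loop looks roughly like this (the column positions and the filter criterion are placeholders, not my real logic):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class AnimalFilter {
        public static void main(String[] args) throws IOException {
            List<String[]> selected = new ArrayList<>();
            // Read the 6GB animals file line by line with a buffered reader
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("animals.tab"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split("\t");
                    // Placeholder criterion: keep animals from a user-chosen region
                    if (cols.length > 2 && cols[2].equals("someRegion")) {
                        selected.add(cols);
                    }
                }
            }
            // Later, for each selected animal, the other three files are
            // scanned again and again, which is where most of the 7 hours go.
            System.out.println("Selected " + selected.size() + " animals");
        }
    }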

I have considered pushing the data into a SQL or NoSQL database, but I need expert advice to guide me in the right direction before I do something to improve the performance.

Thanks in advance

Upvotes: 1

Views: 122

Answers (1)

Tschallacka

Reputation: 28742

Well, I'd go with SQLite if you need portability, or another database engine otherwise. That way you can dissect the data into bite-size related portions.

You need to "digest" the data first so it becomes searchable and properly linked. You'd make a table of animal names with an id, so if a user searches "cheetah" you can use the id of the cheetah row to link to the other tables of information.

A cheetah belongs to the continent Africa and countries x, y, z; it is a type of cat, a type of predator, a type of carnivore, etc., and all of those things should be linked together. I believe you'd reduce the size of the database significantly just by grouping and categorising a lot of the duplicated data and linking it.
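As a rough idea (the table and column names here are just examples, adapt them to your actual data), the digested schema could look something like this using the sqlite-jdbc driver:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SchemaSetup {
        public static void main(String[] args) throws SQLException {
            // Requires the sqlite-jdbc driver on the classpath
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:animals.db");
                 Statement st = conn.createStatement()) {
                // One row per animal; repeated attributes live in their own tables
                st.execute("CREATE TABLE IF NOT EXISTS animal (" +
                           "id INTEGER PRIMARY KEY, name TEXT, continent_id INTEGER, type_id INTEGER)");
                st.execute("CREATE TABLE IF NOT EXISTS continent (id INTEGER PRIMARY KEY, name TEXT)");
                st.execute("CREATE TABLE IF NOT EXISTS animal_type (id INTEGER PRIMARY KEY, name TEXT)");
                // Index the columns you search on so lookups don't scan the whole table
                st.execute("CREATE INDEX IF NOT EXISTS idx_animal_name ON animal(name)");
            }
        }
    }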

The hard work is identifying the duplicate data in the 6 GB file and grouping and categorising it. But when you are done, you'll have lightning-fast searches compared to what you have now. Do seek help from someone who has designed their fair share of databases, though. You could ask for tips on https://dba.stackexchange.com/ about which database type to choose and how to set it up.
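Once the schema is in place, you'd do a one-time import of the .tab files with batched inserts, and from then on a user request is an indexed query instead of a file scan. Very rough sketch, again with placeholder column positions and ids:

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ImportAndQuery {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:animals.db")) {
                conn.setAutoCommit(false); // commit in batches, much faster for bulk loads
                try (BufferedReader reader = Files.newBufferedReader(Paths.get("animals.tab"));
                     PreparedStatement insert = conn.prepareStatement(
                             "INSERT INTO animal(name, continent_id, type_id) VALUES (?, ?, ?)")) {
                    String line;
                    int count = 0;
                    while ((line = reader.readLine()) != null) {
                        String[] cols = line.split("\t");
                        insert.setString(1, cols[0]); // assumed: column 0 is the animal name
                        insert.setInt(2, 1);          // placeholder: look up the real continent id
                        insert.setInt(3, 1);          // placeholder: look up the real type id
                        insert.addBatch();
                        if (++count % 10_000 == 0) {  // flush every 10k rows
                            insert.executeBatch();
                            conn.commit();
                        }
                    }
                    insert.executeBatch();
                    conn.commit();
                }
                // A user search is now a single indexed query instead of a file scan
                try (PreparedStatement query = conn.prepareStatement(
                        "SELECT a.name, c.name FROM animal a " +
                        "JOIN continent c ON a.continent_id = c.id WHERE a.name = ?")) {
                    query.setString(1, "cheetah");
                    try (ResultSet rs = query.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getString(1) + " / " + rs.getString(2));
                        }
                    }
                }
            }
        }
    }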

Upvotes: 2
