user5045814

Reputation:

Working with huge file (>10GB)

I was googling and didn't find an answer. I have a huge file (>10GB) that I can't store in memory. The words are divided by "|". I need to find the top 100,000 most frequently used phrases.

So I am going to read this file line by line using an InputStream, so I only need memory for one line. Then I'm planning to parse each line into phrases.

But how can I store the phrases? I want to use a file for this (format: @Phrase@ @Count@). The file structure could look like this:

Phrase    | Count
"Phrase1" | 17
"Phrase2" | 5
"Phrase3" | 6

Each time I get a phrase, I look it up in the file; if there is no such phrase, I append it to the end of the file and set its count to 1. Otherwise, I increment the count of that phrase.

Is it possible to do this? I mean, to write to a certain position in a file? If so, how can I do it? Maybe there are some libraries for this? Or any other suggestions?
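Update: it looks like java.io.RandomAccessFile can seek to an absolute byte offset and overwrite in place. A minimal sketch of the idea, assuming a hypothetical fixed-width, zero-padded count field so that updating a count never changes the record's size:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class SeekDemo {
    // Overwrite a fixed-width count at a known byte offset within the file.
    // The count is zero-padded to 8 digits so the record never grows.
    static void writeCount(RandomAccessFile raf, long offset, int count) throws IOException {
        raf.seek(offset);                             // jump to the count field
        raf.writeBytes(String.format("%08d", count)); // overwrite in place
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("counts", ".txt");
        try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "rw")) {
            raf.writeBytes("\"Phrase1\" 00000001\n");
            writeCount(raf, 10, 17);  // offset 10 = start of the count field
        }
        System.out.println(Files.readString(p)); // "Phrase1" 00000017
        Files.delete(p);
    }
}
```

This only works for updating counts in place; appending a new phrase still means a linear scan to find out whether it already exists, which is the slow part.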

Upvotes: 1

Views: 185

Answers (2)

Andreas

Reputation: 159215

Since your goal is finding equal values, sorting all the phrases will work, but since you don't have enough memory to store all the data at once, a disk-based merge-sort is likely your best option.

On Wikipedia, it's called an External merge sort:

One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, consider sorting 900 megabytes of data using only 100 megabytes of RAM.
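The two phases above can be sketched in Java: sort each chunk that fits in RAM and spill it to a temp file, then k-way merge the sorted files. Equal phrases arrive adjacent during the merge, so a single running counter totals each phrase without holding all of them in memory. A simplified sketch (chunk sizing and error handling omitted for brevity):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

public class ExternalSortCount {
    // One cursor per sorted chunk file: the current line plus its reader.
    static final class Cursor {
        String line;
        final BufferedReader reader;
        Cursor(BufferedReader r) throws IOException { reader = r; line = r.readLine(); }
        boolean advance() throws IOException { line = reader.readLine(); return line != null; }
    }

    // Phase 1: sort an in-memory chunk and spill it to disk.
    static Path spillSortedChunk(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path tmp = Files.createTempFile("chunk", ".txt");
        Files.write(tmp, chunk);
        return tmp;
    }

    // Phase 2: k-way merge via a min-heap; equal phrases come out adjacent,
    // so one running counter per distinct phrase is all the merge needs.
    static List<Map.Entry<String, Integer>> mergeAndCount(List<Path> chunks) throws IOException {
        PriorityQueue<Cursor> heap = new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.line));
        for (Path p : chunks) {
            Cursor c = new Cursor(Files.newBufferedReader(p));
            if (c.line != null) heap.add(c);
        }
        List<Map.Entry<String, Integer>> totals = new ArrayList<>();
        String current = null;
        int count = 0;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            if (c.line.equals(current)) {
                count++;
            } else {
                if (current != null) totals.add(Map.entry(current, count));
                current = c.line;
                count = 1;
            }
            if (c.advance()) heap.add(c); else c.reader.close();
        }
        if (current != null) totals.add(Map.entry(current, count));
        return totals;
    }

    public static void main(String[] args) throws IOException {
        List<Path> chunks = List.of(
                spillSortedChunk(new ArrayList<>(List.of("b", "a", "b"))),
                spillSortedChunk(new ArrayList<>(List.of("a", "c"))));
        System.out.println(mergeAndCount(chunks)); // [a=2, b=2, c=1]
        for (Path p : chunks) Files.delete(p);
    }
}
```

The merged output is one (phrase, total) pair per distinct phrase; a bounded min-heap of size 100,000 over that stream would then give the top phrases in a second pass.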

Upvotes: 2

victor

Reputation: 812

Do not write to the file as you go along. Instead, keep a data structure of key-value pairs, where the key is the phrase and the value is the number of times it appears. Once you have read through the input file in its entirety and everything is counted and properly stored in your data structure, then and only then should you output the contents of the data structure to a text file in your chosen format.
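In Java terms, that data structure would typically be a HashMap<String, Integer>. A minimal sketch, assuming the set of distinct phrases (not the whole file) fits in memory, which is the catch with a >10GB input:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class PhraseCounter {
    // Count phrase frequencies in a single pass over the input.
    // Memory use is proportional to the number of DISTINCT phrases.
    static Map<String, Integer> count(BufferedReader in) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        String line;
        while ((line = in.readLine()) != null) {
            for (String phrase : line.split("\\|")) {  // phrases are "|"-separated
                phrase = phrase.trim();
                if (!phrase.isEmpty()) counts.merge(phrase, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("foo|bar|foo\nbar|baz"));
        System.out.println(count(in)); // foo=2, bar=2, baz=1 (map order unspecified)
    }
}
```

After the pass, the top 100,000 entries can be selected with a bounded priority queue rather than sorting the whole map.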

Upvotes: 0
