Harbinger

Reputation: 762

Java - Millions of records, HashMap throws OutOfMemoryError

I'm reading a file and parsing a few of the fields of each record as a reference key and another field as the reference value. These keys and values are then used by another process, so I chose a HashMap to look up the value for each key easily.

But each of the files consists of tens of millions of records, so the HashMap throws an OutOfMemoryError. I suspect increasing the heap memory will not be a good solution, since the input files may grow in the future.

For similar questions on SO, most answers suggest using a database. I fear I won't be given the option to use a DB. Is there any other way to handle the problem?

EDIT: I need to do this HashMap loading for 4 such files, and I need all four, because if I don't find a matching entry for my input in the first map, I have to look in the second, then the third, and finally the fourth.

EDIT 2: The files I have sum up to around 1 GB.

EDIT 3:

034560000010000001750                                  
000234500010000100752                            
012340000010000300374

I have records like these in a file. I need 03456000001000000 as the key and 1750 as the value, for all the millions of records. I'll look up these keys and use the values in my other process.

Upvotes: 2

Views: 3554

Answers (7)

Has QUIT--Anony-Mousse

Reputation: 77454

Using a database will not, by itself, reduce memory cost or runtime.

However, the default hashmaps may not be what you are looking for, depending on your data types. When used with boxed primitives such as Integer, Java HashMaps have a massive memory overhead. In a HashMap<Integer, Integer>, every entry uses something like 24+16+16 bytes. Unused entries (and the hashmap keeps up to half of them unused) take 4 bytes extra. So you can roughly estimate more than 56 bytes per int->int entry in a Java HashMap<Integer, Integer>.

If you encode the integers as Strings, and we're talking maybe 6-digit numbers, that is likely 24 bytes for the underlying char[] array (16-bit characters; 12 bytes of overhead for the array, and sizes are a multiple of 8!), plus 16 bytes for the surrounding String object (maybe 24, too), for the key and the value each. So that is then around 24+40+40, i.e. over 104 bytes per entry. (Update: as your keys are 17 characters long, make this 24+62+40, i.e. 136 bytes.)

If you used a primitive hashmap such as GNU Trove's TIntIntHashMap, it would only take 8 bytes + unused entries, so let's estimate 16 bytes per entry, at least 6 times less memory. (Update: for TLongIntHashMap, estimate 12 bytes per entry, or 24 bytes with the overhead of unused buckets.)
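For the record layout in the question (a 17-digit key followed by the value), a rough sketch of such a primitive map might look like this, assuming GNU Trove 3.x is on the classpath; the file name and field widths are taken from the example records and would need to be adjusted to the real format:

import gnu.trove.map.hash.TLongIntHashMap;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class PrimitiveMapLoader {
    public static void main(String[] args) throws IOException {
        // 17-digit keys fit comfortably into a long, 4-digit values into an int,
        // so no boxed Integer/Long objects are created per entry.
        TLongIntHashMap map = new TLongIntHashMap();

        try (BufferedReader reader = new BufferedReader(new FileReader("records.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                // First 17 characters are the key, the rest is the value.
                long key = Long.parseLong(line.substring(0, 17));
                int value = Integer.parseInt(line.substring(17).trim());
                map.put(key, value);
            }
        }

        // Lookup is a plain primitive call, no boxing involved.
        System.out.println(map.get(Long.parseLong("03456000001000000"))); // 1750
    }
}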

Now you could also just store everything in a massive sorted list. This will allow you to perform a fast join operation (or a binary-search lookup), you will lose much of the overhead of unused entries, and you can probably fit twice as many records and process them in much less time.
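A minimal sketch of that idea with two parallel primitive arrays and binary search (the class and field names are made up for illustration; it assumes the keys were sorted once after loading, with the values permuted to match):

import java.util.Arrays;

public class SortedLookup {
    private final long[] keys;   // sorted ascending
    private final int[] values;  // values[i] belongs to keys[i]

    public SortedLookup(long[] sortedKeys, int[] matchingValues) {
        this.keys = sortedKeys;
        this.values = matchingValues;
    }

    /** Returns the value for the key, or -1 if the key is not present. */
    public int get(long key) {
        int idx = Arrays.binarySearch(keys, key);
        return idx >= 0 ? values[idx] : -1;
    }
}

Two primitive arrays cost roughly 12 bytes per entry with no per-entry object overhead; the trade-off is the one-time sort and O(log n) lookups instead of O(1).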

Oh, and if you know the valid value range, you can abuse an array as "hashmap".

I.e. if your valid keys are 0...999999, then just use an int[1000000] as storage, and write each entry into the appropriate row. Don't store the key at all - it's the offset in the array.

Last but not least, Java by default only uses 25% of your memory. You probably want to increase its memory limit.

Upvotes: 3

Stefano Sanfilippo

Reputation: 33046

Short answer: no. It's quite clear that you can't load your entire dataset in memory. You need a way to keep it on disk together with an index, so that you can access the relevant bits of the dataset without rescanning the whole file every time a new key is requested.

Essentially, a DBMS is a mechanism for handling (large) quantities of data: storing, retrieving, combining, filtering etc. They also provide caching for commonly used queries and responses. So anything you are going to do will be a (partial) reimplementation of what a DBMS already does.

I understand your concerns about having an external component to depend on; however, note that a DBMS is not necessarily a server daemon. There are tiny DBMSs that link with your program and keep the whole dataset in a file, like SQLite does.
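As an illustration only (not necessarily your exact setup), a file-backed key/value table with SQLite could look roughly like this, assuming the xerial sqlite-jdbc driver is on the classpath; the file, table, and column names are invented for the example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqliteLookup {
    public static void main(String[] args) throws Exception {
        // Everything lives in a single local file, no server needed.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:lookup.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS lookup (k TEXT PRIMARY KEY, v TEXT)");
            }

            // Insert (normally done in batches inside a transaction while streaming the file).
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT OR REPLACE INTO lookup (k, v) VALUES (?, ?)")) {
                ins.setString(1, "03456000001000000");
                ins.setString(2, "1750");
                ins.executeUpdate();
            }

            // Indexed lookup by key instead of an in-memory HashMap.
            try (PreparedStatement sel = conn.prepareStatement(
                    "SELECT v FROM lookup WHERE k = ?")) {
                sel.setString(1, "03456000001000000");
                try (ResultSet rs = sel.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }
}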

Upvotes: 2

wassgren

Reputation: 19211

Since you write that you fear you will not be given the option to use a database, some kind of embedded DB might be the answer. If it is impossible to keep everything in memory, it must be stored somewhere else.

I believe that some kind of embedded database that uses the disk as storage might work. Examples include BerkeleyDB and Neo4j. Since both databases use a file index for fast lookups, the memory load is lower than if you keep the entire dataset in memory, but they are still fast.

Upvotes: 0

Peter Lawrey

Reputation: 533510

You have to either

a) have enough memory to load the data into memory, or

b) read the data from disk, with an index that is either in memory or on disk.

Whether you use a database or not, the problem is much the same. If you don't have enough memory, you will see a dramatic drop in performance once you start randomly accessing the disk.

There are alternatives like Chronicle Map which use off-heap memory and perform well up to double your main memory size, so you won't get an out-of-memory error; however, you still have the problem that you can't hold more data in memory than you have main memory.
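A rough sketch of what that could look like, assuming Chronicle Map 3.x and its builder API; the file name and the entry count are placeholders, not tuned values:

import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;

public class OffHeapLookup {
    public static void main(String[] args) throws Exception {
        // Entries live off-heap in a memory-mapped file, so they don't count
        // against the Java heap and survive restarts.
        try (ChronicleMap<Long, Integer> map = ChronicleMap
                .of(Long.class, Integer.class)
                .entries(50_000_000)                  // expected number of records
                .createPersistedTo(new File("lookup.dat"))) {

            map.put(3456000001000000L, 1750);
            System.out.println(map.get(3456000001000000L));
        }
    }
}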

Upvotes: 1

Master Slave

Reputation: 28519

The memory footprint depends on how you approach the file in Java. A widely used solution is based on streaming the file using the Apache Commons IO LineIterator. The recommended usage:

 LineIterator it = FileUtils.lineIterator(file, "UTF-8");
 try {
   while (it.hasNext()) {
     String line = it.nextLine();
     // do something with line
   }
 } finally {
   it.close();
 }

It's an optimized approach, but if the file is too big, you can still end up with an OutOfMemoryError.
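Applied to the record layout from the question, a usage sketch could look like this (the file name and the 17-character key split come from the example records and are assumptions about the real format):

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class StreamingLoader {
    public static void main(String[] args) throws IOException {
        Map<String, String> map = new HashMap<>();
        LineIterator it = FileUtils.lineIterator(new File("records.txt"), "UTF-8");
        try {
            while (it.hasNext()) {
                String line = it.nextLine();
                if (line.trim().isEmpty()) {
                    continue;
                }
                // First 17 characters are the key, the remainder is the value.
                map.put(line.substring(0, 17), line.substring(17).trim());
            }
        } finally {
            it.close();
        }
        System.out.println(map.get("03456000001000000")); // prints 1750
    }
}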

Upvotes: 0

Kelevandos

Reputation: 7082

Such large data collections should be handled with a database. Java programs are limited in memory, and the limit varies from device to device. You provided no info about your program, but remember that if it is run on different devices, some of them may have very little RAM and will crash very quickly. A DB (be it SQL or file-based) is a must when it comes to large-data programs.

Upvotes: 1

user2393256

Reputation: 1150

You could try lazy loading it.
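The answer doesn't elaborate, but one possible reading is to defer building each of the four maps until it is first needed, so files that are never queried are never loaded. A hypothetical sketch, where the loadFromFile helper is invented for illustration:

import java.util.HashMap;
import java.util.Map;

public class LazyLookup {
    private final String fileName;
    private Map<String, String> map; // null until first use

    public LazyLookup(String fileName) {
        this.fileName = fileName;
    }

    public String get(String key) {
        if (map == null) {
            map = loadFromFile(fileName); // parse the file only on first access
        }
        return map.get(key);
    }

    private static Map<String, String> loadFromFile(String fileName) {
        // Placeholder: read the file and fill the map, e.g. as in the other answers.
        return new HashMap<>();
    }
}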

Upvotes: -1
