Petro

Reputation: 3652

How can I optimize this HashMap with 42,000 keys

I have a CSV file with 42,000 lines in the following pattern:

03055,Milford,NH
03057,Mont Vernon,NH
03060,Nashua,NH

I tried to store the data in a HashMap using the zip code as the key, like this:

while ((line = stream_in.readLine())!=null) {
    LocationBean temp_location_bean = new LocationBean();
    String line_trimmed = line.trim();
    String[] line_chunked = line_trimmed.split(",",4);
    temp_location_bean.setZip_code(line_chunked[0]);
    temp_location_bean.setCity(line_chunked[1]);
    temp_location_bean.setState(line_chunked[2]);
    this.locations_as_beans_list.put(line_chunked[0], temp_location_bean);
}

But when I go to do a lookup:

 for(Map.Entry<String, LocationBean> location_object : this.locations_as_beans_list.entrySet())
 {
     LocationBean temp_location_bean = location_object.getValue();
     if (params[0].matches(temp_location_bean.getZip_code())) {
         master_location = temp_location_bean.getCity() + "," 
             + temp_location_bean.getState()
             + ", (" + temp_location_bean.getZip_code() +")";
     }
 }

It takes over 20 seconds. Shouldn't the lookup be relatively quick? How can I improve the performance here?

tl;dr how can I optimize the reads in this example?

Upvotes: 4

Views: 521

Answers (5)

Elliott Frisch

Reputation: 201439

If you're looking for performance then you shouldn't iterate the entrySet to look up a keyed zip code. Instead, use the HashMap and get the value by its key. For example,

LocationBean temp_location_bean = this.locations_as_beans_list.get(params[0]);
if (temp_location_bean != null) {
    master_location = temp_location_bean.getCity() + "," 
            + temp_location_bean.getState() 
            + ", (" + temp_location_bean.getZip_code() +")";
}
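
For the get(...) call above to return a hit, the map has to be keyed by the zip-code string parsed from the CSV. A minimal sketch of the loading side, reusing the LocationBean setters and stream_in reader from the question:

Map<String, LocationBean> locations_as_beans_list = new HashMap<>();
String line;
while ((line = stream_in.readLine()) != null) {
    String[] fields = line.trim().split(",", 3);
    LocationBean bean = new LocationBean();
    bean.setZip_code(fields[0]);
    bean.setCity(fields[1]);
    bean.setState(fields[2]);
    // key the map by the zip code so get(zip) is a constant-time lookup
    locations_as_beans_list.put(bean.getZip_code(), bean);
}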

Upvotes: 5

卢声远 Shengyuan Lu

Reputation: 32004

If a regular-expression-based lookup really is necessary, a HashMap is not the right data structure. A List is a better fit, because you have to apply the regular expression to each element in a loop anyway.

My suggestion:

You could split your large data set into several Lists, search each list on its own thread, and collect the results (a sketch follows below).

PLUS: full MapReduce is a little heavy for handling 40k records on a single machine, I think.
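
A minimal sketch of the parallel-search idea, assuming the beans are already loaded into a List<LocationBean> named locations and that params[0] holds the pattern to match (Java 8+; parallelStream() does the list splitting and thread management internally):

String zipPattern = params[0];                          // pattern supplied by the caller
List<LocationBean> matches = locations.parallelStream()
        .filter(bean -> bean.getZip_code().matches(zipPattern))
        .collect(Collectors.toList());                  // results gathered from all threads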

Upvotes: 1

CuriousMind

Reputation: 3173

The fundamental problem with the approach in the question is iterating over the map and comparing every entry with the query field. That's wrong: a HashMap is not meant to be iterated, it is optimized for lookup by key. So the simplest trick to quickly gain performance is to use the key and retrieve the value directly with the HashMap's get method (the keys are hashed, so a get with the key is a fast lookup).

If you want to take it one step further, look at specialized libraries like Javolution. Instead of creating an Entry object for every item in the map, its implementation stores entries directly by their hashed keys, which gives a significant improvement in memory as well as performance (no new object creation for every entry).
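
As an illustration only, a rough sketch of what that swap might look like, assuming Javolution's javolution.util.FastMap (which implements java.util.Map, so the surrounding put/get code stays the same):

// assumes javolution.util.FastMap is on the classpath
Map<String, LocationBean> locations = new FastMap<String, LocationBean>();

LocationBean bean = new LocationBean();
bean.setZip_code("03055");
bean.setCity("Milford");
bean.setState("NH");
locations.put(bean.getZip_code(), bean);     // same Map API as HashMap
LocationBean found = locations.get("03055"); // fast lookup by hashed key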

Upvotes: 4

Yair Zaslavsky

Reputation: 4137


Several suggestions here -

A. Use something like sharding - divide your data into several maps, run a thread per map, and collect the results (think of it as a nice exercise in MapReduce).

B. matches() - why are you using matches()? There is a performance hit there.
Do you REALLY need something as generic as a regex match? Write more specific code for the matching, e.g. a plain equals() on the zip code.

C. In your entrySet() loop, where do you ever use getKey()? Why not just iterate over values() instead (see the sketch below)?
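
A minimal sketch of points B and C combined, reusing the map and variables from the question:

// iterate values() directly and compare with equals() instead of a regex match
for (LocationBean temp_location_bean : this.locations_as_beans_list.values()) {
    if (params[0].equals(temp_location_bean.getZip_code())) {
        master_location = temp_location_bean.getCity() + ","
                + temp_location_bean.getState()
                + ", (" + temp_location_bean.getZip_code() + ")";
        break;   // zip codes are unique, so stop at the first hit
    }
}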

Upvotes: 0

Vinayak Pingale

Reputation: 1315

There are many ways to approach performance optimization here; it is not just a question of the data structure, but also of how the data is read and parsed.

There are various points to remember when optimization comes into the picture; achieving a performance boost is one of the most vital concerns.

1. Reading the file - a BufferedReader takes a fairly constant 6-7 seconds to parse an 878 MB file. How can you reduce that?

a. You can read the file through Java NIO's FileChannel (e.g. obtained from a RandomAccessFile), which reduced it to about 0.16-0.19 seconds for the same file.

b. Use asynchronous reads of the file (e.g. AsynchronousFileChannel).

2. Dealing with data processing

a. Using Runtime.getRuntime().availableProcessors() you can get the number of processors available on the machine and spawn that many threads for data processing.

b. Multithreading plays a major role in achieving performance.

The points above are worth spending time on to reduce the overall run time; a sketch of the file-channel read and multi-threaded parsing follows.
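
A self-contained sketch of both ideas - reading the file through an NIO FileChannel and parsing it on one thread per available processor. The file name, class name, and map layout below are illustrative, not taken from the question:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FastZipLoad {
    public static void main(String[] args) throws Exception {
        // 1. Read the whole file through an NIO channel instead of line-by-line BufferedReader I/O.
        try (RandomAccessFile raf = new RandomAccessFile("zipcodes.csv", "r");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            String[] lines = new String(bytes, StandardCharsets.US_ASCII).split("\\r?\\n");

            // 2. Parse on as many threads as the machine has processors.
            int threads = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            Map<String, String[]> byZip = new ConcurrentHashMap<>();
            int chunk = (lines.length + threads - 1) / threads;
            for (int start = 0; start < lines.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, lines.length);
                pool.submit(() -> {
                    for (int i = from; i < to; i++) {
                        String[] fields = lines[i].trim().split(",", 3);
                        if (fields.length == 3) {
                            byZip.put(fields[0], fields);   // key by zip code
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            System.out.println(byZip.size() + " zip codes loaded");
        }
    }
}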

Upvotes: 4
