Reputation: 622

Improving speed and memory consumption when handling ArrayList with 100 million elements

I work with text files containing short strings (10 digits each). Each file is approx. 1.5 GB, so the number of rows reaches 100 million.

Every day I get another file and need to extract new elements (tens of thousands a day).

What's the best approach to solve my problem?

I tried loading the data into an ArrayList. Loading takes around 20 seconds per file, but subtracting one list from the other takes forever.

I use this code:

dataNew.removeAll(dataOld);

I tried loading the data into HashSets, but creating the HashSets never finishes. The same happens with LinkedHashSet.

Tried to load into ArrayLists and to sort only one of them

Collections.sort(dataNew);

but it didn't speed up the process of

dataNew.removeAll(dataOld);

Memory consumption is also rather high: sort() finishes only with a 15 GB heap (13 GB is not enough).

I also tried the good old Linux utility diff, and it finished the task in 76 minutes (while eating 8 GB of RAM).

So my goal is to solve the problem in Java within 1 hour of processing time (or less, of course) while using at most 15 GB of RAM (or better, 8-10 GB).

Any suggestions, please? Maybe I need something other than alphabetical sorting of the ArrayList?

UPDATE: This is a country-wide list of invalid passports. It is published as a global list, so I need to extract the delta myself.

The data is unsorted and each row is unique, so I must compare 100M elements with 100M elements. A data line looks like, for example, "2404,107263". Converting to an integer is not possible.

Interestingly, when I increased the maximum heap size to 16 GB

java -Xms5G -Xmx16G -jar utils.jar

loading into a HashSet became fast (50 seconds for the first file), but the program gets killed by the system's out-of-memory killer, as it eats enormous amounts of RAM while loading the second file into a second HashSet or ArrayList.

My code is very simple:

List<String> setL = Files.readAllLines(Paths.get("filename"));
HashSet<String> dataNew = new HashSet<>(setL);

on second file the program gets

Killed

[1408341.392872] Out of memory: Kill process 20538 (java) score 489 or sacrifice child
[1408341.392874] Killed process 20531 (java) total-vm:20177160kB, anon-rss:16074268kB, file-rss:0kB

UPDATE2:

Thanks for all your ideas!

Final solution is: converting lines to Long + using fastutil library (LongOpenHashSet)

RAM consumption became 3.6Gb and processing time only 40 seconds!

Interesting observation: starting java with default settings made loading 100 million Strings into the JDK's native HashSet endless (I interrupted it after 1 hour), whereas starting with -Xmx16G sped the loading up to 1 minute. Memory consumption was ridiculous (around 20 GB), but processing speed was rather fine: 2 minutes.

If someone is not limited by RAM, the native JDK HashSet is not so bad in terms of speed.

P.S. Maybe the task is not clearly explained, but I do not see any way to avoid loading at least one file entirely. So I doubt memory consumption can be lowered by much more.

Upvotes: 4

Answers (8)

Alex Karasev

Reputation: 1128

The main problem is the repeated resizing of the ArrayList while readAllLines() runs. A LinkedList is a better choice for inserting the data:

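// Needs java.io.*, java.nio.charset.StandardCharsets, java.nio.file.*, java.util.*
// Enclosing method added so the snippet compiles; the name readLines is illustrative.
static List<String> readLines(Path path) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
        List<String> result = new LinkedList<>();
        for (;;) {
            String line = reader.readLine();
            if (line == null)
                break;          // end of file
            result.add(line);   // appending to a LinkedList never resizes a backing array
        }
        return result;
    }
}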

Upvotes: -1

Oleg Gritsak

Reputation: 622

Thanks for all your ideas!

Final solution is: converting lines to Long + using fastutil library (LongOpenHashSet)

RAM consumption became 3.6Gb and processing time only 40 seconds!

If someone is not limited by RAM, the native JDK HashSet is not so bad in terms of speed.

P.S. Maybe the task is not clearly explained, but I do not see any way to avoid loading at least one file entirely. So I doubt memory consumption can be lowered by much more.
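
The conversion to long mentioned above can be sketched with the JDK alone. This is only a minimal sketch, not the fastutil-based code used for the final solution: it swaps LongOpenHashSet for a sorted long[] plus Arrays.binarySearch so that it runs without extra dependencies. The class name PassportDelta and the methods encode and added are made up for illustration, and the sketch assumes every line contains the same number of digits.

```java
import java.util.Arrays;

// Hypothetical helper: packs a line such as "2404,107263" into one long
// and finds new entries with a sorted primitive array (JDK only).
final class PassportDelta {

    // Keeps only the digits. Assumes every line has the same number of digits,
    // so dropping the comma cannot make two different lines collide.
    static long encode(String line) {
        long value = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c >= '0' && c <= '9') {
                value = value * 10 + (c - '0');
            }
        }
        return value;
    }

    // Returns the values of newData that do not occur in oldData.
    static long[] added(long[] oldData, long[] newData) {
        long[] sortedOld = oldData.clone();   // cloned for clarity; sorting in place avoids the copy
        Arrays.sort(sortedOld);
        return Arrays.stream(newData)
                .filter(v -> Arrays.binarySearch(sortedOld, v) < 0)
                .toArray();
    }

    public static void main(String[] args) {
        long[] oldData = {encode("2404,107263"), encode("2404,107264")};
        long[] newData = {encode("2404,107264"), encode("2404,107265")};
        System.out.println(Arrays.toString(added(oldData, newData))); // prints [2404107265]
    }
}
```

A long[] with 100 million entries needs 100,000,000 * 8 B = 800 MB, which is why primitive values are so much cheaper here than String objects.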

Upvotes: 3

Ivan Senic

Reputation: 649

A String object holding 11 characters (in fact up to 12) has a size of 64 bytes (on 64-bit Java with compressed oops). The only structure that can hold so many elements and stay at a reasonable size is an array:

100,000,000 * (64 B per String object + 4 B per reference) = 6,800,000,000 B ~ 6.3 GB

So you can immediately forget about Maps, Sets, etc., as they introduce too much memory overhead. An array is actually all you need. My approach would be:

1. Load the "old" data into an array and sort it (this should be fast enough).
2. Create a backing array of primitive booleans with the same size as the loaded array (you can use a BitSet here as well).
3. Read the new data file line by line. Use binary search to check whether the passport data exists in the old data array. If the item exists, mark its index in the boolean array/BitSet as true (the binary search returns the index). If the item does not exist, just save it somewhere (an ArrayList will do).
4. When all lines are processed, remove from the old array all items that have false in the boolean array/BitSet (check by index, of course). Finally, add to the array all the new data you saved.
5. Optionally sort the array again and save it to disk, so that next time you load it you can skip the initial sorting.
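
The steps above can be sketched roughly as follows. This is a minimal illustration under assumptions, not a tuned implementation: the names SortedArrayDiff, Delta and diff are invented here, and the new data is passed as an Iterable, whereas in practice it would be streamed from a BufferedReader line by line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

final class SortedArrayDiff {

    // Holds both sides of the delta; class and field names are illustrative.
    static final class Delta {
        final List<String> added = new ArrayList<>();
        final List<String> removed = new ArrayList<>();
    }

    static Delta diff(String[] oldLines, Iterable<String> newLines) {
        String[] sorted = oldLines.clone();
        Arrays.sort(sorted);                        // sort the old data once
        BitSet seen = new BitSet(sorted.length);    // one bit per old entry
        Delta delta = new Delta();
        for (String line : newLines) {              // one binary search per new line
            int idx = Arrays.binarySearch(sorted, line);
            if (idx >= 0) {
                seen.set(idx);                      // still present in the new file
            } else {
                delta.added.add(line);              // not in the old file
            }
        }
        // Old entries whose bit was never set are missing from the new file.
        for (int i = seen.nextClearBit(0); i < sorted.length; i = seen.nextClearBit(i + 1)) {
            delta.removed.add(sorted[i]);
        }
        return delta;
    }

    public static void main(String[] args) {
        String[] old = {"2404,107263", "2404,107264", "2404,107265"};
        Delta d = diff(old, Arrays.asList("2404,107264", "2404,107266"));
        System.out.println("added=" + d.added + " removed=" + d.removed);
    }
}
```

Because each row is unique, Arrays.binarySearch always lands on the single matching index, so one bit per old entry is enough.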

This should be fast enough, in my opinion. The initial sort is O(n log(n)) and each binary search is O(log(n)), so you should end up with (excluding the final removal and adding, which is at most 2n):

n log(n) (sort) + n log(n) (binary check for n elements) = 2 n log(n)

Other optimizations would be possible if you explained more about the structure of that String (whether there is some pattern or not).

Upvotes: 0

Serge Rogatch

Reputation: 15070

You can use a trie data structure for such cases: http://www.toptal.com/java/the-trie-a-neglected-data-structure The algorithm would be as follows:

1. Read the old file line by line and store each line in the trie.
2. Read the new file line by line and test whether each line is in the trie: if it is not, then it is a newly added line.
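
As a rough illustration of the two steps, here is a minimal digit trie. The class name DigitTrie and its methods are hypothetical, and it assumes that only the decimal digits of a line matter, so the comma in "2404,107263" is skipped. It shows the lookup logic only and is not a memory-optimized layout: every node is a separate object with a 10-slot array.

```java
final class DigitTrie {

    // One slot per decimal digit; non-digit characters such as ',' are skipped.
    private static final class Node {
        final Node[] next = new Node[10];
        boolean terminal;   // true if a stored line ends at this node
    }

    private final Node root = new Node();

    // Inserts the line; returns true if it was not present before.
    boolean add(String line) {
        Node node = root;
        for (int i = 0; i < line.length(); i++) {
            int d = line.charAt(i) - '0';
            if (d < 0 || d > 9) continue;
            if (node.next[d] == null) node.next[d] = new Node();
            node = node.next[d];
        }
        boolean isNew = !node.terminal;
        node.terminal = true;
        return isNew;
    }

    // Returns true if exactly this digit sequence was added earlier.
    boolean contains(String line) {
        Node node = root;
        for (int i = 0; i < line.length(); i++) {
            int d = line.charAt(i) - '0';
            if (d < 0 || d > 9) continue;
            node = node.next[d];
            if (node == null) return false;
        }
        return node.terminal;
    }
}
```

Usage would be to call add for every line of the old file, then report each line of the new file for which contains returns false.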

A further memory optimization can take advantage of the fact that there are only 10 digits, so 4 bits are enough to store a digit (instead of 2 bytes per character in Java). You may need to adapt the trie data structure from one of the following links:

Upvotes: 0

matt

Reputation: 12346

I made a very simple spell checker, and just checking whether a word was in the dictionary was too slow for whole documents. I created a map structure, and it works great.

Map<String, List<String>> dictionary;

For the key, I use the first 2 letters of the word. The list holds all the words that start with the key. To speed it up a bit more you can sort the list and then use a binary search to check for existence. I'm not sure what the optimum key length is, and if your key gets too long you could nest the maps. Eventually it becomes a tree. A trie structure is possibly the best choice, actually.
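
A minimal sketch of that map structure, assuming a fixed 2-character key and sorted buckets. The class name PrefixIndex and its members are invented for this example and are not taken from the spell checker itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PrefixIndex {

    private static final int KEY_LENGTH = 2;   // first 2 characters, as described above
    private final Map<String, List<String>> buckets = new HashMap<>();

    PrefixIndex(Iterable<String> words) {
        for (String w : words) {
            buckets.computeIfAbsent(key(w), k -> new ArrayList<>()).add(w);
        }
        // Sort every bucket once so lookups can use binary search.
        for (List<String> bucket : buckets.values()) {
            Collections.sort(bucket);
        }
    }

    // Words shorter than the key length are used as their own key.
    private static String key(String w) {
        return w.length() <= KEY_LENGTH ? w : w.substring(0, KEY_LENGTH);
    }

    boolean contains(String w) {
        List<String> bucket = buckets.get(key(w));
        return bucket != null && Collections.binarySearch(bucket, w) >= 0;
    }
}
```

With digit-only data a 2-character key gives at most 100 buckets, so each bucket would still hold around a million entries for 100 million rows; a longer key trades more map entries for shorter lists.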

Upvotes: 0

tucuxi

Reputation: 17945

Use a database; to keep things simple, use a Java-embedded database (Derby, HSQL, H2, ...). With that much information, you can really benefit from standard DB caching, time-efficient storage, and querying. Your pseudo-code would be:

if first use,
   define new one-column table, setting column as primary-key
   iterate through input records, for each:
       insert record into table
otherwise
   open database with previous records
   iterate through input records, for each:
       lookup record in DB, update/report as required

Alternatively, you can do even less work if you use existing "table-diff" libraries, such as DiffKit - from their tutorial:

java -jar ../diffkit-app.jar -demoDB
Then configure a connection to this demo database within your favorite JDBC enabled database browser [...] Your DB browser will show you the tables TEST10_LHS_TABLE and TEST10_RHS_TABLE (amongst others) populated with the data values from the corresponding CSV files.

That is: DiffKit does essentially what I proposed above, loading files into database tables (they use embedded H2) and then comparing these tables through DB queries.

They accept input as CSV files; but conversion from your textual input to their CSV can be done in a streaming fashion in less than 10 lines of code. And then you just need to call their jar to do the diff, and you would get the results as tables in their embedded DB.

Upvotes: 0

Petr Janeček

Reputation: 38444

First of all, don't do Files.readAllLines(Paths.get("filename")) and then pass everything to a Set; that holds unnecessarily huge amounts of data in memory. Try to hold as few lines as possible at all times.

Read the files line-by-line and process as you go. This immediately cuts your memory usage by a lot.

Set<String> oldData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("oldData"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // process your line, maybe add to the Set for the old data?
        oldData.add(line);
    }
}

Set<String> newData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("newData"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // Is it enough just to remove from old data so that you'll end up with only the difference between old and new?
        boolean oldRemoved = oldData.remove(line);
        if (!oldRemoved) {
            newData.add(line);
        }
    }
}

You'll end up with two sets containing only the data that is present in the old, or the new dataset, respectively.

Second of all, try to presize your containers if at all possible. Their size (usually) doubles when they reach their capacity, and that could potentially create a lot of overhead when dealing with big collections.
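
For example, presizing a HashSet could look like the sketch below. The helper names Presize, capacityFor and newHashSet are hypothetical; the only JDK facts relied on are the HashSet(int initialCapacity) constructor and the default load factor of 0.75.

```java
import java.util.HashSet;
import java.util.Set;

final class Presize {

    // Smallest initial capacity that keeps `expected` elements at or below
    // the default 0.75 load factor, so the table should not need to grow.
    static int capacityFor(int expected) {
        return (int) Math.min(Integer.MAX_VALUE, (long) Math.ceil(expected / 0.75));
    }

    static <T> Set<T> newHashSet(int expected) {
        return new HashSet<>(capacityFor(expected));
    }

    public static void main(String[] args) {
        // Roughly the row count mentioned in the question.
        System.out.println(capacityFor(100_000_000)); // prints 133333334
    }
}
```

For an ArrayList the equivalent is simply new ArrayList<>(expectedSize).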

Also, if your data are numbers, you could just use a long and hold that instead of holding instances of String. There are many collection libraries that enable you to do this, e.g. Koloboke, HPPC, HPPC-RT, GS Collections, fastutil, Trove. Even their collections for Objects might serve you very well, as a standard HashSet does a lot of unnecessary object allocation.

Upvotes: 3

Avis

Reputation: 2237

Please split each string into two parts and call intern() on whichever part (str1 or str2) is repeated most, to avoid storing duplicates of the same String on the heap. Here I call intern() on both parts just to show the idea, but don't use it on a part unless its values repeat often.

Set<MyObj> lineData = new HashSet<MyObj>();
String line = null;
BufferedReader bufferedReader = new BufferedReader(new FileReader(file.getAbsoluteFile()));
while((line = bufferedReader.readLine()) != null){
    String[] data = line.split(",");
    MyObj myObj = new MyObj();
    myObj.setStr1(data[0].intern());
    myObj.setStr2(data[1].intern());
    lineData.add(myObj);
}

public class MyObj {

    private String str1;
    private String str2;

    public String getStr1() {
        return str1;
    }

    public void setStr1(String str1) {
        this.str1 = str1;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((str1 == null) ? 0 : str1.hashCode());
        result = prime * result + ((str2 == null) ? 0 : str2.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        MyObj other = (MyObj) obj;
        if (str1 == null) {
            if (other.str1 != null)
                return false;
        } else if (!str1.equals(other.str1))
            return false;
        if (str2 == null) {
            if (other.str2 != null)
                return false;
        } else if (!str2.equals(other.str2))
            return false;
        return true;
    }

    public String getStr2() {
        return str2;
    }

    public void setStr2(String str2) {
        this.str2 = str2;
    }

}

Upvotes: 0
