Reputation: 622
I work with text files with short strings in it (10 digits). The size of file is approx 1.5Gb, so the number of rows is reaching 100 millions.
Every day I get another file and need to extract new elements (tens of thousands a day).
What's the best approach to solve my problem?
I tried to load data in ArrayList - it takes around 20 seconds for each file, but substraction of arrays takes forever.
I use this code:
dataNew.removeAll(dataOld);
Tried to load data in HashSets - creation of HashSets is endless. The same with LinkedHashset.
Tried to load into ArrayLists and to sort only one of them
Collections.sort(dataNew);
but it didn't speed up the process of
dataNew.removeAll(dataOld);
Also memory consumption is rather high - sort() finishes only with heap of 15Gb (13Gb is not enough).
I've tried to use old good linux util diff and it finished the task in 76 minutes (while eating 8Gb of RAM).
So, my goal is to solve the problem in Java within 1 hour of processing time (or less, of course) and with consumption of 15Gb (or better 8-10Gb).
Any suggestions, please? Maybe I need not alphabetic sorting of ArrayList, but something else?
UPDATE: This is a country-wide list of invalid passports. It is published as a global list, so I need to extract delta by myself.
Data is unsorted and each row is unique. So I must compare 100M elements with 100M elements. Dataline is for example, "2404,107263". Converting to integer is not possible.
Interesting, when I increased maximum heap size to 16Gb
java -Xms5G -Xmx16G -jar utils.jar
loading to HashSet became fast (50 seconds for first file), but program gets killed by system Out-Of-Memory killer, as it eats enormous amounts of RAM while loading second file to second HashSet or ArrayList
My code is very simple:
List<String> setL = Files.readAllLines(Paths.get("filename"));
HashSet<String> dataNew = new HashSet<>(setL);
on second file the program gets
Killed
[1408341.392872] Out of memory: Kill process 20538 (java) score 489 or sacrifice child [1408341.392874] Killed process 20531 (java) total-vm:20177160kB, anon-rss:16074268kB, file-rss:0kB
UPDATE2:
Thanks for all your ideas!
Final solution is: converting lines to Long + using fastutil library (LongOpenHashSet)
RAM consumption became 3.6Gb and processing time only 40 seconds!
Interesting observation. While starting java with default settings made loading 100 million Strings to JDK's native HashSet endless (I interrupted after 1 hour), starting with -Xmx16G speedup the process to 1 minute. But memory consumption was ridiculous (around 20Gb), processing speed was rather fine - 2 minutes.
If someone is not limited to by RAM, native JDK HashSet is not so bad in terms of speed.
p.s. Maybe the task is not clearly explained, but I do not see any opportunity not to load at least one file entirely. So, I doubt memory consumption can be further lowered by much.
Upvotes: 4
Views: 3542
Reputation: 1128
The main problem in numerous resizing ArrayList
when readAllLines()
occurs. Better choice is LinkedList
to insert data
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
List<String> result = new LinkedList<>();
for (;;) {
String line = reader.readLine();
if (line == null)
break;
result.add(line);
}
return result;
}
Upvotes: -1
Reputation: 622
Thank's for all your ideas!
Final solution is: converting lines to Long + using fastutil library (LongOpenHashSet)
RAM consumption became 3.6Gb and processing time only 40 seconds!
Interesting observation. While starting java with default settings made loading 100 million Strings to JDK's native HashSet endless (I interrupted after 1 hour), starting with -Xmx16G speedup the process to 1 minute. But memory consumption was ridiculous (around 20Gb), processing speed was rather fine - 2 minutes.
If someone is not limited to by RAM, native JDK HashSet is not so bad in terms of speed.
p.s. Maybe the task is not clearly explained, but I do not see any opportunity not to load at least one file entirely. So, I doubt memory consumption can be further lowered by much.
Upvotes: 3
Reputation: 649
The String object holding 11 characters (up to 12 in-fact) will have a size of 64 bytes (on 64bits Java with compressed oops). The only structure that can hold so much elements and be of a reasonable size is an array:
100,000,000 * (64b per String object + 4b per reference) = 6,800,000,000b ~ 6.3Gb
So you can immediately forget about Maps, Sets, etc as they introduce too much memory overhead.. But array is actually all you need. My approach would be:
false
in boolean array/bitset (check by index of course). And finally add to the array all the new data you saved somewhere. This should be fast enough imo. Initial sort is O(n log(n)), while the binary search is O(log(n)) thus you should end up with (excluding final removal + adding which can be max 2n):
n log(n) (sort) + n log(n) (binary check for n elements) = 2 n log(n)
There would be other optimizations possible if you would explain more on the structure of that String you have (if there is some pattern or not).
Upvotes: 0
Reputation: 15070
You can use a trie data structure for such cases: http://www.toptal.com/java/the-trie-a-neglected-data-structure The algorithm would be as follows:
A further memory optimization can take advantage that there are only 10 digits, so 4 bits is enough to store a digit (instead of 2 bytes per character in Java). You may need to adapt the trie data structure from one of the following links:
Upvotes: 0
Reputation: 12346
I made a very simple spell checker, just checking if a word was in the dictionary was too slow for whole documents. I created a map structure, and it works great.
Map<String, List<String>> dictionary;
For the key, I use the first 2 letters of the word. The list has all the words that start with the key. To speed it up a bit more you can sort the list, then use a binary search to check for existence. I'm not sure the optimum length of key, and if your key gets too long you could nest the maps. Eventually it becomes a tree. A trie structure is possibly the best actually.
Upvotes: 0
Reputation: 17945
Use a database; to keep things simple, use a Java-embedded database (Derby, HSQL, H2, ...). With that much information, you can really benefit from standard DB caching, time-efficient storage, and querying. Your pseudo-code would be:
if first use,
define new one-column table, setting column as primary-key
iterate through input records, for each:
insert record into table
otherwise
open database with previous records
iterate through input records, for each:
lookup record in DB, update/report as required
Alternatively, you can do even less work if you use existing "table-diff" libraries, such as DiffKit - from their tutorial:
java -jar ../diffkit-app.jar -demoDB
Then configure a connection to this demo database within your favorite JDBC enabled database browser [...] Your DB browser will show you the tables TEST10_LHS_TABLE and TEST10_RHS_TABLE (amongst others) populated with the data values from the corresponding CSV files.
That is: DiffKit does essentially what I proposed above, loading files into database tables (they use embedded H2) and then comparing these tables through DB queries.
They accept input as CSV files; but conversion from your textual input to their CSV can be done in a streaming fashion in less than 10 lines of code. And then you just need to call their jar to do the diff, and you would get the results as tables in their embedded DB.
Upvotes: 0
Reputation: 38444
First of all, don't do Files.readAllLines(Paths.get("filename"))
and then pass everything to a Set
, that holds unnecesserily huge amounts of data. Try to hold as few lines as possible at all times.
Read the files line-by-line and process as you go. This immediately cuts your memory usage by a lot.
Set<String> oldData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("oldData"))) {
for (String line = reader.readLine(); line != null; line = reader.readLine()) {
// process your line, maybe add to the Set for the old data?
oldData.add(line);
}
}
Set<String> newData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("newData"))) {
for (String line = reader.readLine(); line != null; line = reader.readLine()) {
// Is it enough just to remove from old data so that you'll end up with only the difference between old and new?
boolean oldRemoved = oldData.remove(line);
if (!oldRemoved) {
newData.add(line);
}
}
}
You'll end up with two sets containing only the data that is present in the old, or the new dataset, respectively.
Second of all, try to presize your containers if at all possible. Their size (usually) doubles when they reach their capacity, and that could potentially create a lot of overhead when dealing with big collections.
Also, if your data are numbers, you could just use a long
and hold that instead of trying to hold instances of String
? There's a lot of collection libraries that enable you to do this, e.g. Koloboke, HPPC, HPPC-RT, GS Collections, fastutil, Trove. Even their collections for Objects
might serve you very well as a standard HashSet
has a lot of unnecessary object allocation.
Upvotes: 3
Reputation: 2237
Pls split the strings into two and whatever part (str1 or str2) is repeated most use the intern() on it so to save duplication os same String again in Heap. Here i used intern() on both part just to show the sample but dont use it unless they are repeating most.
Set<MyObj> lineData = new HashSet<MyObj>();
String line = null;
BufferedReader bufferedReader = new BufferedReader(new FileReader(file.getAbsoluteFile()));
while((line = bufferedReader.readLine()) != null){
String[] data = line.split(",");
MyObj myObj = new MyObj();
myObj.setStr1(data[0].intern());
myObj.setStr1(data[1].intern());
lineData.add(myObj);
}
public class MyObj {
private String str1;
private String str2;
public String getStr1() {
return str1;
}
public void setStr1(String str1) {
this.str1 = str1;
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((str1 == null) ? 0 : str1.hashCode());
result = prime * result + ((str2 == null) ? 0 : str2.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
Test1 other = (Test1) obj;
if (str1 == null) {
if (other.str1 != null)
return false;
} else if (!str1.equals(other.str1))
return false;
if (str2 == null) {
if (other.str2 != null)
return false;
} else if (!str2.equals(other.str2))
return false;
return true;
}
public String getStr2() {
return str2;
}
public void setStr2(String str2) {
this.str2 = str2;
}
}
Upvotes: 0