Reputation: 3164
I have a 2GB text file with 5 tab-delimited columns. A row counts as a duplicate only if the first 4 of its 5 columns match another row.
Right now I am de-duplicating by first loading each column into a separate List, then iterating through the lists, deleting duplicate rows as they are encountered and aggregating.
The problem: it is taking more than 20 hours to process one file, and I have 25 such files to process.
Can anyone share how they would go about this kind of de-duplication?
This de-duplication code is throwaway, so I am looking for a quick and dirty solution that gets the job done as soon as possible.
Here is my pseudocode (roughly):

iterate over the rows
    i = current_row_no
    iterate over rows i+1 to last_row
        if (col1 matches    // find duplicate
            && col2 matches
            && col3 matches
            && col4 matches)
        {
            col5List.set(i, col5List.get(i) + col5);    // aggregate the duplicate's 5th column onto row i
        }
Duplicate example
Given A=(1,1,1,1,1), B=(1,1,1,1,2) and C=(2,1,1,1,1), A and B are duplicates of each other, and the output would be A=(1,1,1,1,1+2) and C=(2,1,1,1,1). Notice that B has been dropped and its 5th column aggregated into A.
Upvotes: 2
Views: 2065
Reputation: 6735
The solutions already posted are nice if you have enough (free) RAM. Since Java tends to "still work" even when it is swapping heavily, check for swap activity if you suspect RAM could be the limiting factor.
An easy "throwaway" solution in case you really have too little RAM is partitioning the file into multiple smaller files first, based on the data in the first four columns (for example, if the third column's values are more or less uniformly distributed, partition by the last two digits of that column). Just go over the file once and write each record, as you read it, into one of 100 different files depending on its partition value. This needs a minimal amount of RAM, and you can then process each remaining file (only about 20MB each, if the partitioning values were well distributed) with far less memory and concatenate the results again.
Just to be clear: If you have enough RAM (don't forget that the OS wants to have some for disk cache and background activity too), this solution will be slower (maybe even by a factor of 2, since twice the amount of data needs to be read and written), but in case you are swapping to death, it might be a lot faster :-)
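For illustration, here is a minimal sketch of that single partitioning pass in Java. The file names and the bucket count of 100 are my own placeholders, and instead of partitioning on digits of one column it hashes the first four columns, which also guarantees that duplicate rows land in the same bucket:

import java.io.*;

public class Partitioner
{
    public static void main(String[] args) throws IOException
    {
        int buckets = 100;  // assumption: ~100 partition files of roughly 20MB each
        BufferedWriter[] writers = new BufferedWriter[buckets];
        for (int i = 0; i < buckets; i++)
            writers[i] = new BufferedWriter(new FileWriter("bucket_" + i + ".txt"));

        BufferedReader in = new BufferedReader(new FileReader("path/to/file.txt"));
        String line;
        while ((line = in.readLine()) != null)
        {
            // Key = everything before the last tab, i.e. the first four columns
            String key = line.substring(0, line.lastIndexOf('\t'));
            int bucket = Math.floorMod(key.hashCode(), buckets);
            writers[bucket].write(line);
            writers[bucket].newLine();
        }
        in.close();

        for (BufferedWriter w : writers)
            w.close();
    }
}

Each bucket file can then be de-duplicated independently with the in-memory approach from the other answers.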
Upvotes: 0
Reputation: 1657
A HashMap will be your best bet. In a single, constant time operation, you can both check for duplication and fetch the appropriate aggregation structure (a Set in my code). This means that you can traverse the entire file in O(n). Here's some example code:
public void aggregate() throws Exception
{
    // Requires java.io.BufferedReader, java.io.FileReader, java.util.HashMap, java.util.HashSet, java.util.Map
    BufferedReader bigFile = new BufferedReader(new FileReader("path/to/file.csv"));

    // Notice the parameter for initial capacity. Use something that is large enough to prevent rehashings.
    Map<String, HashSet<String>> map = new HashMap<String, HashSet<String>>(500000);

    String line;
    while ((line = bigFile.readLine()) != null)
    {
        int lastTab = line.lastIndexOf('\t');
        String firstFourColumns = line.substring(0, lastTab);

        // See if the map already contains an entry for the first 4 columns
        HashSet<String> set = map.get(firstFourColumns);

        // If set is null, then the map hasn't seen these columns before
        if (set == null)
        {
            // Make a new Set (for aggregation), and add it to the map
            set = new HashSet<String>();
            map.put(firstFourColumns, set);
        }

        // At this point we either found set or created it ourselves
        String lastColumn = line.substring(lastTab + 1);
        set.add(lastColumn);
    }
    bigFile.close();

    // A demo that shows how to iterate over the map and set structures
    for (Map.Entry<String, HashSet<String>> entry : map.entrySet())
    {
        String firstFourColumns = entry.getKey();
        System.out.print(firstFourColumns + "=");
        HashSet<String> aggregatedLastColumns = entry.getValue();
        for (String column : aggregatedLastColumns)
        {
            System.out.print(column + ",");
        }
        System.out.println("");
    }
}
A few points:
You will probably need to launch the JVM with -Xmx4096m to increase the maximum heap size to 4GB. If you don't have at least 4GB of RAM, this probably won't work for you.
Upvotes: 3
Reputation: 1254
I would do something similar to Eric's solution, but instead of storing the actual strings in the HashMap, I'd just store line numbers. So for a particular four-column hash, you'd store a list of the line numbers which hash to that value. Then, on a second pass through the data, you can remove the duplicates at those line numbers and add the "+x" aggregation as needed.
This way, your memory requirements will be a LOT smaller.
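A rough sketch of the first pass under that scheme, mapping the hash of the first four columns to the line numbers that share it (class and file names are illustrative, not from the original post):

import java.io.*;
import java.util.*;

public class LineNumberIndex
{
    public static void main(String[] args) throws IOException
    {
        // Pass 1: hash of the first four columns -> line numbers that share that hash
        Map<Integer, List<Integer>> index = new HashMap<Integer, List<Integer>>();
        BufferedReader in = new BufferedReader(new FileReader("path/to/file.txt"));
        String line;
        int lineNo = 0;
        while ((line = in.readLine()) != null)
        {
            int hash = line.substring(0, line.lastIndexOf('\t')).hashCode();
            List<Integer> lines = index.get(hash);
            if (lines == null)
            {
                lines = new ArrayList<Integer>();
                index.put(hash, lines);
            }
            lines.add(lineNo++);
        }
        in.close();

        // Any hash with more than one line number is a candidate duplicate group.
        // A second pass would re-read those lines, confirm the four columns really
        // match (hash collisions are possible), and merge the fifth columns.
        for (Map.Entry<Integer, List<Integer>> entry : index.entrySet())
        {
            if (entry.getValue().size() > 1)
                System.out.println("candidate duplicates at lines " + entry.getValue());
        }
    }
}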
Upvotes: 0
Reputation: 533790
I would use a HashSet of the records. This can bring the time down to O(n) instead of O(n^2). Create a class with a field for each column and one instance per row, with equals() and hashCode() based on the first four columns.
You need a decent amount of memory, but 16 to 32 GB is pretty cheap these days.
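A minimal sketch of such a record class, with equality restricted to the first four columns so the set collapses duplicates the way the question defines them (the class and field names are mine):

import java.util.*;

public class Row
{
    final String col1, col2, col3, col4;
    String col5;

    Row(String col1, String col2, String col3, String col4, String col5)
    {
        this.col1 = col1; this.col2 = col2; this.col3 = col3; this.col4 = col4;
        this.col5 = col5;
    }

    @Override
    public boolean equals(Object o)
    {
        if (!(o instanceof Row)) return false;
        Row r = (Row) o;
        return col1.equals(r.col1) && col2.equals(r.col2)
            && col3.equals(r.col3) && col4.equals(r.col4);
    }

    @Override
    public int hashCode()
    {
        return Arrays.hashCode(new String[] { col1, col2, col3, col4 });
    }

    public static void main(String[] args)
    {
        Set<Row> seen = new HashSet<Row>();
        System.out.println(seen.add(new Row("1", "1", "1", "1", "1")));  // true  - first occurrence is kept
        System.out.println(seen.add(new Row("1", "1", "1", "1", "2")));  // false - duplicate on the first four columns
    }
}

To aggregate the fifth column rather than just drop duplicates, a HashMap<Row, Row> keyed the same way would let you reach the surviving row and append to its col5.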
Upvotes: 1
Reputation: 182840
I would sort the whole list on the first four columns, and then traverse through the list knowing that all the duplicates are together. This would give you O(NlogN) for the sort and O(N) for the traverse, rather than O(N^2) for your nested loops.
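A sketch of that approach, assuming the rows fit in memory once loaded (for a 2GB file that means a large heap; otherwise an external sort, e.g. the Unix sort utility, would do the O(NlogN) part). File name and class name are placeholders:

import java.io.*;
import java.util.*;

public class SortAndScan
{
    // Key = everything before the last tab, i.e. the first four columns
    static String key(String line)
    {
        return line.substring(0, line.lastIndexOf('\t'));
    }

    public static void main(String[] args) throws IOException
    {
        List<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader("path/to/file.txt"));
        String line;
        while ((line = in.readLine()) != null)
            lines.add(line);
        in.close();

        // O(NlogN): sorting on the key makes all rows with identical first four columns adjacent
        Collections.sort(lines, new Comparator<String>()
        {
            public int compare(String a, String b) { return key(a).compareTo(key(b)); }
        });

        // O(N): single scan; keep the first row of each group (merging the fifth
        // columns here instead would give the aggregation the question asks for)
        String previousKey = null;
        for (String s : lines)
        {
            String k = key(s);
            if (!k.equals(previousKey))
                System.out.println(s);
            previousKey = k;
        }
    }
}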
Upvotes: 1