Watt

Reputation: 3164

How to delete duplicate/aggregate rows faster in a file using Java (no DB)

I have a 2GB text file with 5 tab-delimited columns. A row counts as a duplicate only if 4 out of its 5 columns match.

Right now, I am de-duping by first loading each column into a separate List, then iterating through the lists, deleting duplicate rows as they are encountered and aggregating.

The problem: it is taking more than 20 hours to process one file. I have 25 such files to process.

Can anyone please share their experience of how they would go about doing such de-duping?

This de-duping code will be throwaway, so I was looking for a quick-and-dirty solution to get the job done as soon as possible.

Here is my pseudocode (roughly):

// Nested O(n^2) scan: each row is compared against every later row
for (int i = 0; i < rowCount; i++) {
    for (int j = i + 1; j < rowCount; j++) {
        if (col1List.get(i).equals(col1List.get(j))   // find duplicate
                && col2List.get(i).equals(col2List.get(j))
                && col3List.get(i).equals(col3List.get(j))
                && col4List.get(i).equals(col4List.get(j))) {
            col5List.set(i, col5List.get(i) + "+" + col5List.get(j));   // aggregate
            // ...then drop row j from all five lists
        }
    }
}

Duplicate example

A and B are duplicates: A=(1,1,1,1,1), B=(1,1,1,1,2), C=(2,1,1,1,1). The output would be A=(1,1,1,1,1+2) and C=(2,1,1,1,1). [Notice that B has been kicked out.]

Upvotes: 2

Views: 2065

Answers (5)

mihi

Reputation: 6735

The solutions already posted are nice if you have enough (free) RAM. As Java tends to "still work" even when it is swapping heavily, make sure there is not too much swap activity if you suspect that RAM is the limiting factor.

An easy "throwaway" solution in case you really have too little RAM is partitioning the file into multiple files first, depending on data in the first four columns (for example, if the third column values are more or less uniformly distributed, partition by the last two digits of that column). Just go over the file once, and write the records as you read them into 100 different files, depending on the partition value. This will need minimal amount of RAM, and then you can process the remaining files (that are only about 20MB each, if the partitioning values were well distributed) with a lot less required memory, and concatenate the results again.

Just to be clear: If you have enough RAM (don't forget that the OS wants to have some for disk cache and background activity too), this solution will be slower (maybe even by a factor of 2, since twice the amount of data needs to be read and written), but in case you are swapping to death, it might be a lot faster :-)

Upvotes: 0

Eric Grunzke

Reputation: 1657

A HashMap will be your best bet. In a single, constant time operation, you can both check for duplication and fetch the appropriate aggregation structure (a Set in my code). This means that you can traverse the entire file in O(n). Here's some example code:

public void aggregate() throws Exception
  {
    BufferedReader bigFile = new BufferedReader(new FileReader("path/to/file.csv"));

    // Notice the parameter for initial capacity. Use something that is large enough to prevent rehashings.
    Map<String, HashSet<String>> map = new HashMap<String, HashSet<String>>(500000);

    String line;
    while ((line = bigFile.readLine()) != null)
    {
      int lastTab = line.lastIndexOf('\t');
      String firstFourColumns = line.substring(0, lastTab);

      // See if the map already contains an entry for the first 4 columns
      HashSet<String> set = map.get(firstFourColumns);

      // If set is null, then the map hasn't seen these columns before
      if (set==null)
      {
        // Make a new Set (for aggregation), and add it to the map
        set = new HashSet<String>();
        map.put(firstFourColumns, set);
      }

      // At this point we either found set or created it ourselves
      String lastColumn = line.substring(lastTab+1);
      set.add(lastColumn);
    }
    bigFile.close();

    // A demo that shows how to iterate over the map and set structures
    for (Map.Entry<String, HashSet<String>> entry : map.entrySet())
    {
      String firstFourColumns = entry.getKey();
      System.out.print(firstFourColumns + "=");

      HashSet<String> aggregatedLastColumns = entry.getValue();
      for (String column : aggregatedLastColumns)
      {
        System.out.print(column + ",");
      }
      System.out.println("");
    }
  }

A few points:

  • The initialCapacity parameter for the HashMap is important. If the number of entries gets bigger than the capacity (times the load factor), then the structure is rehashed, which is very slow. The default initial capacity is 16, which will cause many rehashes for you. Pick a value that you know is greater than the number of unique sets of the first four columns.
  • If ordered output in the aggregation is important, you can switch the HashSet for a TreeSet.
  • This implementation will use a lot of memory. If your text file is 2GB, then you'll probably need a lot of RAM in the JVM. You can add the JVM arg -Xmx4096m to increase the maximum heap size to 4GB. If you don't have at least 4GB, this probably won't work for you.
  • This is also a parallelizable problem, so if you're desperate you could thread it. That would be a lot of effort for throw-away code, though. [Edit: This point is likely not true, as pointed out in the comments]

Upvotes: 3

kozyr

Reputation: 1254

I would do something similar to Eric's solution, but instead of storing the actual strings in the HashMap, I'd just store line numbers. So for a particular four-column hash, you'd store a list of the line numbers which hash to that value. Then, on a second pass through the data, you can remove the duplicates at those line numbers and apply the +x aggregation as needed.

This way, your memory requirements will be a LOT smaller.
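
A sketch of this two-pass idea is below. It is a slight variation that keeps only a count per four-column hash instead of the full line-number lists, but the memory saving is the same: the full strings are only held for lines whose hash actually collides. The file paths and the 500000 capacity are placeholders:

import java.io.*;
import java.util.*;

public class TwoPassDedup
{
  public static void main(String[] args) throws IOException
  {
    String path = "path/to/file.tsv";

    // Pass 1: count how many lines share each hash of the first four columns.
    Map<Integer, Integer> hashCounts = new HashMap<Integer, Integer>(500000);
    BufferedReader in = new BufferedReader(new FileReader(path));
    String line;
    while ((line = in.readLine()) != null)
    {
      int hash = line.substring(0, line.lastIndexOf('\t')).hashCode();
      Integer count = hashCounts.get(hash);
      hashCounts.put(hash, count == null ? 1 : count + 1);
    }
    in.close();

    // Pass 2: lines with a unique hash cannot have duplicates, so stream them straight
    // to the output. Only the (much smaller) set of colliding lines is aggregated in memory.
    Map<String, StringBuilder> collisions = new LinkedHashMap<String, StringBuilder>();
    in = new BufferedReader(new FileReader(path));
    PrintWriter out = new PrintWriter(new FileWriter("deduped.tsv"));
    while ((line = in.readLine()) != null)
    {
      int lastTab = line.lastIndexOf('\t');
      String key = line.substring(0, lastTab);
      String col5 = line.substring(lastTab + 1);
      if (hashCounts.get(key.hashCode()) == 1)
      {
        out.println(line);                // no possible duplicate
      }
      else
      {
        StringBuilder agg = collisions.get(key);
        if (agg == null)
        {
          collisions.put(key, new StringBuilder(col5));
        }
        else
        {
          agg.append('+').append(col5);   // aggregate the fifth column
        }
      }
    }
    in.close();

    for (Map.Entry<String, StringBuilder> e : collisions.entrySet())
    {
      out.println(e.getKey() + '\t' + e.getValue());
    }
    out.close();
  }
}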

Upvotes: 0

Peter Lawrey

Reputation: 533790

I would use a HashSet of the records. This brings the time down to O(n) instead of O(n^2). You can create a class which holds each of the fields, with one instance per row.
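
For example, a key class covering the four matching columns might look like the sketch below (the class and field names are placeholders). Its equals() and hashCode() over those fields are what let a HashSet, or a HashMap if you also need the aggregation, detect a duplicate row in constant time:

import java.util.Arrays;

final class RowKey
{
  private final String col1, col2, col3, col4;

  RowKey(String col1, String col2, String col3, String col4)
  {
    this.col1 = col1; this.col2 = col2; this.col3 = col3; this.col4 = col4;
  }

  @Override
  public boolean equals(Object o)
  {
    if (!(o instanceof RowKey)) return false;
    RowKey other = (RowKey) o;
    return col1.equals(other.col1) && col2.equals(other.col2)
        && col3.equals(other.col3) && col4.equals(other.col4);
  }

  @Override
  public int hashCode()
  {
    // Combine the four fields that define a duplicate
    return Arrays.hashCode(new Object[] { col1, col2, col3, col4 });
  }
}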

You need to have a decent amount of memory, but 16 to 32 GB is pretty cheap these days.

Upvotes: 1

Paul Tomblin

Reputation: 182840

I would sort the whole list on the first four columns, and then traverse the list knowing that all the duplicates are together. This would give you O(N log N) for the sort and O(N) for the traversal, rather than O(N^2) for your nested loops.
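
Assuming the rows fit in memory as tab-split String[] arrays (columns 0-3 are the key, column 4 is aggregated), a sketch of the sort-then-merge pass might look like this; the class and method names are placeholders:

import java.util.*;

public class SortAndMerge
{
  static List<String[]> aggregate(List<String[]> rows)
  {
    // Sort on the first four columns: O(N log N)
    Collections.sort(rows, new Comparator<String[]>()
    {
      public int compare(String[] a, String[] b)
      {
        for (int i = 0; i < 4; i++)
        {
          int cmp = a[i].compareTo(b[i]);
          if (cmp != 0) return cmp;
        }
        return 0;
      }
    });

    // One linear pass: duplicates are now adjacent, so fold each run into a single row.
    List<String[]> result = new ArrayList<String[]>();
    for (String[] row : rows)
    {
      String[] last = result.isEmpty() ? null : result.get(result.size() - 1);
      if (last != null && last[0].equals(row[0]) && last[1].equals(row[1])
          && last[2].equals(row[2]) && last[3].equals(row[3]))
      {
        last[4] = last[4] + "+" + row[4];   // aggregate the fifth column
      }
      else
      {
        result.add(row.clone());
      }
    }
    return result;
  }
}

With the example from the question, this yields (1,1,1,1,1+2) and (2,1,1,1,1).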

Upvotes: 1
