Reputation: 1583
As part of a project I'm working on, I'd like to clean up a file I generate of duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while loop to compare each line in one file with the rest of the other). The problem is that my generated file is pretty big and text heavy (about 225k lines of text, and around 40 megs). I estimate my current process would take 63 hours! This is definitely not acceptable.
I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!
Upvotes: 27
Views: 50962
Reputation: 10595
These answers all rely on the file being small enough to store in memory.
If it is OK to sort the file, here is an algorithm that can be used on a file of any size.
You need this library: https://github.com/lemire/externalsortinginjava
I assume you start with a file fileDumpCsvFileUnsorted and you will end up with a new file fileDumpCsvFileSorted that is sorted and has no dupes.
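// uses ExternalSort from the externalsortinginjava library linked above, plus
// Apache Commons IO (LineIterator, FileUtils), Commons Lang (StringUtils) and an SLF4J-style logger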
ExternalSort.sort(fileDumpCsvFileUnsorted, fileDumpCsvFileSorted);
int numDupes = 0;
File dupesRemoved = new File(fileDumpCsvFileSorted.getAbsolutePath() + ".nodupes");
String previousLine = null;
try (FileWriter fw = new FileWriter(dupesRemoved);
     BufferedWriter bw = new BufferedWriter(fw);
     FileReader fr = new FileReader(fileDumpCsvFileSorted);
     LineIterator lineIterator = new LineIterator(fr)
) {
    while (lineIterator.hasNext()) {
        String nextLine = lineIterator.nextLine();
        if (StringUtils.equals(nextLine, previousLine)) {
            ++numDupes;
            continue;
        }
        bw.write(String.format("%s%n", nextLine));
        previousLine = nextLine;
    }
}
logger.info("Removed {} dupes from {}", numDupes, fileDumpCsvFileSorted.getAbsolutePath());
FileUtils.deleteQuietly(fileDumpCsvFileSorted);
FileUtils.moveFile(dupesRemoved, fileDumpCsvFileSorted);
The file fileDumpCsvFileSorted now exists, sorted and with no dupes.
Upvotes: 0
Reputation: 1
void deleteDuplicates(File filename) throws IOException {
    Set<String> lines = new LinkedHashSet<String>();
    System.out.println("Reading the contents now and writing the unique lines back to the file");
    try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line.trim());
        }
    }
    // write only after the reader is closed, so the file is not overwritten
    // while it is still being read
    try (BufferedWriter writer = new BufferedWriter(new FileWriter(filename))) {
        for (String unique : lines) {
            writer.write(unique);
            writer.newLine();
        }
    }
    System.out.println(lines);
    System.out.println("Duplicate removal successful");
}
Upvotes: 0
Reputation: 33
I have made two assumptions for this efficient solution: the hashmap must stay light, so it stores line lengths and file offsets rather than the lines themselves, and two lines can be compared as blobs directly from the file given their offsets.
Based on these assumptions, the solution is:
1. Read each line and save its length as the key in the hashmap, so we have a lighter hashmap. The entry for that key is the list of offsets of all the lines having that length. Building this hashmap is O(n).
2. While adding each line's offset to the hashmap, compare the line's blob with all existing entries in the list of lines (offsets) for that key length, skipping entries whose offset is -1. If a duplicate is found, remove both lines and save the offset -1 in those places in the list.
So, considering the complexity and memory usage:
Hashmap memory / space complexity: O(n), where n is the number of lines.
Time complexity: the worst case is no duplicates but all lines of equal length (length of each line = m, number of lines = n); then every line is compared against all earlier lines with the same key, i.e. O(n²) comparisons. Since we assume we can compare the blobs directly, m does not matter.
In other cases we save on comparisons, although we need a little extra space in the hashmap.
Additionally, we could use map-reduce on the server side to split the input and merge the results later, using the line length or the start of the line as the mapper key.
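A rough, illustrative Java sketch of that length-keyed offset map follows. All names are mine; unlike the description above it simply keeps the first occurrence of each line and writes the unique lines to a separate output file instead of marking offsets with -1, and it assumes an encoding for which RandomAccessFile.readLine() reproduces the original lines.

import java.io.*;
import java.util.*;

// Keys are line lengths, values are offsets of lines already kept; candidate
// lines are re-read from disk for the "blob" comparison.
class LengthKeyedDedup {
    static void dedup(File in, File out) throws IOException {
        Map<Integer, List<Long>> offsetsByLength = new HashMap<Integer, List<Long>>();
        try (RandomAccessFile raf = new RandomAccessFile(in, "r");
             BufferedWriter writer = new BufferedWriter(new FileWriter(out))) {
            while (true) {
                long lineStart = raf.getFilePointer();
                String line = raf.readLine();
                if (line == null) break;
                long afterLine = raf.getFilePointer();
                List<Long> offsets = offsetsByLength.get(line.length());
                if (offsets == null) {
                    offsets = new ArrayList<Long>();
                    offsetsByLength.put(line.length(), offsets);
                }
                boolean duplicate = false;
                for (long seen : offsets) {      // only lines of equal length are candidates
                    raf.seek(seen);
                    if (line.equals(raf.readLine())) {
                        duplicate = true;
                        break;
                    }
                }
                raf.seek(afterLine);             // restore the read position after the look-backs
                if (!duplicate) {
                    offsets.add(lineStart);
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }
}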
Upvotes: 0
Reputation: 533690
A similar approach:
// requires Apache Commons IO (org.apache.commons.io.IOUtils)
public void stripDuplicatesFromFile(String filename) throws IOException {
    IOUtils.writeLines(
            new LinkedHashSet<String>(IOUtils.readLines(new FileInputStream(filename))),
            "\n", new FileOutputStream(filename + ".uniq"));
}
Upvotes: 10
Reputation: 192015
Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.
It would be something like this (ignoring exceptions):
public void stripDuplicatesFromFile(String filename) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line);
    }
    reader.close();
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}
If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of the extra linked list should be insignificant compared to the actual amount of data.
Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.
Upvotes: 40
Reputation: 26682
Okay, most answers are a bit silly and slow since they involve adding lines to some hashset or whatever and then moving them back out of that set again. Let me show the optimal solution in pseudocode:
Create a hashset for just strings.
Open the input file.
Open the output file.
while not EOF(input)
    Read Line.
    If not (Line in hashset)
        Add Line to hashset.
        Write Line to output.
    End If.
End While.
Free hashset.
Close input.
Close output.
Please guys, don't make it more difficult than it needs to be. :-) Don't even bother about sorting, you don't need to.
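For reference, a direct Java rendering of that pseudocode could look like this (the file names are placeholders, not anything from the question):

import java.io.*;
import java.util.*;

class StreamingDedup {
    public static void main(String[] args) throws IOException {
        Set<String> seen = new HashSet<String>();
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("output.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (seen.add(line)) {   // add() returns false if the line was already in the set
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}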
Upvotes: 17
Reputation: 1555
Does it matter in which order the lines come, and how many duplicates are you counting on seeing?
If not, and if you're counting on a lot of dupes (i.e. a lot more reading than writing), I'd also think about parallelizing the hashset solution, with the hashset as a shared resource.
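If order really doesn't matter, a rough sketch of that idea might look like the following. It assumes the input has already been split into several chunk files (the splitting is not shown), and every name here is illustrative rather than anything from the question; note that the output order is arbitrary.

import java.io.*;
import java.util.*;
import java.util.concurrent.*;

// Several worker threads share one concurrent set; a line is written only by
// the thread that inserts it into the set first.
class ParallelDedup {
    static void dedup(List<File> chunks, File output) throws Exception {
        Set<String> seen = ConcurrentHashMap.newKeySet();          // shared, thread-safe
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(output)))) {
            List<Future<?>> tasks = new ArrayList<Future<?>>();
            for (File chunk : chunks) {
                tasks.add(pool.submit(() -> {
                    try (BufferedReader in = new BufferedReader(new FileReader(chunk))) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            if (seen.add(line)) {                  // true only for the first sighting
                                synchronized (out) { out.println(line); }
                            }
                        }
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }));
            }
            for (Future<?> t : tasks) t.get();                     // propagate worker failures
        } finally {
            pool.shutdown();
        }
    }
}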
Upvotes: 0
Reputation: 76077
The Hash Set approach is OK, but you can tweak it so that it doesn't store all the Strings in memory, only a logical pointer to each line's location in the file, so that you go back and read the actual value only when you need it.
Another creative approach is to append to each line its line number, then sort all the lines, remove the duplicates (ignoring the last token, which is the number), then sort the file again by that last token and strip it out in the output.
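A small in-memory illustration of the second idea follows; for a file that doesn't fit in memory both sorts would have to be external, and the separator character and all names here are mine.

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Tag each line with its number, sort by content, drop duplicates,
// then sort by the tag again to restore the original order.
class TagSortDedup {
    static void dedup(Path in, Path out) throws IOException {
        List<String> lines = Files.readAllLines(in);
        List<String> tagged = new ArrayList<String>();
        for (int i = 0; i < lines.size(); i++) {
            tagged.add(lines.get(i) + "\u0001" + i);    // separator assumed absent from the data
        }
        tagged.sort(Comparator.comparing(TagSortDedup::body));
        List<String> kept = new ArrayList<String>();
        String previous = null;
        for (String t : tagged) {
            if (!body(t).equals(previous)) kept.add(t); // the sort is stable, so the first copy survives
            previous = body(t);
        }
        kept.sort(Comparator.comparingInt(
                (String t) -> Integer.parseInt(t.substring(t.lastIndexOf('\u0001') + 1))));
        List<String> result = new ArrayList<String>();
        for (String t : kept) result.add(body(t));      // strip the tag again
        Files.write(out, result);
    }

    private static String body(String tagged) {
        return tagged.substring(0, tagged.lastIndexOf('\u0001'));
    }
}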
Upvotes: 1
Reputation: 288130
If the order does not matter, the simplest way is shell scripting:
<infile sort | uniq > outfile
Upvotes: 3
Reputation: 1178
There are two scalable solutions, where by scalable I mean disk-based rather than memory-based, depending on whether the procedure should be stable or not, where by stable I mean that the order after removing duplicates is preserved. If scalability isn't an issue, then simply use memory for the same sort of method.
For the non stable solution, first sort the file on the disk. This is done by splitting the file into smaller files, sorting the smaller chunks in memory, and then merging the files in sorted order, where the merge ignores duplicates.
The merge itself can be done using almost no memory, by comparing only the current line in each file, since the next line in a file is guaranteed to be greater than or equal to it.
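A rough Java sketch of that duplicate-skipping merge step, assuming the sorted chunk files already exist (the class and method names are mine):

import java.io.*;
import java.util.*;

// K-way merge over sorted chunk files that writes each distinct line once.
class DedupMerge {
    private static class Chunk {
        String line;
        final BufferedReader reader;
        Chunk(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }
    }

    static void merge(List<File> sortedChunks, File out) throws IOException {
        PriorityQueue<Chunk> heap = new PriorityQueue<Chunk>(Comparator.comparing((Chunk c) -> c.line));
        for (File f : sortedChunks) {
            Chunk c = new Chunk(new BufferedReader(new FileReader(f)));
            if (c.line != null) heap.add(c);
        }
        String last = null;                            // last line written to the output
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(out))) {
            while (!heap.isEmpty()) {
                Chunk c = heap.poll();
                if (!c.line.equals(last)) {            // equal lines surface consecutively, so dupes are skipped
                    writer.write(c.line);
                    writer.newLine();
                    last = c.line;
                }
                c.line = c.reader.readLine();          // advance this chunk
                if (c.line != null) heap.add(c); else c.reader.close();
            }
        }
    }
}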
The stable solution is slightly trickier. First, sort the file in chunks as before, but indicate in each line the original line number. Then, during the "merge" don't bother storing the result, just the line numbers to be deleted.
Then copy the original file line by line, ignoring the line numbers you have stored above.
Upvotes: 0
Reputation: 58685
If you could use UNIX shell commands you could do something like the following:
for (i = line 0 to end)
{
    sed "s/$i//2g"   # deletes all repeats
}
This would iterate through your whole file and only pass each unique occurrence once per sed call. This way you're not doing a bunch of searches you've done before.
Upvotes: 0
Reputation: 28875
Something like this, perhaps:
BufferedReader in = ...;
Set<String> lines = new LinkedHashSet<String>();
for (String line; (line = in.readLine()) != null; )
    lines.add(line); // does nothing if duplicate is already added
PrintWriter out = ...;
for (String line : lines)
    out.println(line);
LinkedHashSet keeps the insertion order, as opposed to HashSet, which (while being slightly faster for lookup/insert) will reorder all lines.
Upvotes: 4
Reputation: 1577
Try a simple HashSet that stores the lines you have already read. Then iterate over the file. If you come across duplicates they are simply ignored (as a Set can only contain every element once).
Upvotes: 2
Reputation: 43580
You could use a Set from the Collections framework to store the unique values seen as you read the file.
Set<String> uniqueStrings = new HashSet<String>();
// read your file, looping on newline, putting each line into variable 'thisLine'
uniqueStrings.add(thisLine);
// finish read
for (String uniqueString : uniqueStrings) {
    // do your processing for each unique String
    // i.e. System.out.println(uniqueString);
}
Upvotes: 3