saru10
saru10

Reputation: 87

What is the fastest way to compare two text files, not counting moved lines as different

I have two files which are very large in size say 50000 lines each. I need to compare these two files and identify the changes. However, the catch is if a line is present at different position, it should not be shown as different.

For eg, consider this
File A.txt

xxxxx
yyyyy
zzzzz    

File B.txt

zzzzz
xxxx
yyyyy  

So if this is the content of the file. My code should give the output as xxxx(or both xxxx and xxxxx).

Ofcourse the easiest way would be storing each line of the file in a

List< String>

and comparing with the other

List< String>.

But this seems to be taking a lot of time. I have also tried using the DiffUtils in java. But it doesnt recognize the lines present in diferent line numbers as same. So is there any other algorithm that might help me?

Upvotes: 4

Views: 5482

Answers (7)

Jai Kumar
Jai Kumar

Reputation: 1

I think this will be useful,

   BufferedReader reader1 = new BufferedReader(new FileReader("C:\\file1.txt"));

    BufferedReader reader2 = new BufferedReader(new FileReader("C:\\file2.txt"));

    String line1 = reader1.readLine();

    String line2 = reader2.readLine();

    boolean areEqual = true;

    int lineNum = 1;

    while (line1 != null || line2 != null)
    {
        if(line1 == null || line2 == null)
        {
            areEqual = false;

            break;
        }
        else if(! line1.equalsIgnoreCase(line2))
        {
            areEqual = false;

            break;
        }

        line1 = reader1.readLine();

        line2 = reader2.readLine();

        lineNum++;
    }

    if(areEqual)
    {
        System.out.println("Two files have same content.");
    }
    else
    {
        System.out.println("Two files have different content. They differ at line "+lineNum);

        System.out.println("File1 has "+line1+" and File2 has "+line2+" at line "+lineNum);
    }

    reader1.close();

    reader2.close();

Upvotes: 0

DJClayworth
DJClayworth

Reputation: 26856

You need to keep track of the case where the same record might appear more than once in the files. For example, if a record appears twice in file A and once in file B, then you need to record that as an extra record.

Since we have to keep track of the number of occurrences, you need one of:

  1. A Multiset
  2. A Map from record to Integer e.g. Map

With a Multiset, you can add and remove records and it will keep track of the number of times the record has been added (a Set doesn't do that - it rejects an add of a record that is already there). With the Map approach, you have to do a little bit of work so that the integer tracks the number of occurrences. let's consider that approach (the MultiSet is simpler).

With the map, when we talk about 'adding' a record, you look to see if there is an entry for that String in the Map. if there is, replace the value with value+1 for that key. If there isn't, create an entry with the value of 1. When we talk about 'removing an entry', look for an entry for that key. If you find it, replace the value with value-1. If that reduces the value to 0, remove the entry.

  1. Create a Map for each file.
  2. Read a record for one of the files
  3. Check to see if that record exists in the other Map.
  4. If it exists in the other Map, remove that entry (see above for what that means)
  5. If it doesn't exist, add it to the Map for this file (see above)
  6. Repeat until end, alternating files.

The contents of the two Maps will give you the records that appeared in that file but not the other.

Doing this as we go along, rather than building the Maps up front, keeps the memory usage down, but probably doesn't have a big impact on performance.

Upvotes: 1

pointer
pointer

Reputation: 307

In general HashSet would be the best solution, but as we are dealing with strings there are two possible solutions:

  1. saving one file as HashSet and trying to find the lines of other file in it.

  2. saving one file as Trie and trying to find the lines of other file in it

In this post you can find comparison between HashSets and Tries How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

Upvotes: 2

nafas
nafas

Reputation: 5423

probably using Set is the easiest way:

Set<String> set1 = new HashSet<String>(FileUtils.readLines(file1));

Set<String> set2 = new HashSet<String>(FileUtils.readLines(file2));


Set<String> similars = new HashSet<String>(set1);

similars.retainAll(set2);

set1.removeAll(similars); //now set1 contains distinct lines in file1
set2.removeAll(similars); //now set2 contains distinct lines in file2
System.out.println(set1); //prints distinct lines in file1;
System.out.println(set2); //prints distinct lines in file2

Upvotes: 1

vbh
vbh

Reputation: 17

You can use FileUtils.contentEquals(file1, file2)

It will compare the contents of the 2 files.

Find more information here

Upvotes: -1

marc3l
marc3l

Reputation: 2595

Just use a byte comparison with BufferedReader. This will be the fastest way to compare two files. Read a byte block from one file and compare it with the byte block of the other file. First check if the file length is the same.

Or just use FileUtils.contentEquals(file1, file2); from org.apache.commons.io.FileUtils.

Upvotes: -1

Pandatyr
Pandatyr

Reputation: 294

You could try parsing your first file first, storing all of the lines in a HashMap and then checking whether there is a mapping present for each of the lines of the second file.

This is still O(n), though.

Upvotes: -1

Related Questions