Reputation: 87
I have two files which are very large in size say 50000 lines each. I need to compare these two files and identify the changes. However, the catch is if a line is present at different position, it should not be shown as different.
For eg, consider this
File A.txt
xxxxx
yyyyy
zzzzz
File B.txt
zzzzz
xxxx
yyyyy
So if this is the content of the file. My code should give the output as xxxx(or both xxxx and xxxxx).
Ofcourse the easiest way would be storing each line of the file in a
List< String>
and comparing with the other
List< String>.
But this seems to be taking a lot of time. I have also tried using the DiffUtils in java. But it doesnt recognize the lines present in diferent line numbers as same. So is there any other algorithm that might help me?
Upvotes: 4
Views: 5482
Reputation: 1
I think this will be useful,
BufferedReader reader1 = new BufferedReader(new FileReader("C:\\file1.txt"));
BufferedReader reader2 = new BufferedReader(new FileReader("C:\\file2.txt"));
String line1 = reader1.readLine();
String line2 = reader2.readLine();
boolean areEqual = true;
int lineNum = 1;
while (line1 != null || line2 != null)
{
if(line1 == null || line2 == null)
{
areEqual = false;
break;
}
else if(! line1.equalsIgnoreCase(line2))
{
areEqual = false;
break;
}
line1 = reader1.readLine();
line2 = reader2.readLine();
lineNum++;
}
if(areEqual)
{
System.out.println("Two files have same content.");
}
else
{
System.out.println("Two files have different content. They differ at line "+lineNum);
System.out.println("File1 has "+line1+" and File2 has "+line2+" at line "+lineNum);
}
reader1.close();
reader2.close();
Upvotes: 0
Reputation: 26856
You need to keep track of the case where the same record might appear more than once in the files. For example, if a record appears twice in file A and once in file B, then you need to record that as an extra record.
Since we have to keep track of the number of occurrences, you need one of:
With a Multiset, you can add and remove records and it will keep track of the number of times the record has been added (a Set doesn't do that - it rejects an add of a record that is already there). With the Map approach, you have to do a little bit of work so that the integer tracks the number of occurrences. let's consider that approach (the MultiSet is simpler).
With the map, when we talk about 'adding' a record, you look to see if there is an entry for that String in the Map. if there is, replace the value with value+1 for that key. If there isn't, create an entry with the value of 1. When we talk about 'removing an entry', look for an entry for that key. If you find it, replace the value with value-1. If that reduces the value to 0, remove the entry.
The contents of the two Maps will give you the records that appeared in that file but not the other.
Doing this as we go along, rather than building the Maps up front, keeps the memory usage down, but probably doesn't have a big impact on performance.
Upvotes: 1
Reputation: 307
In general HashSet would be the best solution, but as we are dealing with strings there are two possible solutions:
saving one file as HashSet and trying to find the lines of other file in it.
saving one file as Trie and trying to find the lines of other file in it
In this post you can find comparison between HashSets and Tries How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?
Upvotes: 2
Reputation: 5423
probably using Set
is the easiest way:
Set<String> set1 = new HashSet<String>(FileUtils.readLines(file1));
Set<String> set2 = new HashSet<String>(FileUtils.readLines(file2));
Set<String> similars = new HashSet<String>(set1);
similars.retainAll(set2);
set1.removeAll(similars); //now set1 contains distinct lines in file1
set2.removeAll(similars); //now set2 contains distinct lines in file2
System.out.println(set1); //prints distinct lines in file1;
System.out.println(set2); //prints distinct lines in file2
Upvotes: 1
Reputation: 17
You can use FileUtils.contentEquals(file1, file2)
It will compare the contents of the 2 files.
Find more information here
Upvotes: -1
Reputation: 2595
Just use a byte comparison with BufferedReader. This will be the fastest way to compare two files. Read a byte block from one file and compare it with the byte block of the other file. First check if the file length is the same.
Or just use FileUtils.contentEquals(file1, file2);
from org.apache.commons.io.FileUtils
.
Upvotes: -1