Reputation: 174
This is the code I have written to validate two files by comparing them. Is there a more performant way to write it? Both files can contain millions of records, and I believe this approach will be slow in those cases.
I am thinking of using a hash map: each time a line occurs in a file, I add 1 to the value stored for that key; on its first occurrence the value is 1. If the record exists in the map for file 2, I remove it from the first map; if it doesn't, I add it to the map. This continues, alternating between the files, until the end.
I don't do a line-by-line comparison because the order of the lines may differ between the two files.
public static void main(String[] args) throws Exception {
BufferedReader br1 = null;
BufferedReader br2 = null;
BufferedWriter br3 = null;
String sCurrentLine;
int linelength;
List<String> list1 = new ArrayList<String>();
List<String> list2 = new ArrayList<String>();
List<String> unexpectedrecords = new ArrayList<String>();
br1 = new BufferedReader(new FileReader("expected.txt"));
br2 = new BufferedReader(new FileReader("actual.txt"));
while ((sCurrentLine = br1.readLine()) != null) {
list1.add(sCurrentLine);
}
while ((sCurrentLine = br2.readLine()) != null) {
list2.add(sCurrentLine);
}
List<String> expectedrecords = new ArrayList<String>(list1);
List<String> actualrecords = new ArrayList<String>(list2);
// Walk the expected records only: indexing both lists up to the longer
// length can run past the end of the shorter one.
for (String record : expectedrecords) {
// remove() deletes one matching occurrence and returns false if there is none.
if (!actualrecords.remove(record)) {
unexpectedrecords.add(record);
}
}
br3 = new BufferedWriter(new FileWriter(new File("c.txt")));
br3.write("Records which are not present in actual");
br3.newLine();
for (int x = 0; x < unexpectedrecords.size(); x++) {
br3.write(unexpectedrecords.get(x));
br3.newLine();
}
br3.write("Records which are in actual but not present in expected");
br3.newLine();
for (int i = 0; i < actualrecords.size(); i++) {
br3.write(actualrecords.get(i));
br3.newLine();
}
br3.flush();
br3.close();
}
Upvotes: 0
Views: 8436
Reputation: 23
On Unix/Linux computers, you can just call diff, which has been optimized for speed and memory usage.
The call looks like
String listFileDiffs = executeDiff(filenameWithPath1, filenameWithPath2);
The method is implemented by:
private String executeDiff(String filenameWithPath1, String filenameWithPath2) {
StringBuilder output = new StringBuilder();
try {
// Runtime.exec() does not go through a shell, so "> file" redirection is
// not interpreted. Use sort's -o option to write the sorted output instead.
new ProcessBuilder("sort", "-o", "/tmp/sort1file", filenameWithPath1).inheritIO().start().waitFor();
new ProcessBuilder("sort", "-o", "/tmp/sort2file", filenameWithPath2).inheritIO().start().waitFor();
Process diff = new ProcessBuilder("diff", "/tmp/sort1file", "/tmp/sort2file").start();
// Read the output before calling waitFor(); a large diff can otherwise
// fill the pipe and block the child process.
try (BufferedReader reader =
new BufferedReader(new InputStreamReader(diff.getInputStream()))) {
String line;
while ((line = reader.readLine()) != null) {
output.append(line).append("\n");
}
}
diff.waitFor();
} catch (Exception e) {
LOG.error("Error: executeDiff ", e);
}
return output.toString();
}
You can add flags to diff to get more information about the differences found.
The solution has been adapted to account for the arbitrary order of the lines in each file: the Unix sort is run on each of the two files, and diff is then run on the sorted results.
The Unix commands have matured over decades and are highly efficient.
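For readers who want to see what sort followed by a comparison amounts to, here is a minimal in-memory sketch of the same idea in Java: sort both inputs, then walk them in step. The class name SortedMergeDiff, the Result holder, and the sample lines are made up for illustration. Unlike the Unix sort, this version holds everything in memory, so it only shows the technique and is not a substitute for very large files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedMergeDiff {

    /** Hypothetical holder for the lines found on only one side. */
    static final class Result {
        final List<String> onlyInExpected = new ArrayList<>();
        final List<String> onlyInActual = new ArrayList<>();
    }

    static Result diff(List<String> expected, List<String> actual) {
        // Sort copies so the callers' lists are left untouched.
        List<String> a = new ArrayList<>(expected);
        List<String> b = new ArrayList<>(actual);
        Collections.sort(a);
        Collections.sort(b);

        Result r = new Result();
        int i = 0, j = 0;
        // Merge-style walk: equal heads cancel out, the smaller head is unmatched.
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {
                i++;
                j++;
            } else if (cmp < 0) {
                r.onlyInExpected.add(a.get(i++));
            } else {
                r.onlyInActual.add(b.get(j++));
            }
        }
        // Whatever remains on either side has no partner.
        while (i < a.size()) r.onlyInExpected.add(a.get(i++));
        while (j < b.size()) r.onlyInActual.add(b.get(j++));
        return r;
    }

    public static void main(String[] args) {
        Result r = diff(
                Arrays.asList("one", "two", "four", "five", "seven", "eight"),
                Arrays.asList("one", "two", "three", "five", "six"));
        System.out.println("Only in expected: " + r.onlyInExpected);
        System.out.println("Only in actual: " + r.onlyInActual);
    }
}
```

Because equal lines are consumed one pair at a time, duplicates are matched by count: a line that appears three times on one side and once on the other is reported twice.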
Upvotes: -1
Reputation: 66
I thought about it, and the HashMap solution is effectively instant on my test data. I went ahead and coded up an example of it here.
It ran in 0 ms, while the ArrayList version took 16 ms for the same dataset.
public static void main(String[] args) throws Exception {
BufferedReader br1 = null;
BufferedReader br2 = null;
BufferedWriter bw3 = null;
String sCurrentLine;
int linelength;
HashMap<String, Integer> expectedrecords = new HashMap<String, Integer>();
HashMap<String, Integer> actualrecords = new HashMap<String, Integer>();
br1 = new BufferedReader(new FileReader("expected.txt"));
br2 = new BufferedReader(new FileReader("actual.txt"));
while ((sCurrentLine = br1.readLine()) != null) {
if (expectedrecords.containsKey(sCurrentLine)) {
expectedrecords.put(sCurrentLine, expectedrecords.get(sCurrentLine) + 1);
} else {
expectedrecords.put(sCurrentLine, 1);
}
}
while ((sCurrentLine = br2.readLine()) != null) {
if (expectedrecords.containsKey(sCurrentLine)) {
int expectedCount = expectedrecords.get(sCurrentLine) - 1;
if (expectedCount == 0) {
expectedrecords.remove(sCurrentLine);
} else {
expectedrecords.put(sCurrentLine, expectedCount);
}
} else {
if (actualrecords.containsKey(sCurrentLine)) {
actualrecords.put(sCurrentLine, actualrecords.get(sCurrentLine) + 1);
} else {
actualrecords.put(sCurrentLine, 1);
}
}
}
// expected is left with all records not present in actual
// actual is left with all records not present in expected
bw3 = new BufferedWriter(new FileWriter(new File("c.txt")));
bw3.write("Records which are not present in actual\n");
for (String key : expectedrecords.keySet()) {
for (int i = 0; i < expectedrecords.get(key); i++) {
bw3.write(key);
bw3.newLine();
}
}
bw3.write("Records which are in actual but not present in expected\n");
for (String key : actualrecords.keySet()) {
for (int i = 0; i < actualrecords.get(key); i++) {
bw3.write(key);
bw3.newLine();
}
}
bw3.flush();
bw3.close();
}
ex:
expected.txt
one
two
four
five
seven
eight
actual.txt
one
two
three
five
six
c.txt
Records which are not present in actual
four
seven
eight
Records which are in actual but not present in expected
three
six
ex 2:
expected.txt
one
two
four
five
seven
eight
duplicate
duplicate
duplicate
actual.txt
one
duplicate
two
three
five
six
c.txt
Records which are not present in actual
four
seven
eight
duplicate
duplicate
Records which are in actual but not present in expected
three
six
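The same counting idea can also be written with a single map of net counts, which may be easier to reason about. This is only a sketch under assumptions: the class name CountingDiff and the method netCounts are invented for illustration, the inputs are in-memory lists rather than files, and a TreeMap is used purely to make the output order predictable (a HashMap would normally be faster).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CountingDiff {

    /**
     * Net count per line: +1 for each occurrence in expected,
     * -1 for each occurrence in actual. Lines that balance out are dropped.
     */
    static Map<String, Integer> netCounts(List<String> expected, List<String> actual) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : expected) {
            counts.merge(line, 1, Integer::sum);
        }
        for (String line : actual) {
            counts.merge(line, -1, Integer::sum);
        }
        // A net count of zero means the line occurs equally often in both inputs.
        counts.values().removeIf(c -> c == 0);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> result = netCounts(
                Arrays.asList("one", "two", "four", "five", "seven", "eight",
                        "duplicate", "duplicate", "duplicate"),
                Arrays.asList("one", "duplicate", "two", "three", "five", "six"));
        // Positive: missing from actual. Negative: extra in actual.
        System.out.println(result);
        // prints {duplicate=2, eight=1, four=1, seven=1, six=-1, three=-1}
    }
}
```

With real files, the two loops would read lines from a BufferedReader instead of iterating over lists, so only the lines that differ (plus those seen so far in the first file) need to be kept in memory.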
Upvotes: 3
Reputation: 614
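In Java 8 you can use Collection.removeIf(Predicate<? super E>). Take a snapshot of each list first, so that the second call still sees the original contents of list1:
Set<String> set1 = new HashSet<>(list1);
Set<String> set2 = new HashSet<>(list2);
list1.removeIf(set2::contains);
list2.removeIf(set1::contains);
list1 will then contain everything that is NOT in list2, and list2 will contain everything that is NOT in list1. Note that this compares by presence only, so differing duplicate counts are not reported.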
Upvotes: 1