Reputation: 533
Context
I've written a small Java app for basic testing of data migration from Oracle to Microsoft.
The app does the following things:
Issue
The issue I'm having is in the comparison of the two String arrays I have (Oracle Rows and Microsoft Rows). For some tables, there can be almost a million rows of data. Though my current code can match 1000 Oracle rows to Microsoft ones within a few seconds - the time adds up.
Current Attempts at Fixing Issue
Ideas
Code
numRowsOracle = oracleTable.getRows().size();
numRowsMicrosoft = msTable.getRows().size();
int orRowCounter = 0;
boolean matched;
// Each Oracle Row
for (String or : oracleTable.getRows()) {
matched = false;
orRowCounter++;
if (orRowCounter % 1000 == 0) {
System.out.println("Oracle Row: " + orRowCounter + " / "
+ numRowsOracle);
}
// Each Microsoft Row
for (String mr : msTable.getRows()) {
if (mr.equalsIgnoreCase(or)) {
matched = true;
break;
}
}
if (!matched) { // Adding row to list of unmatched
unmatchedRowStrings.add(or);
}
}
// Writing report on table.
exportlogs.writeTableLog(td.getTableName(), unmatchedRowStrings
.size(), unmatchedRowStrings, numRowsOracle,
numRowsMicrosoft);
}
Any suggestions on how to speed this up? I'd accept ideas not only speeding up comparing the two arrays, but also storing the data differently. I have not used other types of String storage, such as hashmaps. Would something different be quicker?
Upvotes: 2
Views: 317
Reputation: 82579
This is untested, so take this with a grain of salt, especially if you're using non-ascii characters.
You can make a lowercase (or uppercase) verison of the data in a single pass and then use a hashset to validate them.
// make a single pass over oracle rows, so O(n)
Set<String> oracleLower = new HashSet<>();
for(String or : oracleTable.getRows()) {
oracleLower.add(or.toLowerCase());
}
// make a single pass over msft rows, so O(n)
Set<String> msftLower = new HashSet<>();
for(String ms : microsoftTable.getRows()) {
msftLower.add(ms.toLowerCase());
}
// make a single pass over oracle rows, again O(n)
for(String or : oracleLower) {
// backed by a hash table, this has a constant time lookup
if(!msftLower.contains(or)) {
unmatched.add(or);
}
}
Each operation is O(n), thanks to the hash table. This requires double the space requirements, though. Optimizations may be necessary to only make one collection lowercase (probably msft) and only make the other one (probably oracle) lowercase inside the loop - then it would be more like for(String or : oracleTable.getRows()) { or = or.toLowerCase(); if(!msftLower.contains(or)) { ... } }
Upvotes: 2