Quickest Way to Compare Two String Arrays

Question

Context

I've written a small Java app for basic testing of data migration from Oracle to Microsoft.

The app does the following things:

Queries the Oracle USER_TAB_COLUMNS table to gather details about each table and its fields.
Generates SELECT statements from the information gathered.
Runs the SELECT statements on both the Oracle and Microsoft versions of the database, saving the results as a String for each row within a Table object.
For each table, compares the rows to find discrepancies.
Outputs a text file for each table, listing the mismatched rows. (For analysis)

Issue

The issue I'm having is in the comparison of the two String arrays I have (Oracle rows and Microsoft rows). For some tables, there can be almost a million rows of data. My current code can match 1000 Oracle rows against the Microsoft rows within a few seconds, but over a million rows that time adds up.

Current Attempts at Fixing Issue

Concatenating to the 'row' String when reading in data rather than during comparison. (Before, I had each field as its own String and concatenated them before comparing.)
Breaking from the inner loop once a match has been found for a row.
Removing 'oracleTable.getRows().size()' from the loop so that this calculation is performed only once.

Ideas

Removing the row counter. (Would this make much of a difference? It's harder to observe the progress / speed without the counter, so it is hard to tell.)
Removing the matched Microsoft row from its List. (I thought it would be a good idea to remove the String from the List of Microsoft rows so that the same row isn't compared twice. I'm unsure whether this will add more processing than it saves, as it's difficult to remove from a List whilst iterating through it.)
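
For reference, the second idea can be sketched with an explicit Iterator, whose remove() method is the standard way to delete from a List during iteration. This is only a minimal illustration: the class and method names are made up here, and it assumes getRows() returns a modifiable List<String>. The nested loops remain, so the overall cost is still roughly rows × rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper class, not part of the original app.
class RemoveMatchedRows {

    // Returns the Oracle rows with no case-insensitive match in msRows.
    // Each matched Microsoft row is removed, so msRows is modified.
    static List<String> findUnmatched(List<String> oracleRows, List<String> msRows) {
        List<String> unmatched = new ArrayList<>();
        for (String or : oracleRows) {
            boolean matched = false;
            Iterator<String> it = msRows.iterator();
            while (it.hasNext()) {
                if (it.next().equalsIgnoreCase(or)) {
                    it.remove(); // safe removal while iterating
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                unmatched.add(or);
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        List<String> oracle = Arrays.asList("A|1", "c|3");
        List<String> ms = new ArrayList<>(Arrays.asList("a|1", "B|2"));
        System.out.println(findUnmatched(oracle, ms)); // prints [c|3]
        System.out.println(ms);                        // prints [B|2]
    }
}
```

Note that removing matched rows changes the meaning slightly: two identical Oracle rows now need two Microsoft rows to match. Removal from the middle of an ArrayList also shifts the remaining elements, so the saving may be smaller than expected.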

Code

        numRowsOracle = oracleTable.getRows().size();
        numRowsMicrosoft = msTable.getRows().size();

        int orRowCounter = 0;
        boolean matched;

        // Each Oracle Row
        for (String or : oracleTable.getRows()) {
            matched = false;
            orRowCounter++;

            if (orRowCounter % 1000 == 0) {
                System.out.println("Oracle Row: " + orRowCounter + " / "
                        + numRowsOracle);
            }

            // Each Microsoft Row
            for (String mr : msTable.getRows()) {
                if (mr.equalsIgnoreCase(or)) {
                    matched = true;
                    break;
                }
            }
            if (!matched) { // Adding row to list of unmatched
                unmatchedRowStrings.add(or);
            }
        }
        // Writing report on table.
        exportlogs.writeTableLog(td.getTableName(), unmatchedRowStrings
                .size(), unmatchedRowStrings, numRowsOracle,
                numRowsMicrosoft);
    }

Any suggestions on how to speed this up? I'd accept ideas not only for speeding up the comparison of the two arrays, but also for storing the data differently. I have not used other types of String storage, such as hashmaps. Would something different be quicker?

corsiKa · Accepted Answer

This is untested, so take this with a grain of salt, especially if you're using non-ASCII characters.
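
One concrete reason for caution with non-ASCII data, if rows are normalised with toLowerCase(): the no-argument version uses the JVM's default locale, so the same row can lowercase differently on different machines. The sketch below is an illustration only (the class name and sample row are made up); it passes Locale.ROOT to get locale-independent behaviour, which is an addition to the approach described here rather than part of it.

```java
import java.util.Locale;

// Hypothetical demo class, for illustration only.
class LowercaseLocaleDemo {

    // Locale-independent normalisation of a row string.
    static String normalize(String row) {
        return row.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String row = "TITLE|42";

        // Under Turkish rules, 'I' lowercases to dotless 'ı' (U+0131).
        System.out.println(row.toLowerCase(turkish).equals("title|42")); // prints false

        // With Locale.ROOT the result does not depend on the machine's locale.
        System.out.println(normalize(row).equals("title|42")); // prints true
    }
}
```

Even with a fixed locale, toLowerCase() and equalsIgnoreCase() are not guaranteed to agree on every Unicode character, so results for non-ASCII rows are worth spot-checking.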

You can make a lowercase (or uppercase) version of the data in a single pass and then use a HashSet to check each row against the other side.

// needs java.util.HashSet and java.util.Set

// make a single pass over oracle rows, so O(n)
Set<String> oracleLower = new HashSet<>();
for(String or : oracleTable.getRows()) {
    oracleLower.add(or.toLowerCase());
}

// make a single pass over msft rows, so O(n)
Set<String> msftLower = new HashSet<>();
for(String ms : msTable.getRows()) {
    msftLower.add(ms.toLowerCase());
}

// make a single pass over the lowercased oracle rows, again O(n)
for(String or : oracleLower) {
    // backed by a hash table, this has a constant time lookup
    if(!msftLower.contains(or)) {
        // note: this adds the lowercased form of the row
        unmatchedRowStrings.add(or);
    }
}

Each pass is O(n) because a hash table lookup takes constant time on average, so the whole comparison is O(n) rather than the O(n × m) of the nested loops. It does roughly double the space requirements, though, since a lowercase copy of every row is kept. If memory becomes a problem, make only one collection lowercase (probably msft) and lowercase the other one (probably oracle) inside the loop, so it would be more like for(String or : oracleTable.getRows()) { or = or.toLowerCase(); if(!msftLower.contains(or)) { ... } }
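
A self-contained sketch of that single-set variant follows. Like the code above, it is a sketch rather than a tested drop-in: the class and method names are hypothetical, it assumes getRows() yields a List<String>, and it uses Locale.ROOT so the lowercasing does not depend on the machine's default locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of the original app.
class RowComparer {

    // Returns the Oracle rows that have no case-insensitive match among msRows.
    static List<String> findUnmatched(List<String> oracleRows, List<String> msRows) {
        // One pass over the Microsoft rows builds the lookup set.
        Set<String> msLower = new HashSet<>();
        for (String ms : msRows) {
            msLower.add(ms.toLowerCase(Locale.ROOT));
        }

        // One pass over the Oracle rows; each lookup is O(1) on average.
        List<String> unmatched = new ArrayList<>();
        for (String or : oracleRows) {
            if (!msLower.contains(or.toLowerCase(Locale.ROOT))) {
                unmatched.add(or); // keep the row in its original casing
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        List<String> oracle = Arrays.asList("Alice|1", "BOB|2", "Carol|3");
        List<String> ms = Arrays.asList("alice|1", "bob|2");
        System.out.println(findUnmatched(oracle, ms)); // prints [Carol|3]
    }
}
```

Because only the Microsoft side is copied into a set, the Oracle rows keep their original casing and order in the report. Duplicate rows on the Microsoft side collapse to one set entry, so this checks presence, not counts.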
