kushi

Reputation: 389

File comparison - Contents may be unordered

The files under folder1 and folder2 have the same names, and I want to compare each pair of files. I am stuck on this. Is there a Java API for doing this comparison? The files may be huge.

Example:

folder1/file1
----------
kushi,metha,2
kushi,barun,1
arun,mital,3

folder2/file1
----------
arun,mital,3
kushi,metha,2
sheetal,kumar,3
kushi,barun,1

The comparison of folder1/file1 and folder2/file1 should return "sheetal kumar 3". I tried googling but was not able to find anything useful.

Upvotes: 1

Views: 871

Answers (3)

Cause Chung

Reputation: 1505

I encountered the same problem and wrote a comparison function:

/**
 * Compare two sequences of lines without considering order.
 * <p>
 * Input parameter will not be modified.
 */
public static <T> boolean isEqualWithoutOrder(final T[] lines1, final T[] lines2) {
    if (lines1 == null && lines2 == null) return true;
    if (lines1 == null) return false;
    if (lines2 == null) return false;
    if (lines1.length != lines2.length) return false;

    final int length = lines1.length;
    int equalCnt = 0;

    final boolean[] mask = new boolean[length];
    Arrays.fill(mask, true);

    for (int i = 0; i < lines2.length; i++) {
        final T line2 = lines2[i];
        for (int j = 0; j < lines1.length; j++) {
            final T line1 = lines1[j];
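            // Objects.equal is Guava's null-safe equality; java.util.Objects.equals behaves the same.
            if (mask[j] && Objects.equal(line1, line2)) {
                equalCnt++;
                mask[j] = false;

                // Once two equal lines are found, speculate that the following lines match too.
                // mask[j + 1] must still be unmatched, otherwise a line would be counted twice.
                while (j + 1 < length && i + 1 < length && mask[j + 1] &&
                        Objects.equal(lines1[j + 1], lines2[i + 1])) {
                    equalCnt++;
                    mask[j + 1] = false;
                    j++;
                    i++;
                }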

                break;
            }
        }
        if (equalCnt < i) return false;
    }
    return equalCnt == length;
}

Approaches built on the common collections may be slower. A speed comparison (Scala):

// lines1, lines2: Seq[String], each holding the same 100k random strings in a different order.
FastUtils.isEqualWithoutOrder(lines1.toArray, lines2.toArray) //97 ms
lines1.sorted == lines2.sorted //836 ms

Times were measured in a hot (already warmed-up) sbt environment.

(Disclaimer: I have only run some basic tests against this function.)

Upvotes: 0

Tejas Kale

Reputation: 415

I know this is not a pure Java solution, but if you have access to a *nix box:

sort file1 > sorted1; sort file2 > sorted2; comm -3 sorted1 sorted2

This prints the lines that appear in only one of the two files, which is exactly what you need.

Then take a look at this question on how you can run shell scripts from Java.

EDIT:

What I am trying to say is that computing the diff takes 2 steps:

  1. Sort both the files.
  2. Compare them line by line to find the differences.
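
If you would rather stay in Java, the two steps above could be sketched roughly as follows. This is only a minimal in-memory illustration: the class name SortedLineDiff, the method name diff and the "< " / "> " prefixes are made up for this example, and truly huge files would need an external (on-disk) sort instead of loading everything into lists.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of any library.
final class SortedLineDiff {

    /**
     * Returns the lines that occur in only one of the two inputs.
     * Lines only in the first input are prefixed with "< ",
     * lines only in the second input with "> ".
     */
    static List<String> diff(List<String> first, List<String> second) {
        // Step 1: sort copies of both inputs so the callers' lists stay untouched.
        List<String> a = new ArrayList<>(first);
        List<String> b = new ArrayList<>(second);
        Collections.sort(a);
        Collections.sort(b);

        // Step 2: walk both sorted lists in lockstep and report mismatches.
        List<String> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {          // same line on both sides: skip it
                i++;
                j++;
            } else if (cmp < 0) {    // line exists only in the first input
                result.add("< " + a.get(i++));
            } else {                 // line exists only in the second input
                result.add("> " + b.get(j++));
            }
        }
        // Whatever is left over on either side has no partner.
        while (i < a.size()) result.add("< " + a.get(i++));
        while (j < b.size()) result.add("> " + b.get(j++));
        return result;
    }
}
```

Because both lists are compared as sorted sequences, duplicate lines are handled too: a line that appears twice in one file and once in the other is reported once.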

Upvotes: 2

npinti

Reputation: 52185

Depending on what you mean by huge, you could use a HashSet: first go through one file and add each line to the set, then go through the other file and remove from the set each line you read. Any line of the second file that was not in the set exists only in that file, and whatever remains in the set at the end exists only in the first file. This assumes that each line is unique.
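
A rough sketch of this idea, assuming unique lines as stated. The names HashSetLineDiff, Result and diff are invented for the illustration, and the inputs are plain Iterable<String> so that the lines can come from anywhere:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of any library.
final class HashSetLineDiff {

    /** Lines found on only one side of the comparison. */
    static final class Result {
        final Set<String> onlyInFirst;
        final List<String> onlyInSecond;

        Result(Set<String> onlyInFirst, List<String> onlyInSecond) {
            this.onlyInFirst = onlyInFirst;
            this.onlyInSecond = onlyInSecond;
        }
    }

    /** Assumes every line is unique within its own input. */
    static Result diff(Iterable<String> first, Iterable<String> second) {
        // Pass 1: remember every line of the first input.
        Set<String> remaining = new HashSet<>();
        for (String line : first) {
            remaining.add(line);
        }

        // Pass 2: cross off each line of the second input.
        List<String> onlyInSecond = new ArrayList<>();
        for (String line : second) {
            // remove() returns false when the line was not in the set,
            // i.e. it never appeared in the first input.
            if (!remaining.remove(line)) {
                onlyInSecond.add(line);
            }
        }

        // Lines never crossed off exist only in the first input.
        return new Result(remaining, onlyInSecond);
    }
}
```

Only the first input is held in memory; the second one is streamed. For real files you could, for example, wrap a `Files.lines(path)` stream as `Iterable<String> lines = stream::iterator;` and pass that in, closing the stream afterwards.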

Upvotes: 0
