user31641

Reputation: 175

Find common lines between two files and also their line number

I want to find the common lines between two large files, one with 90 million lines and one with 100 thousand, along with their line numbers.

comm -12 file1 file2

gives me the common lines, but I also want to know each line's number in the individual files.

Upvotes: 6

Views: 2235

Answers (3)

carl.anderson

Reputation: 1118

This solution works for me on my small test files - I'm not sure how it will perform on a file with 90 million lines.

tab=` printf '\t' `
join -t"$tab" -j2 <( cat -n file1 ) <( cat -n file2 )

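This works because cat -n prepends a space-padded line number followed by a tab character to each line. join then matches lines on the second tab-separated field, which is everything after that first tab. Note that join expects both inputs to be sorted on the join field; comm has the same requirement, so files that already work with comm -12 should qualify.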

After the join is complete, you should see the common lines, each followed by two numbers. The first number is the line number from file1 and the second from file2.
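To make the output format concrete, here is a minimal sketch on two tiny sorted sample files. The file contents and the temporary directory are made up for illustration. It uses temporary files instead of process substitution so that it runs in a plain POSIX shell, spells -j2 as -1 2 -2 2, and adds a final awk step (not part of the command above) that only strips the padding cat -n puts in front of the numbers.

```shell
# Hypothetical sample data; both files are sorted, as join requires.
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/file1"
printf 'avocado\nbanana\nblueberry\ncherry\n' > "$dir/file2"

tab=$(printf '\t')

# Number each line: cat -n writes "<padded number><TAB><line>".
cat -n "$dir/file1" > "$dir/num1"
cat -n "$dir/file2" > "$dir/num2"

# Join on field 2 (the line text). Output is: text, number in file1, number in file2.
# The awk step reformats the space-padded numbers as plain integers.
out=$(join -t "$tab" -1 2 -2 2 "$dir/num1" "$dir/num2" |
      awk -F '\t' '{ printf "%s\t%d\t%d\n", $1, $2, $3 }')

printf '%s\n' "$out"
rm -rf "$dir"
```

With these sample files, banana is reported at line 2 in both files, and cherry at line 3 in file1 and line 4 in file2.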

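Caveat: This works only if the lines don't already contain tab characters. If they do, you can use sed to convert the first tab (the one added by cat -n) to a 'safe' character that does not occur anywhere in the files. Note that the \t escape used below is a GNU sed extension.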

safe="|"
join -t"$safe" -j2 \
  <( cat -n file1 | sed -e "s:\t:$safe:" ) \
  <( cat -n file2 | sed -e "s:\t:$safe:" )

Also, depending on how join is implemented, you may want to have the smaller file listed in the first process substitution and the larger one in the second. This way the smaller file may all fit in memory and the larger file might be scanned and matching lines selected efficiently. I have no idea if this is the case, but it might be worth a shot.

Upvotes: 2

l0b0

Reputation: 58788

You can get halfway there with GNU diff. This shows you the line numbers in file1, but unfortunately there doesn't seem to be any option to show the line number from file2. The diff man page seems to assume that an unchanged line sits at the same line number in both files, which is not how diff normally behaves.

diff --unchanged-line-format=$'%dn\t%L' --old-line-format='' --new-line-format='' file1 file2

Another half measure using unified diff:

diff -u file1 file2

This shows different lines with a bit of context, meaning you can infer which lines the common text is on. The lines starting with @@ give you the line information. For example:

@@ -1,5 +2,10 @@

This means the next line starting with - or a space in the diff is line 1 in file1, and that the next line starting with + or a space is line 2 in file2. For your purposes you can ignore the numbers after the comma.
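The bookkeeping described above can be automated. The following is only a sketch under a few assumptions: the sample files are made up, the context value passed to -U is an arbitrary number that must exceed the number of lines in the files so that every unchanged line appears in the output, and diff itself may be slow or memory-hungry on a 90-million-line file. The awk script keeps one counter per file, starts both from the @@ header, and prints both counters for every line that begins with a space.

```shell
# Hypothetical sample data.
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/file1"
printf 'avocado\nbanana\nblueberry\ncherry\n' > "$dir/file2"

# diff exits with status 1 when the files differ, which is expected here.
out=$({ diff -U 1000000 "$dir/file1" "$dir/file2" || true; } | awk '
/^@@/ {
    # Header looks like: @@ -1,3 +1,4 @@
    split($2, o, ","); split($3, n, ",")
    oldno = substr(o[1], 2) + 0      # start line in file1
    newno = substr(n[1], 2) + 0      # start line in file2
    inhunk = 1
    next
}
!inhunk { next }                     # skip the ---/+++ file header
/^ /  { printf "%d\t%d\t%s\n", oldno, newno, substr($0, 2); oldno++; newno++; next }
/^-/  { oldno++; next }              # line only in file1
/^\+/ { newno++; next }              # line only in file2
')

printf '%s\n' "$out"
rm -rf "$dir"
```

Each output line holds the line number in file1, the line number in file2, and the common text. Keep in mind that diff only reports lines as unchanged when they fit its alignment of the two files, so this matches comm -12 only when the files are sorted.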

Upvotes: 0

Håkon Hægland

Reputation: 40718

You can try:

awk '
FNR==NR {
    a[$0]++
    next
}
$0 in a {
    print
    delete a[$0]
}' file1 file2

If you also want the line numbers, you can use arrays of arrays, which are available in gawk version 4 and later:

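gawk '
# First file: remember every line number at which each line occurs.
FNR==NR {
    a[$0][FNR]++
    file1=FILENAME
    next
}
FNR==1 {
    file2=FILENAME
}

# Second file: record line numbers only for lines also seen in the first file.
$0 in a {
    b[$0][FNR]++
}

END {
    for(i in b) {
        print "Line: " i
        print " Line numbers in "file1":"
        printf "  "
        for (j in a[i])
            printf "%s,", j
        print ""
        print " Line numbers in "file2":"
        printf "  "
        for (j in b[i])
            printf "%s,", j
        print ""
    }
}' file1 file2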

Upvotes: 0
