Reputation: 175
I want to find the common lines between two large files, one with 90 million lines and one with 100 thousand, along with their line numbers.
comm -12 file1 file2
gives me the common lines, but I also want to know the line number of each one in the individual files.
Upvotes: 6
Views: 2235
Reputation: 1118
This solution works for me on my small test files - I'm not sure how it will perform on a file with 90 million lines.
tab=$(printf '\t')
join -t "$tab" -j 2 <(cat -n file1) <(cat -n file2)
This works because cat -n prepends a space-padded number followed by a tab character to each line. The join then finds the common lines by comparing only the text after the first tab.
After the join is complete, you should see the common lines, each followed by two numbers. The first number is the line number from file1 and the second from file2.
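Caveats: join expects both inputs to be sorted on the join field, which holds here as long as file1 and file2 are already sorted (as comm also requires). It also works only if the files don't already contain tab characters. If they do, you can use sed to convert the first tab (the one cat -n inserts) to a 'safe' character that does not occur in the data.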
safe="|"
join -t"$safe" -j2 \
<( cat -n file1 | sed -e "s:\t:$safe:" ) \
<( cat -n file2 | sed -e "s:\t:$safe:" )
Also, depending on how join is implemented, you may want to list the smaller file in the first process substitution and the larger one in the second. That way the smaller file may fit entirely in memory while the larger file is scanned and matching lines are selected efficiently. I have no idea if this is the case, but it might be worth a shot.
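If the files are not already sorted, one possible variation is to number the lines first and then sort each numbered file on the text column before joining. This is only a sketch: big.txt, small.txt and the temporary directory are made-up sample names, it assumes the lines contain no tabs, and I have not tried it on anything close to 90 million lines.

```shell
# Hypothetical sample data; big.txt and small.txt are placeholder names.
work=$(mktemp -d)
printf '%s\n' pear apple kiwi banana > "$work/big.txt"
printf '%s\n' banana fig apple > "$work/small.txt"

tab=$(printf '\t')

# Number first, then sort on the text (field 2) so join sees ordered keys.
# LC_ALL=C keeps sort and join on the same byte-wise collation.
export LC_ALL=C
cat -n "$work/small.txt" | sort -t "$tab" -k 2,2 > "$work/small.num"
cat -n "$work/big.txt"   | sort -t "$tab" -k 2,2 > "$work/big.num"

# Output columns: line text, line number in small.txt, line number in big.txt.
common=$(join -t "$tab" -1 2 -2 2 "$work/small.num" "$work/big.num")
printf '%s\n' "$common"
rm -r "$work"
```

Because the line numbers are attached before sorting, they still refer to positions in the original files. A line that occurs several times in both files produces one output row per pairing.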
Upvotes: 2
Reputation: 58788
You can get halfway there with diff. The following command shows you the line numbers in file1, but unfortunately there doesn't seem to be any option to show the line number from file2. Judging by man diff, it seems to assume that an unchanged line is also on the same line in both files, which is contrary to how diff normally works.
diff --unchanged-line-format=$'%dn\t%L' --old-line-format='' --new-line-format='' file1 file2
Another half measure using unified diff:
diff -u file1 file2
This shows the differing lines with a bit of context, meaning you can infer which lines the common text is on. The lines starting with @@ give you the line information. For example:
@@ -1,5 +2,10 @@
This means the next line starting with - or a space in the diff is line 1 in file1, and that the next line starting with + or a space is line 2 in file2. For your purposes you can ignore the numbers after the comma.
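The bookkeeping described above can be automated. Here is a rough sketch that reads the unified diff with awk, takes the start lines from each @@ header, and counts forward to print both line numbers for every unchanged (space-prefixed) line. The sample files and variable names are made up for illustration.

```shell
# Hypothetical sample data standing in for file1 and file2.
work=$(mktemp -d)
printf '%s\n' a b c d > "$work/file1"
printf '%s\n' x a c d e > "$work/file2"

pairs=$(diff -u "$work/file1" "$work/file2" | awk '
/^@@/ {
    # Hunk header, e.g. "@@ -1,5 +2,10 @@": take the start line of each side.
    split($2, old, ","); split($3, new, ",")
    o = substr(old[1], 2) + 0    # current line in file1
    n = substr(new[1], 2) + 0    # current line in file2
    inhunk = 1
    next
}
!inhunk { next }                 # skip the "---" / "+++" file headers
/^-/  { o++; next }              # line only in file1
/^\+/ { n++; next }              # line only in file2
/^ /  { printf "%d\t%d\t%s\n", o, n, substr($0, 2); o++; n++ }
')
printf '%s\n' "$pairs"
rm -r "$work"
```

Keep in mind that this only reports lines that diff shows as context, so by default you get at most three unchanged lines around each change. It also reflects how diff aligned the two files, not every possible pairing of identical lines.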
Upvotes: 0
Reputation: 40718
You can try:
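awk '
# First file: remember each distinct line.
FNR==NR {
    a[$0]++
    next
}
# Second file: print a line the first time it matches, then forget it.
$0 in a {
    print
    delete a[$0]
}' file1 file2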
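If you also want to get the line numbers, you can use arrays of arrays in gawk version 4 or later. Save the following to a script file and pass it to gawk with -f, followed by the two file names:
# First file: record every line number at which each line occurs.
FNR==NR {
    a[$0][FNR]++
    file1=FILENAME
    next
}
# First line of the second file: remember its name for the report.
FNR==1 {
    file2=FILENAME
}
# Second file: record line numbers only for lines also seen in the first file.
$0 in a {
    b[$0][FNR]++
}
# Report each common line with its line numbers in both files.
# Note: "for (x in array)" visits entries in no guaranteed order.
END {
    for(i in b) {
        print "Line: " i
        print " Line numbers in "file1":"
        printf " "
        for (j in a[i])
            printf "%s,", j
        print ""
        print " Line numbers in "file2":"
        printf " "
        for (j in b[i])
            printf "%s,", j
        print ""
    }
}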
Upvotes: 0