chas

Reputation: 1645

How to compare one file with a bunch of files in Linux

I have a fileA as shown below:

file A

chr1   123 aa b c d
chr1   234 a  b c d
chr1   345 aa b c d
chr1   456 a  b c d
....

And I have a bunch of similar files with similar columns in a directory dirB, against which I have to compare fileA.

To do this, I concatenated all the files in dirB into a single file called fileB using cat, and then compared the two files on key columns 1 and 2 as shown below:

awk 'FNR==NR{a[$1,$2]++;next}!a[$1,$2]' fileB fileA

This command uses columns 1 and 2 as the key and prints the rows of fileA whose key does not appear in fileB.
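
For illustration only (this fileB is hypothetical, not taken from the question), suppose fileB contained just the keys 123 and 345:

chr1   123 zz y x w
chr1   345 zz y x w

Then the command above would print only the fileA rows whose column 1 and 2 key is missing from fileB:

chr1   234 a  b c d
chr1   456 a  b c d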

However, the issue is that fileB becomes too large to handle, in terms of both disk space and memory, when there is a large number of files.

Could someone suggest an alternative that skips the step of concatenating all the files into fileB, so that fileA can be compared directly with all the files in dirB?

chr1   123    aa    b    c    d    xxxx    abcd
chr1   234    a     b    c    d
chr1   345    aa    b    c    d    yyyy    defg
chr1   456    a    b    c    d

Upvotes: 0

Views: 99

Answers (1)

jas

Reputation: 10865

Perhaps something along these lines:

 awk 'NR == FNR { a[$1,$2] = $0; next } 
                { delete a[$1, $2] }
            END { for (i in a) print a[i] }
 ' a.txt b1.txt b2.txt ...

Starting with file A, add each key to an array, with the contents of its row as the value. Then, for all the B files, delete any elements from the array with matching keys. At the end, any elements remaining are those in A that weren't in any of the B files, so we can just loop through and print them out.
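
Applied to the question's files, and assuming the files in dirB can be selected with a shell glob (the actual filenames aren't given), the invocation would be something like:

awk 'NR == FNR { a[$1,$2] = $0; next }
               { delete a[$1, $2] }
           END { for (i in a) print a[i] }
' fileA dirB/*

One thing to note: the for (i in a) loop prints the surviving rows in awk's internal array order, not necessarily in fileA's original line order.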

Upvotes: 1
