Check if two files match in 2 column values and print those lines to a new output file

Question

I would like to match two files on two column values per file. If both the "BP" and "P" values match on the same line, I want to print the matching lines from file 1 to a third file.

File 1:

CHR BP BETA SE P PHENOTYPE FDR CATEGORY SNP
10 110408937 3.386e+00 1.333e+00 1.112e-02 1 1 Medication rs113627704
10 110408937 4.409e+00 1.623e+00 6.602e-03 2 1 Cardiovascular rs113627704
10 110408937 2.382e+00 1.124e+00 3.414e-02 3 1 Medication rs113627704

File 2:

CHR F SNP BP P TOTAL
10 1 rs113627704 110408937 1.112e-02 456
4 1 rs43567 2345677 0.045457 567
3 1 rs567899 479899 0.3456 223

Desired output:

CHR BP BETA SE P PHENOTYPE FDR CATEGORY SNP
10 110408937 3.386e+00 1.333e+00 1.112e-02 1 1 Medication rs113627704

I have tried the following two commands:

awk 'FNR==NR{a[$4,$5]=$0;next}{if(b=a[$2,$5]){print b}}' file1 file2 > file3

Here I get the error "bash: awk: command not found." I use awk all the time and it always works.

awk 'FNR==NR {a[$4,$5]=$0; next} ($4,$5) in a {print a[$2,$5], $0}' file1 file2 > file3

Here I get an empty file.

James Brown · Accepted Answer

This should work:

$ awk 'NR==FNR{a[$4,$5]=$0;next}(($2,$5) in a)' file2 file1

Output:

CHR BP BETA SE P PHENOTYPE FDR CATEGORY SNP
10 110408937 3.386e+00 1.333e+00 1.112e-02 1 1 Medication rs113627704
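
To reproduce this end to end and write the result to a third file as the question asks, something like the following should do. It is only a sketch: the scratch directory name match_demo is my own choice and not part of the question, and the data is copied from the samples above.

```shell
# Scratch directory (hypothetical name) so existing files are not overwritten.
mkdir -p match_demo

# file1: the lines we want to print from.
cat > match_demo/file1 <<'EOF'
CHR BP BETA SE P PHENOTYPE FDR CATEGORY SNP
10 110408937 3.386e+00 1.333e+00 1.112e-02 1 1 Medication rs113627704
10 110408937 4.409e+00 1.623e+00 6.602e-03 2 1 Cardiovascular rs113627704
10 110408937 2.382e+00 1.124e+00 3.414e-02 3 1 Medication rs113627704
EOF

# file2: the lookup file, BP in column 4 and P in column 5.
cat > match_demo/file2 <<'EOF'
CHR F SNP BP P TOTAL
10 1 rs113627704 110408937 1.112e-02 456
4 1 rs43567 2345677 0.045457 567
3 1 rs567899 479899 0.3456 223
EOF

# Same command as above, with the output redirected to file3.
awk 'NR==FNR{a[$4,$5]=$0;next}(($2,$5) in a)' \
    match_demo/file2 match_demo/file1 > match_demo/file3

cat match_demo/file3
```

The header line is printed too because the literal strings "BP" and "P" sit in the key columns of both headers, so they match like any other record.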

Explained:

$ awk '
NR==FNR {         # first file (file2): build the lookup, output lines come from file1
    a[$4,$5]=$0   # key on fields 4 (BP) and 5 (P) of file2
    next          # skip to the next record
}                 # everything below runs only for file1
(($2,$5) in a)    # print the file1 line if its fields 2 (BP) and 5 (P) form a known key
' file2 file1     # mind the file order
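
One caveat: the keys are compared as strings, so 1.112e-02 in one file would not match 0.01112 in the other. Your samples use the same notation in both files, so this may never bite you. If it does, a possible workaround is to normalise P before using it as a key. The sketch below assumes a hypothetical variant of file2 that writes the value as 0.01112; the helper norm() and the directory match_demo_num are names I made up for illustration.

```shell
mkdir -p match_demo_num

printf '%s\n' \
  'CHR BP BETA SE P PHENOTYPE FDR CATEGORY SNP' \
  '10 110408937 3.386e+00 1.333e+00 1.112e-02 1 1 Medication rs113627704' \
  '10 110408937 4.409e+00 1.623e+00 6.602e-03 2 1 Cardiovascular rs113627704' \
  > match_demo_num/file1

# Hypothetical file2: same P value as the first data line, different notation.
printf '%s\n' \
  'CHR F SNP BP P TOTAL' \
  '10 1 rs113627704 110408937 0.01112 456' \
  > match_demo_num/file2

awk '
# norm() is a made-up helper: force numeric conversion, then print in one format
function norm(p) { return sprintf("%.10g", p + 0) }
NR==FNR { a[$4, norm($5)]; next }   # file2: remember (BP, normalised P)
(($2, norm($5)) in a)               # file1: print when the normalised key is known
' match_demo_num/file2 match_demo_num/file1 > match_demo_num/file3

cat match_demo_num/file3
```

Note that any non-numeric P (including the header "P") is normalised to 0, which is why the header still matches here. BP is left as a plain string since it is an integer position in both files.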
