Reputation: 13
I need to match a single column entry in one file to a column entry in a second file that consists of a list (in shell). The awk command I've used only matches to the first word of the list, and doesn't scan through the entire list in the column field.
File 1 looks like this:
chr1:725751 LOC100288069
rs3131980 LOC100288069
rs28830877 LINC01128
rs28873693 LINC01128
rs34221207 ATP4A
File 2 looks like this:
Annotation Total Genes With Ann Your Genes With Ann) Your Genes No Ann) Genome With Ann) Genome No Ann) ln
1 path hsa00190 Oxidative phosphorylation 55 55 1861 75 1139 5.9 9.64 0 0 ATP12A ATP4A ATP5A1 ATP5E ATP5F1 ATP5G1 ATP5G2 ATP5G3 ATP5J ATP5O ATP6V0A1 ATP6V0A4 ATP6V0D2 ATP6V1A ATP6V1C1 ATP6V1C2 ATP6V1D ATP6V1E1 ATP6V1E2 ATP6V1G3 ATP6V1H COX10 COX17 COX4I1 COX4I2 COX5A COX6B1 COX6C COX7A1 COX7A2 COX7A2L COX7C COX8A NDUFA5 NDUFA9 NDUFB3 NDUFB4 NDUFB5 NDUFB6 NDUFS1 NDUFS3 NDUFS4 NDUFS5 NDUFS6 NDUFS8 NDUFV1 NDUFV3 PP PPA2 SDHA SDHD TCIRG1 UQCRC2 UQCRFS1 UQCRH
Expected output:
rs34221207 ATP4A hsa00190
(please excuse the formatting - all the columns are tab-delimited until the column of gene names, $14, called Genome...)
My command is this:
awk 'NR==FNR{a[$14]=$3; next}a[$2]{print $0 "\t" a[$2]}' file2 file 1
All help will be much appreciated!
Upvotes: 0
Views: 111
Reputation: 2456
You need to process files in the other order, and loop over your list:
awk 'NR==FNR{a[$2]=$1; next} {for(i=15;i<=NF;++i)if(a[$i]){print a[$i] "\t" $i "\t" $3}}' file1 file2
Explanation:
NR
is a global "record number" counter that awk increments for each line read from each file. FNR
is a per-file "record number" that awk resets to 1 on the first line of each file. So the NR==FNR
condition is true for lines in the first file and false for lines in subsequent files. It is an awk idiom for picking out just the first file info. In this case, a[$2]=$1
stores the first field text keyed by the second field text. The next
tells awk to stop short on the current line and to read and continue processing normally the next line. A next
at the end of the first action clause like this is functionally like an ELSE
condition on the remaining code if awk had such a syntax (which it doesn't): NR==FNR{a[$2]=$1} ELSE {for...
. More clear and only slightly less time-efficient would have been to write instead NR==FNR{a[$2]=$1}NR!=FNR{for...
.
Now to the second action clause. No condition preceding it means awk will do it for every line that is not short-circuited by the preceding next
, that is, all lines in files other than the first -- file2 only in this case. Your file2 has a list of potential keys starting in field #15 and extending to the last field. The awk built-in variable for the last field number is NF
(number of fields). The for
loop is pretty self-explanatory then, looping over just those field numbers. For each of those numbers i
we want to know if the text in that field $i
is a known key from the first file -- a[$i]
is set, that is, evaluates to a non-empty (non-false) string. If so, then we've got our file1 first field in a[$i]
, our matching file1 second field in $i
, and our file2 field of interest in $3
(the text of the current file2 3rd field). Print them tab-separated. The next
here is an efficiency-only measure that stops all processing on the file2 record once we've found a match. If your file2 key list might contain duplicates and you want duplicate output lines if there is a match on such a duplicate, then you must remove that last next
.
Actually now that I look again, you probably do want to find any multiple matches even on non-duplicates, so I have removed the 2nd next
from the code.
Upvotes: 1