Reputation: 115
I have two txt files (tab delimited) I'm trying to compare with awk. I want to compare column 1 of file1 to column 5 of file2 and any time a string from file1 column 1 matches file2 column 5 I want to print (eventually to a new txt file) the entire row of file2. Theres about 14000 lines in file 1 with just one column and 40000 rows with 6 columns in file2. Below are some test files I've been using to try to figure out how to use awk to get my desired output.
file1
AAGAB
AAK1
AAMDC
AAMP
AAR2
AARD
AARS
file2
chr22 38035683 38052050 NM_018957 AAMDC +
chrX 47001614 47004609 NM_019056 AAMP -
chr9 21994789 22077889 NR_047539 AAR2 +
chr16 20370491 20416033 NM_174924 AARD -
chr1 210111518 210337633 NM_001146262 AARS +
chrX 30233674 30238206 NM_002364 MAGEB2 +
chrX 30261847 30270155 NM_002363 MAGEB1 +
I've tried adopting some other awk questions with similar desired output from the abundance of similar awk questions on SO:
awk -F '\t' 'NR==FNR{c[$1]++;next}c[$5]' file1 file2
But I only get one match for the AARS row, even though there are other matches in the test files:
chr1 210111518 210337633 NM_001146262 AARS +
The output I am looking for is:
chr14 94463615 94473898 NR_024182 AAGAB +
chr10 74033676 74035797 NM_019058 AAK1 +
chr22 38035683 38052050 NM_018957 AAMDC +
chrX 47001614 47004609 NM_019056 AAMP -
chr9 21994789 22077889 NR_047539 AAR2 +
chr16 20370491 20416033 NM_174924 AARD -
chr1 210111518 210337633 NM_001146262 AARS +
I've tried adopting a couple other posts on stackoverflow but keep getting the same output of just 1 match even though multiple others exist:
awk to match file1 with file2 and output matches
awk -F '\t' 'NR==FNR{c[$1]; next} ($5 in c)' file1 file2
Find rows with the same value in a column in two files
awk -F '\t' 'NR==FNR{A[$1];next}$5 in A' file1 file2
Comparing two columns in two files using awk
awk -F '\t' 'FNR==NR {a[$1];next} {for (i in a) if ($5~i) print}' file1 file2
Can someone help me understand where my awk line is going wrong?
I'm currently using GNU Awk 4.2.1, API: 2.0
Upvotes: 5
Views: 3091
Reputation: 103744
If there is a possibility that your files are not really tab delimited, you can use sed
to replace runs of spaces and tabs with a single tab in-place. This assumes that there are not spaces in your data:
sed -i .bak 's/[[:space:]][[:space:]]*/\t/g' file
Once you have the delimiter worked out, you can use the Unix join
command:
join -t $'\t' -1 1 -2 5 -o "2.1 2.2 2.3 2.4 2.5 2.6" <(sort file_1) <(sort -k5,5 file_2)
The files need to be sorted on the join field. The sort is not necessary if already sorted by that field.
Or, this awk works:
awk -F '\t' 'FNR==NR{tag[$1]; next} $5 in tag' file_1 file_2
Join has the advantage of working with files larger than memory. The awk does not require sorting and is likely faster if you have the memory.
Either (with your two input examples) prints:
chr22 38035683 38052050 NM_018957 AAMDC +
chrX 47001614 47004609 NM_019056 AAMP -
chr9 21994789 22077889 NR_047539 AAR2 +
chr16 20370491 20416033 NM_174924 AARD -
chr1 210111518 210337633 NM_001146262 AARS +
Upvotes: 0
Reputation: 15246
Check that your files have the delimiter you expect.
I copy/pasted your data, and got success with
awk 'NR==FNR{hit[$1];next} $5 in hit' file1 file2
but none with
awk -F $'\t' 'NR==FNR{hit[$1];next} $5 in hit' file1 file2
I opened the file and substituted all consecutive spaces with tabs and ran it again, and both worked the same.
Sometimes too much specificity is counterproductive.
Upvotes: 0
Reputation: 96
I copied your example content in "file1" and "file2" and wrote the following to "script.awk"
BEGIN{
while(getline < "file1") {
a[$1]=1
}
close("file1")
}
{
if($5 in a){
print
}
}
Running
gawk -f script.awk file2
the result is
chr22 38035683 38052050 NM_018957 AAMDC +
chrX 47001614 47004609 NM_019056 AAMP -
chr9 21994789 22077889 NR_047539 AAR2 +
chr16 20370491 20416033 NM_174924 AARD -
chr1 210111518 210337633 NM_001146262 AARS +
wich in my understanding is correct. Please check.
Upvotes: 0
Reputation: 189
Read from file1
into an associative array. Then read file2
and for each line looking for a match.
$ awk 'BEGIN{while(getline line<"file1") {x[line]="1";}} {for (i=1;i<=NF;i++) if(x[$i]) {print; break} }' file2
chr22 38035683 38052050 NM_018957 AAMDC +
chrX 47001614 47004609 NM_019056 AAMP -
chr9 21994789 22077889 NR_047539 AAR2 +
chr16 20370491 20416033 NM_174924 AARD -
chr1 210111518 210337633 NM_001146262 AARS +
Upvotes: 0
Reputation: 1871
I think you almost had it... I did check -- and the following works for me.
awk -F '\t' 'BEGIN { split("", a) } NR == FNR { a[$0] = 1; next } $5 in a' file1 file2
Result (the output in your example is not consistent with file2):
chr22 38035683 38052050 NM_018957 AAMDC +
chrX 47001614 47004609 NM_019056 AAMP -
chr9 21994789 22077889 NR_047539 AAR2 +
chr16 20370491 20416033 NM_174924 AARD -
chr1 210111518 210337633 NM_001146262 AARS +
It is possible that there is something wrong with the FS, I suppose... You could try to make FS hit on both space and tab & also attempt to remove any invisible and non-printable characters from the mess?
awk -F '[[:blank:]]+' '
BEGIN { split("", a) }
NR == FNR {
gsub(/[^[:graph:]]/, "")
a[$0] = 1
next
}
{
p = $5
gsub(/[^[:graph:]]/, "", p)
}
p in a' file1 file2
Upvotes: 0