cutts
cutts

Reputation: 115

AWK comparing two columns in two files output matching row - missing rows from matches

I have two txt files (tab delimited) I'm trying to compare with awk. I want to compare column 1 of file1 to column 5 of file2 and any time a string from file1 column 1 matches file2 column 5 I want to print (eventually to a new txt file) the entire row of file2. Theres about 14000 lines in file 1 with just one column and 40000 rows with 6 columns in file2. Below are some test files I've been using to try to figure out how to use awk to get my desired output.

file1

AAGAB
AAK1
AAMDC
AAMP
AAR2
AARD
AARS

file2

chr22   38035683    38052050    NM_018957   AAMDC   +
chrX    47001614    47004609    NM_019056   AAMP    -
chr9    21994789    22077889    NR_047539   AAR2    +
chr16   20370491    20416033    NM_174924   AARD    -
chr1    210111518   210337633   NM_001146262    AARS    +
chrX    30233674    30238206    NM_002364   MAGEB2  +
chrX    30261847    30270155    NM_002363   MAGEB1  +

I've tried adopting some other awk questions with similar desired output from the abundance of similar awk questions on SO:

awk -F '\t' 'NR==FNR{c[$1]++;next}c[$5]' file1 file2

But I only get one match for the AARS row, even though there are other matches in the test files:

chr1    210111518   210337633   NM_001146262    AARS    +

The output I am looking for is:

chr14   94463615    94473898    NR_024182   AAGAB   +
chr10   74033676    74035797    NM_019058   AAK1    +
chr22   38035683    38052050    NM_018957   AAMDC   +
chrX    47001614    47004609    NM_019056   AAMP    -
chr9    21994789    22077889    NR_047539   AAR2    +
chr16   20370491    20416033    NM_174924   AARD    -
chr1    210111518   210337633   NM_001146262    AARS    +

I've tried adopting a couple other posts on stackoverflow but keep getting the same output of just 1 match even though multiple others exist:

awk to match file1 with file2 and output matches

awk -F '\t' 'NR==FNR{c[$1]; next} ($5 in c)' file1 file2

Find rows with the same value in a column in two files

awk -F '\t' 'NR==FNR{A[$1];next}$5 in A' file1 file2

Comparing two columns in two files using awk

awk -F '\t' 'FNR==NR {a[$1];next} {for (i in a) if ($5~i) print}' file1 file2

Can someone help me understand where my awk line is going wrong?

I'm currently using GNU Awk 4.2.1, API: 2.0

Upvotes: 5

Views: 3091

Answers (5)

dawg
dawg

Reputation: 103744

If there is a possibility that your files are not really tab delimited, you can use sed to replace runs of spaces and tabs with a single tab in-place. This assumes that there are not spaces in your data:

sed -i .bak 's/[[:space:]][[:space:]]*/\t/g' file

Once you have the delimiter worked out, you can use the Unix join command:

join -t $'\t' -1 1 -2 5 -o "2.1 2.2 2.3 2.4 2.5 2.6" <(sort file_1) <(sort -k5,5 file_2)

The files need to be sorted on the join field. The sort is not necessary if already sorted by that field.

Or, this awk works:

awk -F '\t' 'FNR==NR{tag[$1]; next} $5 in tag' file_1 file_2

Join has the advantage of working with files larger than memory. The awk does not require sorting and is likely faster if you have the memory.

Either (with your two input examples) prints:

chr22   38035683    38052050    NM_018957   AAMDC   +
chrX    47001614    47004609    NM_019056   AAMP    -
chr9    21994789    22077889    NR_047539   AAR2    +
chr16   20370491    20416033    NM_174924   AARD    -
chr1    210111518   210337633   NM_001146262    AARS    +

Upvotes: 0

Paul Hodges
Paul Hodges

Reputation: 15246

Check that your files have the delimiter you expect.

I copy/pasted your data, and got success with

awk 'NR==FNR{hit[$1];next} $5 in hit' file1 file2

but none with

awk -F $'\t' 'NR==FNR{hit[$1];next} $5 in hit' file1 file2

I opened the file and substituted all consecutive spaces with tabs and ran it again, and both worked the same.

Sometimes too much specificity is counterproductive.

Upvotes: 0

rossifr
rossifr

Reputation: 96

I copied your example content in "file1" and "file2" and wrote the following to "script.awk"

BEGIN{
    while(getline < "file1") {
        a[$1]=1
    }
    close("file1")
} 
{
    if($5 in a){
        print
    }
}

Running

gawk -f script.awk file2

the result is

chr22   38035683    38052050    NM_018957   AAMDC   +
chrX    47001614    47004609    NM_019056   AAMP    -
chr9    21994789    22077889    NR_047539   AAR2    +
chr16   20370491    20416033    NM_174924   AARD    -
chr1    210111518   210337633   NM_001146262    AARS    +

wich in my understanding is correct. Please check.

Upvotes: 0

dr-who
dr-who

Reputation: 189

Read from file1 into an associative array. Then read file2 and for each line looking for a match.

$ awk 'BEGIN{while(getline line<"file1") {x[line]="1";}} {for (i=1;i<=NF;i++) if(x[$i]) {print; break} }' file2
chr22   38035683    38052050    NM_018957   AAMDC   +
chrX    47001614    47004609    NM_019056   AAMP    -
chr9    21994789    22077889    NR_047539   AAR2    +
chr16   20370491    20416033    NM_174924   AARD    -
chr1    210111518   210337633   NM_001146262    AARS    +

Upvotes: 0

Michael Back
Michael Back

Reputation: 1871

I think you almost had it... I did check -- and the following works for me.

awk -F '\t' 'BEGIN { split("", a) } NR == FNR { a[$0] = 1; next } $5 in a' file1 file2

Result (the output in your example is not consistent with file2):

chr22   38035683    38052050    NM_018957   AAMDC   +
chrX    47001614    47004609    NM_019056   AAMP    -
chr9    21994789    22077889    NR_047539   AAR2    +
chr16   20370491    20416033    NM_174924   AARD    -
chr1    210111518   210337633   NM_001146262    AARS    +

It is possible that there is something wrong with the FS, I suppose... You could try to make FS hit on both space and tab & also attempt to remove any invisible and non-printable characters from the mess?

awk -F '[[:blank:]]+' '
    BEGIN { split("", a) }
    NR == FNR {
        gsub(/[^[:graph:]]/, "")
        a[$0] = 1
        next
    }
    {
        p = $5
        gsub(/[^[:graph:]]/, "", p)
    }
    p in a' file1 file2

Upvotes: 0

Related Questions