Reputation: 561

Faster solution to compare files in bash

file1:

chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15
chr1    16857   17055   NR_024540_4_r_WASH7P_198

and file2:

NR_024540 11

I need find match file2 in file1 and print whole file1 + second column of file2

So ouptut is:

  chr1  14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11
chr1    16857   17055   NR_024540_4_r_WASH7P_198 11

My solution is very slow in bash:

#!/bin/bash

while read line; do

c=$(echo $line | awk '{print $1}')
d=$(echo $line | awk '{print $2}')

grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' >> output


 done < file2

I am prefer FASTER any bash or awk solution. Output can be modified, but need keep all the informations (order of column can be different).

EDIT:

Right now it looks like fastest solution according @chepner:

#!/bin/bash

while read -r c d; do

grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' 

done < file2 > output

Upvotes: 2

Answers (7)

James Brown

Reputation: 37404

If the searched string is always the same length (length("NR_024540")==9):

awk 'NR==FNR{a[$1]=$2;next} (i=substr($4,1,9)) && (i in a){print $0, a[i]}' file2 file1

Explained:

NR==FNR {                         # process file2
    a[$1]=$2                      # hash record using $1 as the key
    next                          # skip to next record
} 
(i=substr($4,1,9)) && (i in a) {  # read the first 9 bytes of $4 to i and search in a
    print $0, a[i]                # output if found
}

Upvotes: 1

agc

Reputation: 8406

No awk or sed needed. This assumes file2 is only one line:

n="`cut -f 2 file2`" ; while read x ; do echo "$x $n" ; done < file1

Upvotes: 0

NeronLeVelu

Reputation: 10039

awk -F '[[:blank:]_]+' '
   FNR==NR { a[$2]=$3 ;next }
   { if ( $5 in a ) $0 = $0 " " a[$5] }
   7
   ' file2 file1

Comment:

use _ as extra field separator so file names are easier to compare in both file (using only the number part).
7 is for fun, it's just a non 0 value -> print the line
i don't change the field (NF+1, ...) so we keep the original format adding just the referenced number

a smaller oneliner code (optimized for code size) (assuming non empty lines in file1 that are mandatory). if separator are only space, you can remplace [:blank:] by a space char

awk -F '[[:blank:]_]+' 'NF==3{a[$2]=$3;next}$0=$0" "a[$5]' file2 file1

Upvotes: 0

VIPIN KUMAR

Reputation: 3137

try this -

 cat file2
NR_024540 11
NR_024541 12

 cat file11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14361   14829   NR_024542_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15
chr1    16857   17055   NR_024540_4_r_WASH7P_198
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15


awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11
chr1    16857   17055   NR_024540_4_r_WASH7P_198 11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11

Performance - (Tested for 55000 records)

time awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file1 > output1

real    0m0.16s
user    0m0.14s
sys     0m0.01s

Upvotes: 2

Jose Ricardo Bustos M.

Reputation: 8164

Another solution using join and sed, Under the assumption that file1 and file2 are sorted

join <(sed -r 's/[^ _]+_[^_]+/& &/' file1) file2 -1 4 -2 1 -o "1.1 1.2 1.3 1.5 2.2" > output

If the output order doesn't matter, to use awk

awk 'FNR==NR{d[$1]=$2; next}
    {split($4,v,"_"); key=v[1]"_"v[2]; if(key in d) print $0, d[key]}
' file2 file1

you get,

chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11

Upvotes: 2

Inian

Reputation: 85580

In a single Awk command,

awk 'FNR==NR{map[$1]=$2; next}{ for (i in map) if($0 ~ i){$(NF+1)=map[i]; print; next}}' file2 file1

chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11

A more readable version in a multi-liner

FNR==NR {
    # map the values from 'file2' into the hash-map 'map'
    map[$1]=$2
    next
}
# On 'file1' do
{
    # Iterate through the array map
    for (i in map){
        # If there is a direct regex match on the line with the 
        # element from the hash-map, print it and append the 
        # hash-mapped value at last
        if($0 ~ i){
            $(NF+1)=map[i]
            print
            next
        }
    }
}

Upvotes: 5

chepner

Reputation: 531075

You are starting a lot of external programs unnecessarily. Let read split the incoming line from file2 for you instead of calling awk twice. There is also no need to run grep; awk can do the filtering itself.

while read -r c d; do
    awk -v field="$c" -v line="$d" -v OFS='\t' '$0 ~ field {print $1,$2,$3,$4"_"line}' file1
done < file2 > output

Upvotes: 1

Faster solution to compare files in bash

Answers (7)

Related Questions