K Y
K Y

Reputation: 378

Partially match to words from two different files and extract rows sed,awk,grep?

I have two different files that I want to partially match one number from the first one with another number from the other one and extract the whole column.

File1:

smt_hsa_3150    932
smt_hsa_28592   682
smt_hsa_5184    657
smt_hsa_430 648
smt_hsa_14100   648
smt_hsa_96 648

File2:

chr11 5933549 5933577 29 + hsa_smt_028592
chr11 45693060 45693086 27 - hsa_smt_000059
chr11 45699803 45699832 30 - hsa_smt_000087
chr2 131291172 131291197 26 - hsa_smt_000096

I need to match smt_hsa_28592 or 28592 with hsa_smt_028592 or 028592. and then extract to a new file the line from the second file plus the number from 2nd column of the 1st file.

output:

chr11 5933549 5933577 29 + hsa_smt_028592 682   
chr2 131291172 131291197 26 - hsa_smt_000096 648

As I'm new to awk/sed programming I tried first to change the name of the first column of the 1st file from smt_hsa_3150 to hsa_smt_3150, but when I perform

awk '{gsub("smt","hsa")}1'

then, I cannot use the same code to change only the second "hsa". The second problem is how I would be able to match hsa_smt_028592 with smt_has_28592 or smt_hsa_96 with hsa_smt_000096.

Upvotes: 1

Views: 128

Answers (1)

Corentin Limier
Corentin Limier

Reputation: 5006

For fun using join :

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %s\n", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)

awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).


awk -F' *|_' -> I split on multiple spaces and "_" character

'{printf "hsa_smt_%06s %s\n", $3, $4}' -> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed

sort -k6 file2 or awk ... | sort -k1 I sort the files on 6th or 1st columns

join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 I join the files on the first column for file 1 (-11) and 6th column for file 2 (-26) then I order the columns (2.1 -> 1st column of file2, etc...)


Pure awk solution :

awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt

In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.


NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next} this block will be executed only on file1 because of the NR==FNR condition. It splits the first column, convert the smt_hsa id to integer (doing +0 is a classic way to convert string to integer in awk), then stores second column inside an array named smt.

next; prevents executing the other blocks for file1

{split($6,s,"_");k=s[3]+0} this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).

k in smt{print $0, smt[k]} will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)


I'm pretty sure that the second solution is more performant but may be greedier in memory.

Both give same output, order of lines may differ :

chr11 5933549 5933577 29 + hsa_smt_028592 682
chr2 131291172 131291197 26 - hsa_smt_000096 648

Upvotes: 1

Related Questions