Reputation: 378
I have two different files that I want to partially match one number from the first one with another number from the other one and extract the whole column.
File1:
smt_hsa_3150 932
smt_hsa_28592 682
smt_hsa_5184 657
smt_hsa_430 648
smt_hsa_14100 648
smt_hsa_96 648
File2:
chr11 5933549 5933577 29 + hsa_smt_028592
chr11 45693060 45693086 27 - hsa_smt_000059
chr11 45699803 45699832 30 - hsa_smt_000087
chr2 131291172 131291197 26 - hsa_smt_000096
I need to match smt_hsa_28592 or 28592 with hsa_smt_028592 or 028592. and then extract to a new file the line from the second file plus the number from 2nd column of the 1st file.
output:
chr11 5933549 5933577 29 + hsa_smt_028592 682
chr2 131291172 131291197 26 - hsa_smt_000096 648
As I'm new to awk/sed programming I tried first to change the name of the first column of the 1st file from smt_hsa_3150 to hsa_smt_3150, but when I perform
awk '{gsub("smt","hsa")}1'
then, I cannot use the same code to change only the second "hsa". The second problem is how I would be able to match hsa_smt_028592 with smt_has_28592 or smt_hsa_96 with hsa_smt_000096.
Upvotes: 1
Views: 128
Reputation: 5006
For fun using join :
join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2 <(awk -F' +|_' '{printf "hsa_smt_%06s %s\n", $3, $4}' file1.txt | sort -k1) <(sort -k6 file2.txt)
awk command is formatting the first column of your file1 in same format as file2. Then we join the two files (first we need to sort them both).
awk -F' *|_'
-> I split on multiple spaces and "_" character
'{printf "hsa_smt_%06s %s\n", $3, $4}'
-> I print hsa_smt_< third column as an integer of width 6 > <4th column>. Awk will add leading zeros if needed
sort -k6 file2
or awk ... | sort -k1
I sort the files on 6th or 1st columns
join -11 -26 -o 2.1,2.2,2.3,2.4,2.5,1.1,1.2
I join the files on the first column for file 1 (-11
) and 6th column for file 2 (-26
) then I order the columns (2.1
-> 1st column of file2, etc...)
Pure awk solution :
awk 'NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}{split($6,s,"_");k=s[3]+0}k in smt{print $0, smt[k]}' file1.txt file2.txt
In this solution, I split smt_hsa_xxx and hsa_smt_xxx, convert the 3rd part to integers so I can compare them.
NR==FNR{split($1,s,"_");smt[s[3]+0]=$2;next}
this block will be executed only on file1 because of the NR==FNR
condition. It splits the first column, convert the smt_hsa id to integer (doing +0 is a classic way to convert string to integer in awk), then stores second column inside an array named smt.
next;
prevents executing the other blocks for file1
{split($6,s,"_");k=s[3]+0}
this block will split the 6th column of file2, convert the id to integer. We could have used " +|_" as delimiter to avoid splitting columns (see first solution).
k in smt{print $0, smt[k]}
will check if the id is a key of smt array. Then it prints line from file2 and second column of file1 (which was stored in smt)
I'm pretty sure that the second solution is more performant but may be greedier in memory.
Both give same output, order of lines may differ :
chr11 5933549 5933577 29 + hsa_smt_028592 682
chr2 131291172 131291197 26 - hsa_smt_000096 648
Upvotes: 1