Reputation: 704
I have two files where the string in file1 have partial match to the string in the last column of file2. I would to merge the two files based the match between the strings. How do I solve this when the match is only partial, meaning that the strings in file1 often is a substring of that in file2. PS: Case should be ignored.
file1:
AGTAAGGTCAGCTAAATAAGCTATCGGGCCCATACCCCGAAAATGTTGGTTATATCCTTCCCGTACTA 0 1 2 3
CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT 2 11 14 0
AAAGTGGCCTACGCCACCGCCATGGACTGGTTCATAGCCGTGTGCTATGCCTTC 1 2 3 4
AAAGTGTCATATGCCACTGCCATGGATTGGTTCATAGCTGTTTGCTTTGCATTC 50 1 1 21
TACCCTGTAGAACCGAANTTGT 0 0 1 4
TCCCTGTGGTCTAGTGGTTAGGATTCTGCGCTCTCACCGCCGCGGCCCGGG 1 0 4 3
GGGCCAGGATGAAACCTAATTTGAGTGGCCATCCATGGATGAGAAATGCGG 0 1 3 0
file2:
chrX Rfam ncRNA 55609165 55609267 53.97 + 0 ID=RF00019.20;Name=RF00019;Alias=Y_RNA;Note=AL627224.14/36063-36164 chrX:55609165-55609267 ggctggtttgagtgcagtgatgcttacaactaattgatcacatccaattacagatttctttgctctttctgtactcccagtgcttcacttgactagccttta
chrX Rfam regulatory_region 57233087 57233370 53.02 - 0 ID=RF01417.3;Name=RF01417;Alias=RSV_RNA;Note=Z83745.1/45303-45021 chrX:57233087-57233370 gtaaatgcaaaccattcacagtcttgctcagctaaggggatagtaaagaaacagtcttttaaatcaatgactattaaaggccaatttcttggaatcatagcaggagaaggcagtcctggctgcaatgtccccataggttgtataactgaattaatggctcttaagtcagttaacattctccatttacctgattttttcttaattacaaaaactggagaatttcaaggggaaaatattggaactatgtgtcctttttctaattgttcagtaactaagtcctcta
chrX Rfam regulatory_region 61975961 61976233 45.45 - 0 ID=RF01417.4;Name=RF01417;Alias=RSV_RNA;Note=BX322784.3/89124-88853 chrX:61975961-61976233 AAAGTGTCATATGCCACTGCCATGGATTGGTTCATAGCTGTTTGCTTTGCATTC
chrX Rfam ncRNA 62059095 62059167 29.9 + 0 ID=RF00005.18;Name=RF00005;Alias=tRNA;Note=BX119964.4/4840-4911 chrX:62059095-62059167 GTTAATGTAGCTTAATTCATCAAAGCAAGGCACTGAAAAATGCCTAGATGAATACACATGATTCCATTAACA
chrX Rfam regulatory_region 62582448 62582735 62.81 - 0 ID=RF01417.5;Name=RF01417;Alias=RSV_RNA;Note=AL158203.12/36753-36467 chrX:62582448-62582735 gtaaacacaaatttttctctgtccttctctgctagatgaatggtataaaaacaatctttaagtcaacaacgattataggccaatcttcaggaattgccacaggggaggggaggacctgttgaagagaccccataggttgcaaattagcattaatagcagttaagtagtgcaaaagtctccatttaccagactttttgggaatgacgaaaatgggcgaattccaaaggctgtttgatggttctatatggccagctttcaattgctcctcaactaattcatgggctctc
chrX Rfam ncRNA 63430570 63430868 141.38 + 0 ID=RF00017.15;Name=RF00017;Alias=Metazoa_SRP;Note=AL355852.23/124872-125169 chrX:63430570-63430868 cctggggcagtggcacatgcctgtagtcccagctacttgggaggctgaagcaggaggatagcttaagttcaggagttctgggatgtaatgcactatgctgatagggtgtctgcactaagttcagcatcaacatggtgacctcccaggagcaggggaccaccaggctgcctaaggaggtatgaactggccgagatcagaaacggagcacataaaaacttgcatcttgatcagtagtgggattgcgcctacaaatagccactgcactgcagactgggcaacatagtgagaccttgtctct
Upvotes: 0
Views: 396
Reputation: 12255
If your files arent huge, and awk is able to hold all of file2 in memory, you can do this:
awk '
ARGIND==1 { save[tolower($NF)] = $0 }
ARGIND==2 { col1 = tolower($1)
for(pat in save){
if(pat ~ col1)print $0 " ----- " save[pat]
}
}
' file2 file1
This reads file2 first and saves each line ($0) in associative array save
, indexed by the last field ($NF) converted to lowercase.
It then reads file1 (so ARGIND is 2, 2nd file), and converts column 1 to lowercase. Then it tries to match (~) this string (or pattern really) against each index in the array. If it matches it prints the current line from file1 and the saved line from file2.
Upvotes: 1