BioMan

Reputation: 704

merge two files based on partial match between strings

I have two files where the strings in the first column of file1 partially match the strings in the last column of file2. I would like to merge the two files based on the match between these strings. How can I do this when the match is only partial, meaning that the string in file1 is often a substring of the one in file2? PS: Case should be ignored.

file1:

AGTAAGGTCAGCTAAATAAGCTATCGGGCCCATACCCCGAAAATGTTGGTTATATCCTTCCCGTACTA    0   1   2   3
CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT  2   11  14  0
AAAGTGGCCTACGCCACCGCCATGGACTGGTTCATAGCCGTGTGCTATGCCTTC  1   2   3   4
AAAGTGTCATATGCCACTGCCATGGATTGGTTCATAGCTGTTTGCTTTGCATTC  50  1   1   21
TACCCTGTAGAACCGAANTTGT  0   0   1   4
TCCCTGTGGTCTAGTGGTTAGGATTCTGCGCTCTCACCGCCGCGGCCCGGG 1   0   4   3
GGGCCAGGATGAAACCTAATTTGAGTGGCCATCCATGGATGAGAAATGCGG 0   1   3   0

file2:

chrX    Rfam    ncRNA               55609165    55609267    53.97   +   0   ID=RF00019.20;Name=RF00019;Alias=Y_RNA;Note=AL627224.14/36063-36164 chrX:55609165-55609267  ggctggtttgagtgcagtgatgcttacaactaattgatcacatccaattacagatttctttgctctttctgtactcccagtgcttcacttgactagccttta
chrX    Rfam    regulatory_region   57233087    57233370    53.02   -   0   ID=RF01417.3;Name=RF01417;Alias=RSV_RNA;Note=Z83745.1/45303-45021 chrX:57233087-57233370    gtaaatgcaaaccattcacagtcttgctcagctaaggggatagtaaagaaacagtcttttaaatcaatgactattaaaggccaatttcttggaatcatagcaggagaaggcagtcctggctgcaatgtccccataggttgtataactgaattaatggctcttaagtcagttaacattctccatttacctgattttttcttaattacaaaaactggagaatttcaaggggaaaatattggaactatgtgtcctttttctaattgttcagtaactaagtcctcta
chrX    Rfam    regulatory_region   61975961    61976233    45.45   -   0   ID=RF01417.4;Name=RF01417;Alias=RSV_RNA;Note=BX322784.3/89124-88853 chrX:61975961-61976233  AAAGTGTCATATGCCACTGCCATGGATTGGTTCATAGCTGTTTGCTTTGCATTC
chrX    Rfam    ncRNA               62059095    62059167    29.9    +   0   ID=RF00005.18;Name=RF00005;Alias=tRNA;Note=BX119964.4/4840-4911 chrX:62059095-62059167  GTTAATGTAGCTTAATTCATCAAAGCAAGGCACTGAAAAATGCCTAGATGAATACACATGATTCCATTAACA
chrX    Rfam    regulatory_region   62582448    62582735    62.81   -   0   ID=RF01417.5;Name=RF01417;Alias=RSV_RNA;Note=AL158203.12/36753-36467 chrX:62582448-62582735 gtaaacacaaatttttctctgtccttctctgctagatgaatggtataaaaacaatctttaagtcaacaacgattataggccaatcttcaggaattgccacaggggaggggaggacctgttgaagagaccccataggttgcaaattagcattaatagcagttaagtagtgcaaaagtctccatttaccagactttttgggaatgacgaaaatgggcgaattccaaaggctgtttgatggttctatatggccagctttcaattgctcctcaactaattcatgggctctc
chrX    Rfam    ncRNA               63430570    63430868    141.38  +   0   ID=RF00017.15;Name=RF00017;Alias=Metazoa_SRP;Note=AL355852.23/124872-125169 chrX:63430570-63430868  cctggggcagtggcacatgcctgtagtcccagctacttgggaggctgaagcaggaggatagcttaagttcaggagttctgggatgtaatgcactatgctgatagggtgtctgcactaagttcagcatcaacatggtgacctcccaggagcaggggaccaccaggctgcctaaggaggtatgaactggccgagatcagaaacggagcacataaaaacttgcatcttgatcagtagtgggattgcgcctacaaatagccactgcactgcagactgggcaacatagtgagaccttgtctct

Upvotes: 0

Views: 396

Answers (1)

meuh

Reputation: 12255

If your files aren't huge and awk can hold all of file2 in memory, you can do this (note that ARGIND is specific to GNU awk):

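awk '
# file2 (first argument): index each line by its lowercased last field
ARGIND == 1 { save[tolower($NF)] = $0 }
# file1 (second argument): test column 1 against every saved sequence
ARGIND == 2 {
    col1 = tolower($1)
    for (pat in save) {
        if (pat ~ col1) print $0 " ----- " save[pat]
    }
}
' file2 file1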

This reads file2 first and saves each line ($0) in the associative array save, indexed by the last field ($NF) converted to lowercase.

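It then reads file1 (so ARGIND is 2, the second file) and converts column 1 to lowercase. It then matches (~) this string, which awk treats as a regular expression, against each index in the array. Because the pattern is unanchored, it matches whenever the file1 string occurs anywhere inside the file2 string. On a match, it prints the current line from file1 followed by the saved line from file2.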
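If your awk has no ARGIND (it is a gawk extension), a minimal sketch of the same idea might look like the following. It assumes file2 is non-empty (otherwise NR == FNR would also be true while reading file1), and it swaps the regex match for a literal substring test with index(), so characters in the file1 string are never interpreted as regex syntax. The function name merge_partial is made up for illustration.

```shell
# Hypothetical helper: merge_partial FILE2 FILE1
# Same logic as the gawk version, but portable to POSIX awk.
merge_partial() {
    awk '
    # First file (file2): index each line by its lowercased last field.
    NR == FNR { save[tolower($NF)] = $0; next }
    # Second file (file1): literal, case-insensitive substring test.
    {
        col1 = tolower($1)
        for (key in save)
            if (index(key, col1))
                print $0 " ----- " save[key]
    }' "$1" "$2"
}
```

It is called with the files in the same order as above: merge_partial file2 file1. As in the original, two lines of file2 that share the same last field overwrite each other in the array, so only the later one is kept.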

Upvotes: 1
