malcatraz

Reputation: 3

Replacing a string in one file with the contents of another file, based on a common string

I have two files. I would like to replace a certain string in file 1 with the contents of file 2, matched on a string the two files have in common.

file 1

   Chr5 psl2gff exon    15907715    15907933    .   +   .   NM_001046410
   Chr2 psl2gff exon    8898358     8898394     .   +   .   NM_001192190

file 2

NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d
NM_001192190 gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1

output

  Chr5  psl2gff exon    15907715    15907933    .   +   .   gene_id TUBA1D; transcript_id tubulin, alpha 3d
  Chr2  psl2gff exon    8898358     8898394     .   +   .   gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1

In file 1 there are multiple instances of the same string, but file 2 has each one only once. I would like every instance of an NM_**** ID in file 1 to be replaced by the rest of the line in file 2 whose first column matches that ID, so that the NM_**** ID itself no longer appears in the output.

I am very new to bash and similar tools. I have looked all over for a way to do this, but nothing I have tried so far has worked. Also, file 2 has over 5000 lines, and file 1 has many more.

Any help would be much appreciated!

Thanks.

Upvotes: 0

Views: 224

Answers (1)

karakfa

Reputation: 67467

This is a join operation. If the files are sorted on the join key and the whitespace is not significant, the easiest approach is:

$ join -1 9 -2 1 file1 file2 | cut -d' ' -f2-

Chr5 psl2gff exon 15907715 15907933 . + . gene_id TUBA1D; transcript_id tubulin, alpha 3d
Chr2 psl2gff exon 8898358 8898394 . + . gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
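If the inputs are not already sorted, join can still be used by sorting both files on the key first. The following is a minimal sketch, assuming space-separated copies of the sample data from the question; the names file1.sorted, file2.sorted and joined.txt are placeholders chosen for illustration, and LC_ALL=C is set so that sort and join agree on the ordering. Note that the result comes out in key order, not in the original order of file1.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"
export LC_ALL=C   # make sort and join use the same byte-wise collation

# Sample inputs copied from the question (whitespace normalised to single spaces).
cat > file1 <<'EOF'
Chr5 psl2gff exon 15907715 15907933 . + . NM_001046410
Chr2 psl2gff exon 8898358 8898394 . + . NM_001192190
EOF
cat > file2 <<'EOF'
NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d
NM_001192190 gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
EOF

# join requires both inputs sorted on the join field:
# field 9 in file1, field 1 in file2.
sort -b -k9,9 file1 > file1.sorted
sort -b -k1,1 file2 > file2.sorted

# join prints the key first; cut drops it again.
join -1 9 -2 1 file1.sorted file2.sorted | cut -d' ' -f2- > joined.txt
cat joined.txt
```

Repeated IDs in file1 are fine here: join pairs every file1 line with the single matching file2 line.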

If the files are not sorted and the whitespace is important, awk is a better solution:

$ awk 'NR==FNR  {k=$1; $1=""; a[k]=$0; next}   # file2: store the rest of each line under its ID
       $NF in a {sub(FS $NF"$",a[$NF])}1' file2 file1

   Chr5 psl2gff exon    15907715    15907933    .   +   .  gene_id TUBA1D; transcript_id tubulin, alpha 3d
   Chr2 psl2gff exon    8898358     8898394     .   +   .  gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
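Two details of the one-liner are worth knowing. With the default FS, the pattern built from FS matches a literal space before the ID, so it will not match if file1 is tab-separated. Also, an & in sub()'s replacement text stands for the matched text, which would matter if a description in file2 contained one. The following is a sketch of a variant that avoids both by cutting the ID off with substr() instead of sub(). It assumes a tab-separated file1 (my assumption, since GFF-style files usually are), and map and annotated.txt are names chosen for illustration.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Tab-separated version of the question's file1 (assumed layout).
printf 'Chr5\tpsl2gff\texon\t15907715\t15907933\t.\t+\t.\tNM_001046410\n' >  file1
printf 'Chr2\tpsl2gff\texon\t8898358\t8898394\t.\t+\t.\tNM_001192190\n'   >> file1
cat > file2 <<'EOF'
NM_001046410 gene_id TUBA1D; transcript_id tubulin, alpha 3d
NM_001192190 gene_id BOD1L1; transcript_id biorientation of chromosomes in cell division 1 like 1
EOF

awk '
NR == FNR {                          # first file on the command line (file2)
    key = $1                         # remember the ID
    sub(/^[ \t]*[^ \t]+[ \t]+/, "")  # strip the ID and the blanks after it
    map[key] = $0                    # ID -> rest of the line
    next
}
$NF in map {                         # file1 line whose last field is a known ID
    sub(/[ \t]+$/, "")               # drop trailing blanks so the ID ends the line
    # keep everything before the ID (original spacing intact), append the annotation
    $0 = substr($0, 1, length($0) - length($NF)) map[$NF]
}
{ print }                            # lines without a known ID pass through unchanged
' file2 file1 > annotated.txt
cat annotated.txt
```

Because the lookup table is built from file2 before file1 is read, neither file has to be sorted, and the line order of file1 is preserved.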

The exercise for you is to understand the code. There are many examples (>100) on this site for exactly this question, many with commented scripts, some of which I wrote.

Upvotes: 1
