Alex Trevylan

Reputation: 545

How do I join a fasta file and a txt file?

I have a fasta file that looks something like this:

> ASst1|LK||eukaryota|Homo sapiens
YYNRLINTLLDNGIEPIVSIYHWDLPQRLQDLGGWPNIVLAIYTENYARVLFKNFGDRVK
LWITFNEPRIFMGGYTSDTGMAPSINTPGIGDYLTSRTVLIAHANIYHMYEREFKQQQKG
KIGITLTGFWCEPLTPDFTERCERYQQFQLGLYAHPIFTGHGDYPSVVIERVDNNSKVEG
FTTSRLPKLTSEEVNYIKGTYDFFGINFYTAQVGLNGVVGGIPSRERDMGTIVLQDPNWP
> >ASstj1|TH1||eukaryota|Mus musculus 
FWLVVSQLLYFPRDAHCLADIPSEAILDNNIPLINNLTFPDGFLFGAATAAYQIEGAWN
VDGKGPSIWDEFTHTHPEIITDHSTGDDACKSYYKYKEDVQAAKTMGLDSYRFSMSWPRI
MPTGFPDNINQKGIDYYNNLINELVDNGIMPLVTMYHWDLPQNLQTYGGWLNESIVPLYV
SYARVLFENFGDRVKWWLTFNEPQFVSLGYEFRVMAPGIFTNGTGPYIASTNVLKAHA

I have another file with information:

Homo sapiens    9606    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 

Mus musculus    10090   cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Glires;Rodentia;Myomorpha;Muroidea;Muridae;Murinae;Mus;Mus;Mus musculus

I want to join the two files so that the result looks like the following:

> ASst1|LK||eukaryota|Homo sapiens cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
YYNRLINTLLDNGIEPIVSIYHWDLPQRLQDLGGWPNIVLAIYTENYARVLFKNFGDRVK
LWITFNEPRIFMGGYTSDTGMAPSINTPGIGDYLTSRTVLIAHANIYHMYEREFKQQQKG
KIGITLTGFWCEPLTPDFTERCERYQQFQLGLYAHPIFTGHGDYPSVVIERVDNNSKVEG
FTTSRLPKLTSEEVNYIKGTYDFFGINFYTAQVGLNGVVGGIPSRERDMGTIVLQDPNWP
> >ASstj1|TH1||eukaryota|Mus musculus cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Glires;Rodentia;Myomorpha;Muroidea;Muridae;Murinae;Mus;Mus;Mus musculus
FWLVVSQLLYFPRDAHCLADIPSEAILDNNIPLINNLTFPDGFLFGAATAAYQIEGAWN
VDGKGPSIWDEFTHTHPEIITDHSTGDDACKSYYKYKEDVQAAKTMGLDSYRFSMSWPRI
MPTGFPDNINQKGIDYYNNLINELVDNGIMPLVTMYHWDLPQNLQTYGGWLNESIVPLYV
SYARVLFENFGDRVKWWLTFNEPQFVSLGYEFRVMAPGIFTNGTGPYIASTNVLKAHA

I don't think join alone would work here. It would work if I first extracted the headers into a separate list (e.g. with grep '>') and then joined the two files, but I need each sequence printed below its header. Any thoughts would be most helpful.

Upvotes: 0

Views: 30

Answers (1)

mklement0

Reputation: 438123

Try the following:

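awk -F'[\t|]' '
  # First file (lookup): map species name (field 1) to lineage (field 3).
  FNR==NR { dict[$1]=$3; next }
  # FASTA header lines: append the lineage for the species in the last field.
  /^> / { $0 = $0 " " dict[$NF] }
  # Print every FASTA line, modified or not.
  { print }
' fileLookup fileFasta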

Assumptions:

  • Your lookup file is tab-separated.

  • The trailing space after Mus musculus in the fasta file sample isn't in the real data file.
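
If the second assumption does not hold, the command can be made more tolerant. The following is a minimal sketch, not a tested drop-in replacement: it strips trailing whitespace from header lines before the lookup and appends a lineage only when the species is actually present in the lookup file. The function name join_lineage and the shortened sample records are made up for illustration; the lookup file is still assumed to be tab-separated.

```shell
# join_lineage is a hypothetical helper name, not part of the original answer.
# Usage: join_lineage LOOKUP_FILE FASTA_FILE
join_lineage() {
  awk -F'[\t|]' '
    # First file: map species name (field 1) to lineage (field 3).
    FNR == NR { dict[$1] = $3; next }
    # Header lines: drop trailing whitespace, then append the lineage if known.
    /^>/ {
      sub(/[ \t\r]+$/, "")
      key = $NF
      if (key in dict) $0 = $0 " " dict[key]
    }
    # Print every FASTA line, modified or not.
    { print }
  ' "$1" "$2"
}

# Demo on one made-up record (shortened lineage and sequence);
# note the trailing space after "Mus musculus" in the header.
tmpdir=$(mktemp -d)
printf 'Mus musculus\t10090\tcellular organisms;Eukaryota;Mus musculus\n' > "$tmpdir/fileLookup"
printf '> ASstj1|TH1||eukaryota|Mus musculus \nFWLVVSQLLY\n' > "$tmpdir/fileFasta"
join_lineage "$tmpdir/fileLookup" "$tmpdir/fileFasta"
rm -rf "$tmpdir"
```

Headers whose species is missing from the lookup file pass through unchanged apart from the trimmed whitespace, which makes unmatched records easy to spot afterwards.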

Upvotes: 1
