Mariano Avino

Reputation: 75

Converting from tsv to fasta

I have a bunch of TSV files in my folder, and for each one of them I would like to get a FASTA file where the header after the '>' sign begins with the name of the file. My TSV files have 5 columns and no header row:

Thus, for an input file called "A.coseq.table_headless.tsv" containing:

HIV1B-pol-seed 15 MAX 1959 GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC

the output file, called "A.fasta", should contain:

>A_MAX
GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC

I want to run the script in bash over all the files at once. The script below does not work, because of the curly braces in the awk print statement:

for sample in `ls *coseq.table_headless.tsv`
do
base1=$(basename $sample "coseq.table_headless.tsv")
awk '{print ">"${base1}"_"$3"\n"$5}' ${base1}coseq.table_headless.tsv > ${base1}fasta

done

Any idea how to correct this code? Thank you very much

Upvotes: 0

Views: 2062

Answers (3)

RomanPerekhrest

Reputation: 92854

Another awk solution:

awk '{ pfx=substr(FILENAME,1,index(FILENAME,".")-1); 
       printf(">%s_%s\n%s\n",pfx,$3,$5) > pfx".fasta" }' *coseq.table_headless.tsv 

  • pfx contains the first part of the file name (everything before the first ".")
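
To see what that expression evaluates to, here is a minimal sketch that applies the same substr()/index() call to a hard-coded string instead of FILENAME. The name is the example file from the question, and the shell variable pfx is only there for illustration:

```shell
# Apply the same substr()/index() expression to a fixed string instead of FILENAME.
pfx=$(awk 'BEGIN {
    name = "A.coseq.table_headless.tsv"            # example name from the question
    print substr(name, 1, index(name, ".") - 1)    # text before the first "."
}')
echo "$pfx"   # prints: A
```

Note that index() returns 0 when the name contains no ".", so this only gives a useful prefix if every input file name has a dot in it.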

Upvotes: 0

Ed Morton

Reputation: 203684

The other solutions posted so far have a few issues:

  1. not closing the files as they're written will produce "too many open files" errors unless you use GNU awk,

  2. calculating the output file name every time a line is read rather than once when the input file is opened is inefficient, and

  3. using an unparenthesized expression on the right side of output redirection is undefined behavior and so will only work in some awks (including GNU awk).

This will work robustly and efficiently in all awks:

awk '
    FNR==1 { close(out); f=FILENAME; sub(/\..*/,"",f); pfx=">"f"_"; out=f".fasta" }
    { print pfx $3 ORS $5 > out }
' *coseq.table_headless.tsv

Upvotes: 0

karakfa

Reputation: 67507

If the basename is the part up to the first ".", you can get rid of the loop as well.

 awk '{split(FILENAME,base,"."); 
       print ">" base[1] "_" $3 "\n" $5 > base[1]".fasta"}' *coseq.table_headless.tsv
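
If you would rather keep the loop from the question, the usual fix is to pass the shell variable into awk with -v. Bash does not expand ${base1} inside the single-quoted awk program, so awk sees it literally and reports a syntax error. A minimal sketch, assuming all file names end in ".coseq.table_headless.tsv"; the scratch directory and the sample row are only there so it can be run on its own:

```shell
# Work in a scratch directory with the sample row from the question.
cd "$(mktemp -d)" || exit 1
printf 'HIV1B-pol-seed\t15\tMAX\t1959\tGTAACAGACTCACAATATGCATTAGGAATCATTCAAGC\n' > A.coseq.table_headless.tsv

for sample in *.coseq.table_headless.tsv; do
    [ -e "$sample" ] || continue                # skip if the glob matched nothing
    base=${sample%.coseq.table_headless.tsv}    # strip the suffix, leaving "A"
    # -v copies the shell value into the awk variable "base"
    awk -v base="$base" '{ print ">" base "_" $3 ORS $5 }' "$sample" > "$base.fasta"
done

cat A.fasta
```

Looping over the glob directly, rather than over the output of ls, also keeps file names containing spaces intact.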

Upvotes: 2
