Reputation: 75
I have a bunch of TSV files in a folder, and for every one of them I would like to get a FASTA file whose header (after the '>' sign) starts with the name of the file, followed by the value in column 3. Each TSV file has 5 columns and no header row.
For example:
Input file, called "A.coseq.table_headless.tsv":
HIV1B-pol-seed 15 MAX 1959 GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC
Output file, called "A.fasta":
>A_MAX
GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC
I want to run the script in bash over all the files at once. I have this script, which does not work because the awk print statement contains a curly brace:
for sample in `ls *coseq.table_headless.tsv`
do
base1=$(basename $sample "coseq.table_headless.tsv")
awk '{print ">"${base1}"_"$3"\n"$5}' ${base1}coseq.table_headless.tsv > ${base1}fasta
done
Any idea how to correct this code? Thank you very much
Upvotes: 0
Views: 2062
Reputation: 92854
Another awk solution:
awk '{ pfx=substr(FILENAME,1,index(FILENAME,".")-1);
printf(">%s_%s\n%s\n",pfx,$3,$5) > pfx".fasta" }' *coseq.table_headless.tsv
pfx contains the first part of the filename (up to the first ".").
Upvotes: 0
Reputation: 203684
The other solutions posted so far have a few issues:
not closing the files as they're written will produce "too many open files" errors unless you use GNU awk,
calculating the output file name every time a line is read rather than once when the input file is opened is inefficient, and
using an unparenthesized expression on the right side of output redirection is undefined behavior and so will only work in some awks (including GNU awk).
This will work robustly and efficiently in all awks:
awk '
FNR==1 { close(out); f=FILENAME; sub(/\..*/,"",f); pfx=">"f"_"; out=f".fasta" }
{ print pfx $3 ORS $5 > out }
' *coseq.table_headless.tsv
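To illustrate the point about redirection: when the output name is built by concatenation in place, wrapping it in parentheses removes the ambiguity. The following is only a minimal sketch on made-up data; the files x.coseq.table_headless.tsv and y.coseq.table_headless.tsv, their contents, and the variable name prev are assumptions for the demonstration.

```shell
# Work in a scratch directory so the demo does not touch real data.
cd "$(mktemp -d)" || exit 1

# Two hypothetical 5-column, headerless, tab-separated inputs.
printf 'a\t1\tMAX\t9\tACGT\n' > x.coseq.table_headless.tsv
printf 'b\t2\tMIN\t8\tTTGA\n' > y.coseq.table_headless.tsv

awk '
# On the first line of each input: close the previous output, then
# derive the prefix by dropping everything from the first "." onward.
FNR==1 { if (prev != "") close(prev ".fasta"); f=FILENAME; sub(/\..*/,"",f); prev=f }
# The parentheses make the redirection target an unambiguous expression.
{ print ">" f "_" $3 "\n" $5 > (f ".fasta") }
' *.coseq.table_headless.tsv
```

Each input then yields its own file, x.fasta and y.fasta, with one header line and one sequence line per input row.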
Upvotes: 0
Reputation: 67507
If the basename is the part up to the first ".", you can get rid of the loop as well.
awk '{split(FILENAME,base,".");
print ">" base[1] "_" $3 "\n" $5 > base[1]".fasta"}' *coseq.table_headless.tsv
Upvotes: 2
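For completeness, the loop from the question can also be repaired directly. The root problem is that ${base1} sits inside single quotes, so the shell never expands it and awk sees a stray "{". A common way around this is to pass the shell value in with awk's -v option. The sketch below assumes the suffix to strip is ".coseq.table_headless.tsv" (including the leading dot, so the header comes out as ">A_MAX" rather than ">A._MAX"); the awk variable name base and the sample file are illustrative.

```shell
# Work in a scratch directory so the demo does not touch real data.
cd "$(mktemp -d)" || exit 1

# Sample input matching the layout in the question (tab-separated, no header).
printf 'HIV1B-pol-seed\t15\tMAX\t1959\tGTAACAGACTCACAATATGCATTAGGAATCATTCAAGC\n' > A.coseq.table_headless.tsv

# Glob directly instead of parsing the output of ls.
for sample in *.coseq.table_headless.tsv
do
    # Strip the suffix to get the sample name, e.g. "A".
    base1=$(basename "$sample" .coseq.table_headless.tsv)
    # -v copies the shell variable into an awk variable named base.
    awk -v base="$base1" '{ print ">" base "_" $3 "\n" $5 }' "$sample" > "$base1.fasta"
done
```

This keeps one awk process per file, so it is slower than the single-awk answers above on large batches, but it stays close to the original script.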