Reputation: 13
I have multiple fasta files downloaded from NCBI and want to rename them with some part of the header:
Example of the header: >KY705281.1 Streptococcus phage P7955, complete genome
Example of filename: KY705281.fasta
The idea is to get rid of 'KY705281.1' and ', complete genome' so that only 'Streptococcus phage P7955' remains.
For example, one input file will be:
>KY705281.1 Streptococcus phage P7955, complete genome
AGAAAGAAAAGACGGCTCATTTGTGGGTTGTCTTTTTTTGATTAAGTAATGAAGGAGGTGGATGTATTGG GCTAAATCAACGACAAAAACGATTTGCAGACGAATATTTGATATCTGGTGTCGCTTACAATGCAGCTATC AAAGCTGGGTATTCTGAGAAATACGCTAGAGCAAGAAGTCATACCTTGTTGGAAAATGTCGGCAT
It will be renamed to KY705281.fasta
with content:
>Streptococcus phage P7955
AGAAAGAAAAGACGGCTCATTTGTGGGTTGTCTTTTTTTGATTAAGTAATGAAGGAGGTGGATGTATTGG GCTAAATCAACGACAAAAACGATTTGCAGACGAATATTTGATATCTGGTGTCGCTTACAATGCAGCTATC AAAGCTGGGTATTCTGAGAAATACGCTAGAGCAAGAAGTCATACCTTGTTGGAAAATGTCGGCAT
I'm a newbie with Linux, but from some Google searching I gather that this could be done fairly easily with awk/sed/grep commands.
Any advice would be appreciated.
Upvotes: 0
Views: 1059
Reputation: 5252
One way could be:
awk -F, 'FNR==1{if (out) close(out); match($1, /^>([^.]+)[^ ]+ (.*)/, oFv); out = oFv[1] ".fasta"; $0 = ">" oFv[2]} {print > out}' somefiles*
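This keeps the old files and writes the corresponding new file(s), as long as the new name differs from the old one (otherwise awk would overwrite the file it is still reading).
It needs GNU awk for the three-argument match(), and it assumes each input file holds a single record, as in your example.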
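Before writing anything, it can help to preview what the header rewrite would produce. The following is only a minimal sketch, assuming every header has the `>ACCESSION.VERSION description, complete genome` shape from the question; the function name `preview_header` is made up for illustration and is not part of any tool. It uses plain awk features only, so it should not need GNU awk.

```shell
#!/usr/bin/env bash
# preview_header: hypothetical helper. Reads a FASTA file on stdin and prints
# "<new filename><TAB><new header>" for its first line, without changing anything.
preview_header() {
    awk 'NR == 1 {
        acc = $1                                # e.g. ">KY705281.1"
        sub(/^>/, "", acc)                      # drop the leading ">"
        sub(/\..*$/, "", acc)                   # drop the version suffix, e.g. ".1"
        desc = $0
        sub(/^[^ ]+ /, "", desc)                # drop the accession token
        sub(/, *complete genome *$/, "", desc)  # drop the trailing ", complete genome"
        printf "%s.fasta\t>%s\n", acc, desc
        exit
    }'
}

# Preview every matching file (the glob somefiles* is the placeholder used above)
for f in somefiles*; do
    if [ -f "$f" ]; then
        preview_header < "$f"
    fi
done
```

If the printed names and headers look right, the actual rewrite can be run with more confidence.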
If you want to rename the old files as well as change their contents, and assuming bash with GNU awk and GNU sed (which I think is what your system has),
please back up your files and try this:
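#!/usr/bin/bash
for file in somefiles*; do
    # New name: the accession between ">" and the first "." of the header, plus ".fasta"
    nn="$(awk -F'[>.]' '{printf "%s.fasta", $2; exit}' "$file")"
    # Rewrite the header in place: drop the accession and ", complete genome"
    sed -ri '1{s/^[^ ]* />/;s/, complete genome//;}' "$file"
    if [ ! -f "$nn" ]; then
        mv "$file" "$nn"
    else
        # Append to the log so earlier messages are not overwritten
        echo "'$nn' exists, skip '$file', its content already changed." | tee -a _err_.log
    fi
done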
Or as oneliner:
for file in somefiles*; do nn="$(awk -F'[>.]' '{printf "%s.fasta", $2; exit}' "$file")"; sed -ri '1{s/^[^ ]* />/;s/, complete genome//;}' "$file"; if [ ! -f "$nn" ]; then mv "$file" "$nn"; else echo "'$nn' exists, skip '$file', its content already changed." | tee -a _err_.log; fi; done
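Since `sed -i` edits the files in place, copying them somewhere first is cheap insurance. Here is a minimal sketch of one way to do that; the function name `backup_files` and the directory name `fasta_backup` are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# backup_files: hypothetical helper. Copies the given regular files into a
# backup directory (first argument), creating the directory if needed.
backup_files() {
    local dest="$1"
    shift
    mkdir -p -- "$dest" || return 1
    local f
    for f in "$@"; do
        if [ -f "$f" ]; then
            # -p preserves timestamps and permissions on the copy
            cp -p -- "$f" "$dest"/ || return 1
        fi
    done
}

# Usage (run before the rename loop):
#   backup_files fasta_backup somefiles*
```

GNU sed can also keep a per-file backup by itself when a suffix is given to `-i`, for example `sed -r -i.bak ...`, which leaves the original next to the edited file with `.bak` appended.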
Upvotes: 0