FiestaJ

Reputation: 13

Rename file using fasta header

I have multiple fasta files downloaded from NCBI and want to rename them with some part of the header:

Example of the header: >KY705281.1 Streptococcus phage P7955, complete genome
Example of filename: KY705281.fasta

The idea is to get rid of 'KY705281.1' and 'complete genome' so that only 'Streptococcus phage P7955' remains.

For example, one input file will be:

>KY705281.1 Streptococcus phage P7955, complete genome
AGAAAGAAAAGACGGCTCATTTGTGGGTTGTCTTTTTTTGATTAAGTAATGAAGGAGGTGGATGTATTGG
GCTAAATCAACGACAAAAACGATTTGCAGACGAATATTTGATATCTGGTGTCGCTTACAATGCAGCTATC
AAAGCTGGGTATTCTGAGAAATACGCTAGAGCAAGAAGTCATACCTTGTTGGAAAATGTCGGCAT

It will be renamed to KY705281.fasta with content:

>Streptococcus phage P7955 
AGAAAGAAAAGACGGCTCATTTGTGGGTTGTCTTTTTTTGATTAAGTAATGAAGGAGGTGGATGTATTGG
GCTAAATCAACGACAAAAACGATTTGCAGACGAATATTTGATATCTGGTGTCGCTTACAATGCAGCTATC
AAAGCTGGGTATTCTGAGAAATACGCTAGAGCAAGAAGTCATACCTTGTTGGAAAATGTCGGCAT

I'm new to Linux, but from some Google searching I gather that this can be done easily with a few awk/sed/grep commands.
Any advice would be appreciated.

Upvotes: 0

Views: 1059

Answers (1)

Tyl

Reputation: 5252

One way could be:

awk -F, 'FNR==1{match($1, "^>([^.]+)[^ ]+ (.*)", oFv); $1= ">" oFv[2]; sub(/ *complete genome */, "", $2);}{print > (oFv[1] ".fasta")}' somefiles*

This keeps the old files and writes a corresponding new file for each of them.
It also assumes that each input file contains a single record with one header line, as in your example.
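To make it clearer what is being pulled out of the header, here is a minimal sketch that works on the example header as a plain string. The variable names `header`, `acc` and `desc` are only for illustration, and it assumes every header follows the same ">ACCESSION.VERSION description, complete genome" pattern as the one in the question:

```shell
#!/usr/bin/env bash
# Example header taken from the question.
header='>KY705281.1 Streptococcus phage P7955, complete genome'

# Accession without the version suffix: the text between '>' and the first '.'.
acc=$(printf '%s\n' "$header" | sed -E 's/^>([^.]+).*/\1/')

# Description: drop the first word (the accession) and the trailing ", complete genome".
desc=$(printf '%s\n' "$header" | sed -E 's/^[^ ]+ //; s/, complete genome *$//')

# Print the file name and the header line that would be written.
printf '%s.fasta\n>%s\n' "$acc" "$desc"
```

Headers that do not end in ", complete genome" keep their full description, so it is worth checking a few of your files by eye first.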

If you want to rename the old files as well as change their contents,
and assuming you are on bash with GNU awk and GNU sed (which I think you are),
please back up your files and try this:

#!/usr/bin/bash
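for file in somefiles*; do
    # New name: the accession between '>' and the first '.', plus ".fasta"
    nn="$(awk -F[\>.] '{printf $2 ".fasta";exit}' "$file")"
    # Edit the header in place: drop the accession and ", complete genome"
    sed -ri '1{s/^[^ ]* />/;s/, complete genome//;}' "$file"
    if [ ! -f "$nn" ];
    then
        mv "$file" "$nn"
    else
        # -a appends, so earlier messages in the log are not overwritten
        echo "'$nn' exists, skip '$file', its content already changed." | tee -a _err_.log
    fi
done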

Or as a one-liner:

for file in somefiles*; do nn="$(awk -F[\>.] '{printf $2 ".fasta";exit}' "$file")"; sed -ri '1{s/^[^ ]* />/;s/, complete genome//;}' "$file"; if [ ! -f "$nn" ]; then mv "$file" "$nn"; else echo "'$nn' exists, skip '$file', its content already changed." | tee -a _err_.log; fi; done
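If you would rather not end up with files whose content was changed but whose name was not, one possible variant is to check the target name first and only then rewrite the header. This is just a sketch under the same assumptions about the header format; `rename_fasta` is a hypothetical helper name, and the demo file `download1.fasta` with its shortened sequence is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename_fasta FILE
# Writes the cleaned file under ACCESSION.fasta next to FILE, then removes FILE.
rename_fasta() {
    local file=$1 dir acc target tmp same=no
    dir=$(dirname -- "$file")
    # Accession without version, taken from the first line only.
    acc=$(head -n 1 -- "$file" | sed -n -E 's/^>([^. ]+).*/\1/p')
    [ -n "$acc" ] || { echo "no FASTA header in '$file'" >&2; return 1; }
    target="$dir/$acc.fasta"
    # Is the target the very file we are processing?
    [ "$target" -ef "$file" ] && same=yes
    # Refuse to overwrite a different, already existing file.
    if [ -e "$target" ] && [ "$same" = no ]; then
        echo "'$target' exists, skipping '$file' (left untouched)" >&2
        return 1
    fi
    tmp=$(mktemp "$dir/.fasta.XXXXXX") || return 1
    # Rewrite only line 1; sequence lines are copied unchanged.
    if sed -E '1{s/^>[^ ]+ />/;s/, complete genome *$//;}' "$file" > "$tmp" &&
       mv -- "$tmp" "$target"; then
        [ "$same" = yes ] || rm -- "$file"
    else
        rm -f -- "$tmp"
        return 1
    fi
}

# Demo in a scratch directory with a shortened, made-up input file.
demo=$(mktemp -d)
printf '%s\n' '>KY705281.1 Streptococcus phage P7955, complete genome' \
    'AGAAAGAAAAGACGGCTCATTTGTGGG' > "$demo/download1.fasta"
rename_fasta "$demo/download1.fasta"
head -n 1 "$demo/KY705281.fasta"
```

To run it on real data you could loop with `for f in somefiles*; do rename_fasta "$f"; done`. Note that mktemp creates the new file with restrictive permissions, so you may want to chmod the results afterwards.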

Upvotes: 0
