Extract each gene's sequence data into an individual file

Question

I have a file, ecoli.ffn, in which each header line gives the name of a sequenced gene:

$ head ecoli.ffn
>ecoli16:g027092:GCF_000460315:gi|545267691|ref|NZ_KE701669.1|:551259-572036
ATGAGCCTGATTATTGATGTTATTTCGCGT
AAAACATCCGTCAAACAAACGCTGATTAAT
>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
>ecoli16:g000012:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
CTGACAGCTGTTCTTACACTGGATTCAACC
CTGACAGCTGTTCTTACACTGGATTCAACC

As shown above, the gene name is between the 1st and 2nd colon:

g027092
g000011
g000012

I would like to use ecoli.ffn to generate three files, g027092.txt, g000011.txt, and g000012.txt, each containing the sequence data for that gene.

For example, g027092.txt will contain the raw sequence without the header:

$ cat g027092.txt
ATGAGCCTGATTATTGATGTTATTTCGCGT
AAAACATCCGTCAAACAAACGCTGATTAAT

How can I do this?

karakfa · Accepted Answer

awk to the rescue!

$ awk -F: -v RS=">" 'NR==FNR{n=split($0,t,"\n");
                             for(i=1;i<=n;i++) if(t[i]!="") a[t[i]];
                             next}
                     $2 in a{file=$2".txt";
                             sub(/[^\n]+\n/,"");
                             print > file}' index file


$ head g*.txt
==> g000011.txt <==
GTGTACGCTATGGCGGGTAATTTTGCCGAT


==> g000012.txt <==
GTGTACGCTATGGCGGGTAATTTTGCCGAT
CTGACAGCTGTTCTTACACTGGATTCAACC
CTGACAGCTGTTCTTACACTGGATTCAACC


==> g027092.txt <==
ATGAGCCTGATTATTGATGTTATTTCGCGT
AAAACATCCGTCAAACAAACGCTGATTAAT

Explanation

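NR==FNR{n=sp... this block parses the first file (index, which appears to hold one gene name per line) and builds the lookup table a, keyed by gene name.

$2 in a{file=$2".txt"; if the gene name of the current record ($2, the text between the 1st and 2nd colon) is in the lookup table, set the output file name to that name plus the .txt extension.

sub(/[^\n]+\n/,"") deletes the header line, that is, everything up to and including the first newline of the record.

print > file prints what remains of the record to that file.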
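
The question does not mention an index file, so if every gene in ecoli.ffn should get its own file, the lookup table may not be needed. Below is a minimal sketch under that assumption. It reads the file line by line and uses the sample data from the question. It also assumes gene names are unique, because reopening the same name with > after close() would overwrite the earlier file.

```shell
# Work in a scratch directory so the demo does not touch real data.
cd "$(mktemp -d)"

# Sample input copied from the question.
cat > ecoli.ffn <<'EOF'
>ecoli16:g027092:GCF_000460315:gi|545267691|ref|NZ_KE701669.1|:551259-572036
ATGAGCCTGATTATTGATGTTATTTCGCGT
AAAACATCCGTCAAACAAACGCTGATTAAT
>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
>ecoli16:g000012:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
CTGACAGCTGTTCTTACACTGGATTCAACC
CTGACAGCTGTTCTTACACTGGATTCAACC
EOF

# Header line: close the previous output file, then name the next one after
# the second colon-separated field. Any other line: write it to the current file.
awk -F: '
  /^>/       { if (file != "") close(file); file = $2 ".txt"; next }
  file != "" { print > file }
' ecoli.ffn
```

Calling close() on each finished file keeps only one output file open at a time, which matters for awk implementations that limit the number of open files when the input holds thousands of genes. Unlike the record-based version above, this one leaves no trailing blank line in the output files.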
