ShannonTown
ShannonTown

Reputation: 29

How to extract sequence from a fastq file and save as new file for each sequence

I have a fastq file in which the first 8 lines look like this:

@SRR21388627.2845086/1
GCTGCAGTTGCTGCTGTTGCTGCTGCTGGGGCAGCACACCAGGATGGCCGGCGCCCCCG
+
FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FF
@SRR21388627.2707233/1
GCTGCAGTTGCTGCTGTTGCTGCTGCTGGGGCAGCACACCAGGATGGCCGGCGCCCCCG
+
FFFF:FF,:FFFF,FF,F:FFFFF:,F,,:,FF:,:,FFF:::F:,,FFF:::,FF:::

I would like to take the DNA sequence and save each sequence as a new file named with the line before the sequence, such as "SRR21388627.2845086.1.fq", where the @ is removed, and the / is replaced with .

So far I came up with a command with reference from this post, which doesn't work yet, because I am not sure how to remove @ and replace / in awk

cat deltaQ_1_region_1.fq | paste - - - - | cut -f1,2 | 
awk -F'\t' '$1!=prev{close(out); out=$1".fq"; prev=$1} {sub(/[^\t]+\t/,""); print > out}' file

The expected output for SRR21388627.2845086.1.fq is:

GCTGCAGTTGCTGCTGTTGCTGCTGCTGGGGCAGCACACCAGGATGGCCGGCGCCCCCG

Upvotes: 1

Views: 233

Answers (3)

Daweo
Daweo

Reputation: 36700

Note: this answer show how to get desired result with minimal alteration of existing code. Generally it is not necessary to use cut and awk in one pipeline as 2nd can easily handle columns and often cut part might be eliminated, as shown in other answers.

how to remove @ and replace / in awk

sub String Function can be used for that purpose. As you store filename in variable named out, you should pass that as 3rd (optional) argument to sub functions so it changes out, namely replace this action

{close(out); out=$1".fq"; prev=$1}

using

{close(out); out=$1".fq"; sub(/@/,"",out); sub(/\//,".",out); prev=$1}

Upvotes: 2

Fravadona
Fravadona

Reputation: 17216

You have 4 lines per "record" so you can use a modulo 4 to differentiate the "fields" in awk (instead of joining them with a paste before calling awk):

awk '
    NR%4 == 1 { sub("/","."); file_out = substr($0,2) ".fq"; next }
    NR%4 == 2 { print > file_out; close(file_out) }
' file.fq

Upvotes: 4

markp-fuso
markp-fuso

Reputation: 35256

Assuming the number of lines that make up a 'DNA sequence' could vary:

awk '
/^@/ { close(outfile)         # close previous output file
       gsub(/@/,"")           # remove all "@" characters from current line
       gsub(/[/]/,".")        # replace all "/" with "." on current line 
       outfile = $0 ".fq"     # define new output file name
       next                   # skip to next line of input
     }
     { print > outfile }      # write current line to output file
' deltaQ_1_region_1.fq

Results:

$ head SRR*fq
==> SRR21388627.27072331.fq <==
GCTGCAGTTGCTGCTGTTGCTGCTGCTGGGGCAGCACACCAGGATGGCCGGCGCCCCCG
+
FFFF:FF,:FFFF,FF,F:FFFFF:,F,,:,FF:,:,FFF:::F:,,FFF:::,FF:::

==> SRR21388627.28450861.fq <==
GCTGCAGTTGCTGCTGTTGCTGCTGCTGGGGCAGCACACCAGGATGGCCGGCGCCCCCG
+
FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FF

Upvotes: 3

Related Questions