Apex
Apex

Reputation: 1096

Remove pattern and everything before using AWK in fasta file

I searched a lot but could not find a solution to my issue. I have a file that looks like:

>HEADER1
AACTGGTTACGTGGTTCTCT
>HEADER2
GGTTTCTC
>HEADER3
CCAGGTTTCGAGGGGTTACGGGGTA

I want to remove GGTT pattern and everything before it. So basically there are several of these patterns in some of the lines so I want to remove all of them including everything before the pattern or among them.

The desired output should look like:

>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA

I tried suggested ways but could not be able to adjust it to my data.

Thank you in Advance for your help.

Upvotes: 2

Views: 1107

Answers (2)

Timur Shtatland
Timur Shtatland

Reputation: 12347

Note that in general, sequences in fasta format as shown in the question may span multiple lines (= they are often wrapped to 80 or 100 nucleotides per line). This answer handles such cases correctly as well, unlike some other answers in this thread.

Use these two Perl one-liners connected by a pipe. The first one-liner does all of the common reformatting of the fasta sequences that is necessary in this and similar cases. It removes newlines and whitespace in the sequence (which also unwraps the sequence), but does not change the sequence header lines. It also properly handles leading and trailing whitespace/newlines in the file. The second one-liner actually removes everything up to and including the last GGTT in the sequence, in a case-insensitive manner.

Note: If GGTT is at the end of the sequence, the output will be a header plus an empty sequence. See seq4 in the example below. This may cause issues with some bioinformatics tools used downstream.

# Create the input for testing:

cat > in.fa <<EOF

>seq1 with blanks
ACGT GGTT ACGT
>seq2 with newlines
ACGT

GGTT

ACGT

>seq3 without blanks or newlines
ACGTGGTTACGT

>seq4 everything should be deleted, with empty sequence in the output
ACGTGGTTACGTGGTT

>seq5 lowercase
acgtggttacgt

EOF

# Reformat to single-line fasta, then delete subsequences:

perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' in.fa | \
  perl -pe 'next if /^>/; s/.*GGTT//i;' > out.fa

Output in file out.fa:

>seq1 with blanks
ACGT
>seq2 with newlines
ACGT
>seq3 without blanks or newlines
ACGT
>seq4 everything should be deleted, with empty sequence in the output

>seq5 lowercase
acgt

The Perl one-linera use these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.

chomp : Remove the input line separator (\n on *NIX).
if ( /^>/ ) : Test if the current line is a sequence header line.
$n : This variable is undefined (false) at the beginning, and true after seeing the first sequence header, in which case we print an extra newline. This newline goes at the end of each sequence, starting from the first sequence.
END { print "\n"; } : Print the final newline after the last sequence.
s/\s+//g; print; : If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.

next if /^>/; : Skip the header lines. s/.*GGTT//i; : Replace everything (.*) up to and including the the last GGTT with nothing (= delete it). The /i modifier means case-insensitive match.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups

Remove line breaks in a FASTA file

Upvotes: 2

jas
jas

Reputation: 10865

If it's not possible for your headers to include GGTT, I suppose the easiest would be:

$ sed 's/.*GGTT//' file
>HEADER1
CTCT
>HEADE2
TCTC
>HEADER3
ACGGGGTA

If your headers might contain GGTT, then awk probably be better:

$ awk '!/^>/ {sub(/.*GGTT/, "")}1' file
>HEADER1
CTCT
>HEADE2
TCTC
>HEADER3
ACGGGGTA

In both cases, the .*GGTT is "greedy", so it doesn't matter if there are multiple instances of GGTT, it will always match up to and remove everything through the last occurrence.

In the awk version, the pattern !/^>/ makes sure that subsitution is only done on lines that do not start with >.

Upvotes: 4

Related Questions