Reputation: 343
I have a file which contains several lines as follows:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>
I want to extract all lines containing both and including the header lines.
I have tried using grep, but it only extracts the sequence lines but not the header lines.
grep <pattern_1> | grep <pattern_2> input.fasta > output.fasta
How to extract lines containing both the patterns and the headers in Linux? The patterns can be present anywhere in the lines. Not limited to start or end of the lines.
Expected output:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
Upvotes: 2
Views: 1098
Reputation: 26581
You might be interested in BioAwk, it is an adapted version of awk which is tuned to process fasta files
bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
'($seq ~ seq1) && ($seq ~ seq2) { print ">"$name; print $seq }' file.fasta
If you want seq1
at the beginning and seq2
at the end, you can change it into:
bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
'($seq ~ "^"seq1) && ($seq ~ seq2"$") { print ">"$name; print $seq }' file.fasta
This is really practical for processing fasta files, as often the sequence is spread over multiple lines. The above code handles this very easily as the variable $seq
contains the full sequence.
If you do not want to install BioAwk, you can use the following method to process your FASTA file. It will allow multi-line sequences and does the following:
>
in the header, except the first character)name
(not really needed)pattern1
or pattern2
will not fail if the pattern is split over multiple lines.The following awk does the requested:
awk -v seq1="pattern1" -v seq2="pattern2" \
'BEGIN{RS=">"; ORS=""; FS="\n"}
{ seq="";for(i=2;i<=NF;++i) seq=seq""$i; gsub(/[^a-zA-Z0-9]/,"",seq) }
(seq ~ seq1 && seq ~ seq2){print ">" $0}' file.fasta
If the record header contains other >
characters which are not at the beginning of the line, you have to take a slightly different approach (unless you use GNU awk)
awk -v seq1="pattern1" -v seq2="pattern2" \
'/^>/ && (seq ~ seq1 && seq ~ seq2) {
print name
for(i=0;i<n;i++) print aseq[i]
}
/^>/ { seq=""; delete aseq; n=0; name=$0; next }
{ aseq[n++] = $0; seq=seq""$0; sub(/[^a-zA-Z0-9]*$/,"",seq) }
END { if (seq ~ seq1 && seq ~ seq2) {
print name
for(i=0;i<n;i++) print aseq[i]
}
}' file.fasta
note: we make use of sub
here in case unexpected characters are introduced in the fasta file (eg. spaces/tabs or CR
(\r
))
Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.
Upvotes: 1
Reputation: 50815
You can easily do that with awk like this:
awk '/^>/{h=$0;next}
/<pattern_1>/&&/<pattern_2>/{print h;print}' input.fasta > output.fasta
And here is a sed solution which yields the desired output as well:
sed -n '/^>/{N;/<pattern_1>/{/<pattern_2>/p}}' input.fasta > output.fasta
If it is likely that multiline records exist, you can use this:
awk -v pat1='<pattern_1>' -v pat2='<pattern_2>' '
/^>/ {r=$0;p=0;next}
!p {r=r ORS $0;if(chk()){print r;p=1};next}
p
function chk( tmp){
tmp=gensub(/\n/,"","g",r)
return (tmp~pat1&&tmp~pat2)
}' input.fasta > output.fasta
Upvotes: 3
Reputation: 12456
If your input file is exactly as described in your post then you can use:
grep -B1 '^<pattern_1>.*<pattern_2>$' input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
Where -B1
will display on top of the matching lines the line before it. The regex used is based on the hypothesis that your 2 patterns are located in the exact order at the beginning and at the end of the line. If this is not the case: use '.*<pattern_1>.*<pattern_2>.*'
. Last but not least, if the order of the 2 patterns are not always respected then you can use: '^.*<pattern_1>.*<pattern_2>.*$\|^.*<pattern_2>.*<pattern_1>.*$'
On the following input file:
cat input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header2b
<pattern_2>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_1>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>
output:
grep -B1 '^.*<pattern_1>.*<pattern_2>.*$\|^.*<pattern_2>.*<pattern_1>.*$' input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header2b
<pattern_2>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_1>
Upvotes: 1
Reputation: 37464
$ grep -A 1 header[12] file
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
man grep
:
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
grep -B 1 pattern_[12]
could work also, but you have several pattern_1
s in the sample data so... not this time.
Upvotes: 3
Reputation: 432
If you want grep to print lines around the match, use the -B flag for lines before, the -A for lines after, and -C for both before and after the match.
In your case, grep -B 1 seems like it would do the job.
Upvotes: 1