Anton Ivankin

Reputation: 47

Sed. remove multiline patterns. RegExp

I have a FASTQ file with strict formatting.

Input file:

@HWI-ST383:199:D1L73ACXX:3:1101:1309:1956 1:N:0:ACAGTGA 
+ 
JJJHIIJFIJJJJ=BFFFFFEEEEEEDDDDDDDDDDBD 
@HWI-ST383:199:D1L73ACXX:3:1101:3437:1952 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA 
+ 
IIIIFFF<?6?FAFEC@=C@1AE############### 

I solved my problem in my last question, but it turns out I did not understand the file format correctly. From the input file, I need to get this file:

output:

@HWI-ST383:199:D1L73ACXX:3:1101:3437:1952 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA 
+ 
IIIIFFF<?6?FAFEC@=C@1AE###############

That is, I want to remove every read that does not contain a sequence.

This script works correctly, but I can't write the regular expression that gets what I want:

awk '/\n[GATC]*\n/' RS=+ ORS=+

After the script runs, I expect to see the output file above. At this link you will see the expression that describes the rows I want to delete.

Upvotes: 1

Views: 1705

Answers (3)

Ed Morton

Reputation: 203684

It sounds like all you need is:

$ awk -v RS= '{gsub(/(^|\n)@[^\n]+\n\+\n[^\n]+\n/,"")}1' file
@HWI-ST383:199:D1L73ACXX:3:1101:3437:1952 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA
+
IIIIFFF<?6?FAFEC@=C@1AE###############

i.e. just delete any string that starts with "@" ((^|\n)@) then a bunch of non-newline chars ([^\n]+) then a + between newlines (\n\+\n) then a bunch of other non-newline chars terminated with a newline ([^\n]+\n). If any lines can have leading or trailing whitespace then just throw a [[:blank:]]* in wherever the white space could occur.
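
For what it's worth, here is a minimal sketch of a line-by-line alternative that tolerates trailing blanks without a multiline regexp. It is a different technique from the gsub() above (a small state machine), and it assumes strictly formatted input: no blank lines, one-line sequences, and a sequence-less read made of exactly three lines (header, +, quality). The function name drop_empty_reads is made up for illustration, and [ \t]* plays the role of [[:blank:]]*.

```shell
# Hypothetical helper: drop reads whose header is followed directly by "+".
# Assumes 4-line reads (header, sequence, +, quality) and 3-line
# sequence-less reads (header, +, quality), with no blank lines.
drop_empty_reads() {
  awk '
    state == 0 { held = $0; state = 1; next }   # header: hold it until the next line is seen
    state == 1 {
      if ($0 ~ /^[ \t]*\+[ \t]*$/) {            # "+" right after the header: no sequence
        state = 4                               # forget the header, skip the quality line too
        next
      }
      print held; print                         # normal read: emit header and sequence
      state = 2
      next
    }
    state == 2 { print; state = 3; next }       # "+" separator of a kept read
    state == 3 { print; state = 0; next }       # quality line of a kept read
    state == 4 { state = 0; next }              # quality line of a dropped read
  ' "$@"
}
```

Because it tracks the position inside each read instead of looking for a leading @, a quality line that happens to start with @ is not mistaken for a header. Usage would be something like: drop_empty_reads file > filtered.fastq (the output name is just an example).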

Upvotes: 0

Wintermute

Reputation: 44043

sed '/^@H/ { N; /\n+$/ { N; d } }' filename

This works as follows:

/^@H/ {     # if the current line begins with @H
  N         # fetch the next one, append it.
  /\n+$/ {  # if the combined pattern has \n+ at the end (that is, if the new 
            # line is "+")
    N       # fetch another line
    d       # and discard the lot.
  }
}
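
A variant of the same idea, as a sketch only: it matches any header starting with @ (not just @H) and allows trailing blanks after the +, since the sample input in the question appears to have them. The function name sed_drop_empty_reads is hypothetical, and the script assumes a sed that understands [[:blank:]].

```shell
# Hypothetical wrapper around the sed script above, loosened slightly.
sed_drop_empty_reads() {
  sed '/^@/ {
    # fetch the line after the header and append it
    N
    # if that line is just "+" (optionally followed by blanks) ...
    /\n+[[:blank:]]*$/ {
      # ... fetch the quality line as well and discard all three
      N
      d
    }
  }' "$@"
}
```

One caveat that applies to this approach in general: a quality line that happens to begin with @ is treated as a header, so the read after it is not checked. Usage would be something like: sed_drop_empty_reads filename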

Upvotes: 2

Avinash Raj

Reputation: 174726

Using Perl:

$ perl -0777pe 's/[GATC]+\h*\n\K\+.*?[GATC]+\n//gs' file
@HWI-ST383:199:D1L73ACXX:3:1101:1309:1956 1:N:0:ACAGTGA 
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA 
+ 
IIIIFFF<?6?FAFEC@=C@1AE############### 

But this leaves the trailing spaces untouched. If you also want to remove the trailing spaces, try the following.

$ perl -0777pe 's/[GATC]+\K\h*\n\+.*?[GATC]+\n/\n/gs' file
@HWI-ST383:199:D1L73ACXX:3:1101:1309:1956 1:N:0:ACAGTGA
GATCTCGAAGCAAGAGTACGACGAGTCGGGCCCCTCCA 
+ 
IIIIFFF<?6?FAFEC@=C@1AE############### 

Upvotes: 0
