user1701545

Reputation: 6210

Piping output stream with conditions linux

I'm running a program in Linux that's producing an output which I'd like to pipe into another program. However, I'd like to pipe only lines that meet my criteria. Obviously this can be achieved in two steps. However, since I'm talking about millions of lines it's way more efficient to achieve it in one step.

The output stream is tab-delimited text (SAM format, if you're familiar with next-generation sequencing) and consists of two types of lines. The first type starts with a "@" character. For example:

@HD VN:1.0

@SQ SN:ENST00000601705.1 LN:42

@SQ SN:ENST00000602818.1 LN:1099

And another that doesn't and looks like these example lines:

SRR603690.1629913 99 ENST00000440588.2 327 255 76M = 390 139 GCAGATCCTGGACCAGGTTGAGCTGCGCGCAGGCTACCCTCCAGCCATACCCCACAACCTCTCCTGCCTCATGAAC CCCFFFFFHGHHGJIHIHIHIJJJIIF1DGHGIIJIGGHIII@GIIDHIGHHHDFB?ACEDA?(5;@BCCCCCCCA NH:i:20

SRR603690.1629913 99 ENST00000464365.2 2 255 76M = 65 139 GCAGATCCTGGACCAGGTTGAGCTGCGCGCAGGCTACCCTCCAGCCATACCCCACAACCTCTCCTGCCTCATGAAC CCCFFFFFHGHHGJIHIHIHIJJJIIF1DGHGIIJIGGHIII@GIIDHIGHHHDFB?ACEDA?(5;@BCCCCCCCA NH:i:20

What I'm looking for is a command that pipes all lines of the first type but, from the second type, only lines in which the last field is "NH:i:1".

Without this condition, my piping command looks like this:

> <program1> <program1_arguments> | <program2> <program2_arguments>

(specifically, program1 is an RNA-seq read aligner and program2 is samtools. The output of program1 is a text sam file which I'm piping into samtools to convert it to bam format. Therefore this command looks like this:

> <aligner> reads.fastq | samtools view -bS - > out.bam

)

So I'm looking to add my condition to that.

Is this (efficiently) possible?

Upvotes: 0

Views: 68

Answers (1)

Toby Speight

Reputation: 30910

What you want is a stream processor in your pipe. grep is probably sufficient here:

<aligner> reads.fastq | grep -E '^@|NH:i:1$' | samtools view -bS - >out.bam

This passes only lines that begin with @ or end with NH:i:1 from one process to the other.

There's no reason that interposing grep like this should be measurably inefficient.
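
grep matches against the whole line. If you'd rather compare the last field exactly, an awk filter is one alternative. This is a minimal sketch, assuming the fields really are tab-separated as the question states; `filter_sam` is a hypothetical helper name, not part of samtools or the aligner.

```shell
# filter_sam is a hypothetical helper name used only for illustration.
# It passes header lines (those starting with "@") unchanged, and passes
# alignment lines only when the last tab-separated field is exactly "NH:i:1".
# Assumption: the input is tab-delimited, as described in the question.
filter_sam() {
    awk -F'\t' '/^@/ || $NF == "NH:i:1"'
}

# Usage in the pipeline from the question:
#   <aligner> reads.fastq | filter_sam | samtools view -bS - >out.bam
```

Because the comparison is on the whole field, a value such as NH:i:11 or NH:i:20 is rejected, and a line that merely contains NH:i:1 somewhere other than the last field is rejected too.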

Upvotes: 1
