akaDrHouse

Reputation: 2240

Trying to combine awk and zcat with multiple filtering criteria

I have a very large file (40m rows x 400 columns).

Structure like:

chr  pos  snp
1   1   rs500
2   4   rs501
2   6   rs502
17   6   rs503

The file is named myfile.gz.

To search 3rd column for a given value the following works:

zcat myfile | grep rs500$

However, to search on two criteria, say chr = 17 and pos = 6, I was trying the following, but I can't get it to return any rows.

zcat myfile | awk '{ if ($1 == 17 && $2 == 6) print }'

No error, but nothing is returned. I've done this kind of filtering in the past without issue when the file wasn't .gz compressed, for example with this command on a different, much larger file, which filters on two columns and retrieves the matching rows:

"awk '{ if (NR == 1 || ($39  >= 0.03 && $36 <= 1e-04)) print }' myfile.notgzcompressed"

But I can't seem to combine that syntax with zcat, which I need because I don't want to unzip my huge archive.

EDIT to add information based on comments
zcat myfile.gz | head -2 | od -c
0000000   c   h   r  \t   p   o   s  \t   r   e   f  \t   a   l   t  \t
0000020   c   h   r   _   h   g   1   9  \t   p   o   s   _   h   g   1
0000040   9  \t   r   e   f   _   h   g   1   9  \t   a   l   t   _   h
0000060   g   1   9  \t   V   E   P   _   e   n   s   e   m   b   l   _
0000100   s   u   m   m   a   r   y  \t   r   s   _   d   b   S   N   P
0000120   1   5   1  \n   1  \t   1   0   1   8   0  \t   T  \t   C  \t
0000140   1  \t   1   0   1   8   0  \t   T  \t   C  \t   W   A   S   H
0000160   7   P   (   1   )   :   d   o   w   n   s   t   r   e   a   m
0000200   _   g   e   n   e   _   v   a   r   i   a   n   t   (   1   )
0000220   |   D   D   X   1   1   L   1   (   2   )   :   u   p   s   t
0000240   r   e   a   m   _   g   e   n   e   _   v   a   r   i   a   n
0000260   t   (   2   )  \t   r   s   2   0   1   6   9   4   9   0   1
0000300  \n

For more info, I am using R and fread() to pass commands like this, so that unix does the parsing before the data is loaded into the R environment. The chr and pos lookup values have already been assigned.

fread(cmd = paste0("zcat ", myfile, " | awk ","'{ if ($1  == ", chr ," && $2 == ",pos,") print }'")) -> h2

Upvotes: 0

Views: 173

Answers (1)

Daweo

Reputation: 36450

I suspect that whilst using

zcat myfile | awk '{ if ($1 == 17 && $2 == 6) print }'

with a humongous myfile, the problem might arise at |. Namely, | has a limited, machine-dependent capacity (further reading: The Pipe Buffer Capacity in Linux); if your awk does not read quickly enough, | might become jammed with data.
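One way to probe this suspicion is to push more data than a pipe buffer holds into a deliberately slow reader and count what arrives. The following is only a minimal sketch, assuming a POSIX shell with seq and awk available; the line count and the one-second delay are arbitrary illustrative choices, not values from the question.

```shell
# The writer emits 200000 lines (over 1 MB), far more than a typical pipe buffer.
# The reader sleeps first, so the buffer fills up before awk consumes anything.
count=$(seq 1 200000 | { sleep 1; awk 'END { print NR }'; })

# Number of lines that actually reached awk.
echo "$count"
```

If the printed count equals the number of lines written, the writer simply waited for the reader, and the pipe itself is unlikely to explain an empty result.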

If your data never has leading zeros, its fields are separated by a single TAB character, and you are interested in the 1st field being equal to one value and the 2nd field being equal to another, then you might use GNU grep for that task. The 1st field holding 17 and the 2nd field holding 6 can be expressed in the following way. Let's say you have a command which produces TAB-separated output

chr pos snp
1   1   rs500
2   4   rs501
2   6   rs502
17  6   rs503
17  600 rs504

then

command | grep -P --color=never '^17\t6\b'

gives output

17  6   rs503

Explanation: I instruct GNU grep to use Perl-flavored regular expressions (-P), not to contaminate the output with color escape sequences (--color=never), and to look for lines starting with (^) 17, followed by a TAB character, followed by 6 ending at a word boundary (\b). The word boundary prevents grabbing lines where the 2nd column starts with 6 but is not 6 (observe the last line of the command output).

(tested in GNU grep 3.7)
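If the same filter is preferred in awk, a minimal sketch follows. It assumes the data is TAB-separated as in the sample above. The file name sample.tsv.gz, the shell variables chr and pos, and the awk variables want_chr and want_pos are hypothetical stand-ins for the real archive and lookup values. gzip -dc is used in place of zcat (the two are equivalent for .gz files), and -v passes the lookup values into awk instead of pasting them into the program text.

```shell
# Build a small TAB-separated, gzip-compressed sample (hypothetical file name).
tmpdir=$(mktemp -d)
printf 'chr\tpos\tsnp\n1\t1\trs500\n2\t4\trs501\n2\t6\trs502\n17\t6\trs503\n17\t600\trs504\n' \
  | gzip -c > "$tmpdir/sample.tsv.gz"

# Hypothetical lookup values.
chr=17
pos=6

# -F'\t' makes TAB the only field separator.
# NR == 1 keeps the header; the other condition keeps rows matching both values.
result=$(gzip -dc "$tmpdir/sample.tsv.gz" \
  | awk -F'\t' -v want_chr="$chr" -v want_pos="$pos" \
      'NR == 1 || ($1 == want_chr && $2 == want_pos)')

printf '%s\n' "$result"

# Clean up the temporary sample.
rm -rf "$tmpdir"
```

Because awk compares the whole field numerically here, the row with pos 600 is not selected, so no word-boundary trick is needed.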

Upvotes: 1
