Reputation: 2240
I have a very large file (40m rows x 400 columns).
Its structure looks like this:
chr pos snp
1 1 rs500
2 4 rs501
2 6 rs502
17 6 rs503
Say the file is named myfile.gz.
To search the 3rd column for a given value, the following works:
zcat myfile | grep rs500$
However, I want to search on two criteria, say chr = 17 and pos = 6.
I was trying the following, but can't get it to return any rows.
zcat myfile | awk '{ if ($1 == 17 && $2 == 6) print }'
There is no error, but nothing is returned. I've done this kind of filtering in the past without issue when the file wasn't gzip-compressed,
such as with this command on a different, much larger file, which filters two columns on criteria and then retrieves the results:
awk '{ if (NR == 1 || ($39 >= 0.03 && $36 <= 1e-04)) print }' myfile.notgzcompressed
But I can't seem to combine that syntax with zcat, which I need because I don't want to have to decompress my huge archive.
EDIT to add information based on comments
zcat myfile.gz | head -2 | od -c
0000000 c h r \t p o s \t r e f \t a l t \t
0000020 c h r _ h g 1 9 \t p o s _ h g 1
0000040 9 \t r e f _ h g 1 9 \t a l t _ h
0000060 g 1 9 \t V E P _ e n s e m b l _
0000100 s u m m a r y \t r s _ d b S N P
0000120 1 5 1 \n 1 \t 1 0 1 8 0 \t T \t C \t
0000140 1 \t 1 0 1 8 0 \t T \t C \t W A S H
0000160 7 P ( 1 ) : d o w n s t r e a m
0000200 _ g e n e _ v a r i a n t ( 1 )
0000220 | D D X 1 1 L 1 ( 2 ) : u p s t
0000240 r e a m _ g e n e _ v a r i a n
0000260 t ( 2 ) \t r s 2 0 1 6 9 4 9 0 1
0000300 \n
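Since the awk filter addresses columns by position, it can help to list which field number each header name has. The following is only a minimal sketch: the header string is reconstructed from the od -c dump above, and the variable names are made up for illustration. On the real file the input would come from zcat myfile.gz | head -1 instead.

```shell
# Header line reconstructed from the od -c dump above (assumption: no other columns).
header=$(printf 'chr\tpos\tref\talt\tchr_hg19\tpos_hg19\tref_hg19\talt_hg19\tVEP_ensembl_summary\trs_dbSNP151')

# Put one field per line and prefix each with its awk field number.
# On the real file the input would be: zcat myfile.gz | head -1
numbered=$(printf '%s\n' "$header" | tr '\t' '\n' | awk '{ print "$" NR, $0 }')
printf '%s\n' "$numbered"
```

In this dump chr and pos are fields 1 and 2, and the first data row has chr 1 and pos 10180.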
For more context, I am using R and fread() to pass commands like this, so that Unix does the filtering before the data is loaded into the R environment. The chr and pos lookup values have already been assigned.
fread(cmd = paste0("zcat ", myfile, " | awk ","'{ if ($1 == ", chr ," && $2 == ",pos,") print }'")) -> h2
Upvotes: 0
Views: 173
Reputation: 36450
I suspect that whilst using
zcat myfile | awk '{ if ($1 == 17 && $2 == 6) print }'
with a humongous myfile, the problem might arise at the pipe (|). Namely, | has a limited, machine-dependent capacity (further reading: The Pipe Buffer Capacity in Linux); if your awk does not read quickly enough, | might fill up with data.
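This suspicion can be probed on a small scale. The sketch below is not from the original command: it uses seq as a stand-in producer and a deliberately slow reader, and the variable name count is made up for illustration. A writer to a full pipe normally blocks until the reader catches up, so I would expect every line to arrive; if that holds on your machine too, the pipe itself is probably not what is dropping rows.

```shell
# seq writes roughly 1.3 MB, far more than the default 64 KiB pipe buffer on Linux,
# while the reader deliberately waits a second before consuming anything.
count=$(seq 1 200000 | { sleep 1; wc -l; } | tr -d ' ')

# Print how many lines made it through the pipe.
printf '%s\n' "$count"
```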
If your data never has leading zeros, has fields separated by a single TAB character, and you are interested in the 1st field being equal to one value and the 2nd field being equal to another, then you might use GNU grep for that task. The 1st field holding 17 and the 2nd field holding 6 might be expressed in the following way. Let's say you have a command which produces this TAB-separated output
chr pos snp
1 1 rs500
2 4 rs501
2 6 rs502
17 6 rs503
17 600 rs504
then
command | grep -P --color=never '^17\t6\b'
gives output
17 6 rs503
Explanation: I instruct GNU grep to use Perl-flavored regular expressions (-P), not to contaminate the output with escape sequences (--color=never), and to look for lines starting with (^) 17, followed by a TAB character, followed by 6 spanning to a word boundary (\b), in order to prevent grabbing lines where the 2nd column starts with 6 but is not 6 (observe the last line of the command's output).
(tested in GNU grep 3.7)
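If you would rather stay with awk, a similar filter can be sketched with an explicit TAB field separator and the lookup values passed in with -v, which avoids building a regular expression and compares whole fields. This is only a sketch under assumptions: the sample file sample.tsv.gz and the variables tmpdir, chr, pos and result are made up for illustration, and the data is the same toy table as above.

```shell
# Build a small TAB-separated, gzip-compressed sample (hypothetical file name).
tmpdir=$(mktemp -d)
printf 'chr\tpos\tsnp\n1\t1\trs500\n2\t4\trs501\n2\t6\trs502\n17\t6\trs503\n17\t600\trs504\n' \
  | gzip > "$tmpdir/sample.tsv.gz"

chr=17
pos=6
# -F'\t' splits on TAB only; -v passes the shell values in as awk variables.
# NR == 1 keeps the header line, as in the question's uncompressed example.
result=$(zcat "$tmpdir/sample.tsv.gz" \
  | awk -F'\t' -v chr="$chr" -v pos="$pos" 'NR == 1 || ($1 == chr && $2 == pos)')
printf '%s\n' "$result"

# Clean up the temporary sample.
rm -rf "$tmpdir"
```

On this sample only the header and the rs503 row should be kept; the 17 600 row is rejected because the whole 2nd field is compared rather than its prefix.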
Upvotes: 1