FatBerg

Reputation: 137

I'm trying to awk a file, but awk cannot allocate sufficient memory. Are there any alternatives or adjustments?

The file in question is a pileup file from RNAseq. I want to extract information on one chromosome. This has worked for smaller files:

awk '/chrM/ { print }' file1.pileup > file1.chrm.pileup

The error message:

awk: (FILENAME=file1.pileup FNR=1743118775) fatal: grow_iop_buffer: iop->buf: can't allocate 137438953474 bytes of memory (Cannot allocate memory)

Is there an alternative command, or a sub-command to circumvent this?

Thanks for any help.

Edit:

Data looks like this:

chr1    258755  T       1       .                 F
chr1    258756  C       1       ......            F
chr1    258757  T       1       ...               H
chr1    258758  A       1       ...........       H

It is 3529769718150 bytes.

I expect to find (basically a bunch of rows between ~70-75% of the way down):

chrM    6432       C       1       ^~.            B
chrM    7294       A       1       ........       B
chrM    7296       G       1       .....          B

Edit2:

Output of 'head -n 1 File1 | od -c':

0000000   c   h   r   1  \t   2   5   8   7   4   9  \t   T  \t   1  \t
0000020   ^   ~   .  \t   C  \n
0000026

Output of 'head -c xxx File1 | od -c':

head: xxx: invalid number of bytes
0000000

Output of 'head -c 100 File1 | od -c':

0000000   c   h   r   1  \t   2   5   8   7   4   9  \t   T  \t   1  \t
0000020   ^   ~   .  \t   C  \n   c   h   r   1  \t   2   5   8   7   5
0000040   0  \t   T  \t   1  \t   .  \t   C  \n   c   h   r   1  \t   2
0000060   5   8   7   5   1  \t   T  \t   1  \t   G  \t   C  \n   c   h
0000100   r   1  \t   2   5   8   7   5   2  \t   T  \t   1  \t   .  \t
0000120   F  \n   c   h   r   1  \t   2   5   8   7   5   3  \t   C  \t
0000140   1  \t   .  \t
0000144

Upvotes: 2

Views: 733

Answers (3)

Kyle Banerjee

Reputation: 2794

It sounds like your awk build might not be able to deal with files larger than about 2 GB because a 32-bit offset can't address them.

Try running

split --line-bytes=2GB file1.pileup

This will split your file into pieces of at most 2 GB each, which you should then be able to process as you'd like.
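A minimal sketch of the split-then-filter workflow, assuming GNU split (the chunk_ prefix is arbitrary, and the two-line stand-in file here merely substitutes for the real 3.5 TB pileup):

```shell
# Two-line stand-in for the real pileup, using the layout from the question.
printf 'chr1\t258755\tT\t1\t.\tF\nchrM\t6432\tC\t1\t^~.\tB\n' > file1.pileup

# Split into chunks of at most 2 GB without breaking lines mid-way
# (--line-bytes keeps whole lines together, unlike --bytes).
split --line-bytes=2GB file1.pileup chunk_

# Run the original awk filter over each chunk and concatenate the results.
for f in chunk_*; do
    awk '/chrM/ { print }' "$f"
done > file1.chrm.pileup

cat file1.chrm.pileup    # only the chrM line survives
```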

Upvotes: 2

glenn jackman

Reputation: 246942

I wonder if you'll have better success avoiding regular expressions:

awk '$1 == "chrM"' file1.pileup > file1.chrm.pileup

I wonder if your file got "corrupted", and somewhere in the file there's one line that is 137438953474 bytes long. Can you try this:

awk '{print NR, NF, length($0)}' file1.pileup > file1.line_lengths

And see where it craps out?
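Both commands can be tried on a small stand-in file first (the sample lines are adapted from the question):

```shell
# Tiny stand-in file using the layout shown in the question.
printf 'chr1\t258755\tT\t1\t.\tF\nchrM\t6432\tC\t1\t^~.\tB\n' > sample.pileup

# Exact string comparison on field 1 -- no regex engine involved.
awk '$1 == "chrM"' sample.pileup

# Per-line diagnostics: record number, field count, line length.
awk '{print NR, NF, length($0)}' sample.pileup
```

On the real file, a sudden jump in the length column of the diagnostic output would point at the corrupted record.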

Upvotes: 0

anubhava

Reputation: 785316

You can just use grep -F (fixed text search) here instead of awk:

grep -wF 'chrM' file1.pileup > file1.chrm.pileup

If you really want to use awk, then a faster and shorter command would avoid the regex:

awk 'index($0, "chrM")' file1.pileup > file1.chrm.pileup
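Both variants read the file line by line, so memory use stays constant regardless of file size. A quick check on a stand-in file (sample lines adapted from the question):

```shell
# Two-line stand-in using the layout from the question.
printf 'chr1\t258755\tT\t1\t.\tF\nchrM\t6432\tC\t1\t^~.\tB\n' > sample.pileup

# -F: fixed-string search; -w: whole-word match, so e.g. "chrMT" wouldn't hit.
grep -wF 'chrM' sample.pileup

# awk equivalent: index() returns a position > 0 (true) when the
# substring occurs anywhere on the line.
awk 'index($0, "chrM")' sample.pileup
```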

Upvotes: 0
