Jovan Andonov

Reputation: 446

How to use awk and grep on 300GB .txt file?

I have a huge .txt file, 300GB to be more precise, and I would like to put all the distinct strings from the first column that match my pattern into a different .txt file.

awk '{print $1}' file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt

This is what I've tried, and as far as I can see it works fine, but the problem is that after some time I get the following error:

awk: program limit exceeded: maximum number of fields size=32767
    FILENAME="file_name" FNR=117897124 NR=117897124

Any suggestions?

Upvotes: 4

Views: 800

Answers (5)

Ed Morton

Reputation: 203229

The error message says your input file contains too many fields for your awk implementation. Just change the field separator to be the same as the record separator so you only have 1 field per line, which avoids that problem, and then merge the rest of the commands into one:

awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\// && !seen[$0]++' file_name

If holding all the unique strings in memory (the seen array) is a problem, then try:

awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\//' file_name | sort -u

There may be an even simpler solution but since you haven't posted any sample input and expected output, we're just guessing.

Upvotes: 0

kev

Reputation: 161614

The error message tells you:

line 117897124 has too many fields (>32767).

You'd better check it out:

sed -n '117897124{p;q}' file_name

Use cut to extract the 1st column:

cut -d ' ' -f 1 < file_name | ...

Note: You may change ' ' to whatever the field separator is; cut's default delimiter is a tab ($'\t').
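
For example, combined with the rest of the original pipeline, the whole thing might look like this (just a sketch, assuming the columns are space-separated):

cut -d ' ' -f 1 < file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt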

Upvotes: 2

Adrian B

Reputation: 1631

If you have enough free space on disk (because Vim creates a temp .swp file), I suggest using Vim. Vim's regex syntax differs slightly from standard regex, but you can convert from standard regex to Vim regex with this tool: http://thewebminer.com/regex-to-vim
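
For instance, a minimal sketch of the filtering idea inside Vim, reusing the /ns/ pattern from the question (note this keeps whole matching lines rather than extracting just the matched part):

:v/\/ns\//d
:sort u
:w test1.txt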

Upvotes: 0

jimm-cl

Reputation: 5412

It seems to me that your awk implementation has an upper limit on the number of records it can read in one go, which you hit at 117,897,124. The limit can vary according to your implementation and your OS.

Maybe a sane way to approach this problem is to write a custom script that uses split to break the large file into smaller ones, with no more than 100,000,000 records each.
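
For example, a sketch of that approach (the chunk_ prefix is arbitrary; each chunk is run through the original pipeline and the results are de-duplicated at the end):

split -l 100000000 file_name chunk_
for f in chunk_*; do awk '{print $1}' "$f" | grep -o '/ns/.*'; done | awk '!seen[$0]++' > test1.txt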


In case you don't want to split the file, maybe you could look for the limits file corresponding to your awk implementation. Maybe you can define 'unlimited' as the Number of Records value, although I believe that is not a good idea, as you might end up using a lot of resources...

Upvotes: 2

Norman Gray

Reputation: 12514

The 'number of fields' is the number of 'columns' in the input file, so if one of the lines is really long, then that could potentially cause this error.
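
For instance, one quick way to count the whitespace-separated fields on the line reported in the error message, without going through awk:

sed -n '117897124{p;q}' file_name | wc -w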

I suspect that the awk and grep steps could be combined into one:

sed -n 's/\(^pattern...\).*/\1/p' some_file | awk '!seen[$0]++' > test1.txt

That might evade the awk problem entirely (that sed command replaces the entire line with just the leading text that matches the pattern, and prints the line only if the substitution was made).

Upvotes: 2
