Reputation: 446
I have a huge .txt file, 300GB to be more precise, and I would like to put all the distinct strings from the first column that match my pattern into a different .txt file.
awk '{print $1}' file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt
This is what I've tried, and as far as I can see it works fine, but the problem is that after some time I get the following error:
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="file_name" FNR=117897124 NR=117897124
Any suggestions?
Upvotes: 4
Views: 800
Reputation: 203229
The error message says your input file contains too many fields for your awk implementation. Just change the field separator to be the same as the record separator and you'll only have 1 field per line and so avoid that problem, then merge the rest of the commands into one:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\// && !seen[$0]++' file_name
If the memory used by the seen[] array becomes a problem with a file that size, then try:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\//' file_name | sort -u
There may be an even simpler solution but since you haven't posted any sample input and expected output, we're just guessing.
Upvotes: 0
Reputation: 161614
The error message tells you:
line 117897124 has too many fields (>32767).
You'd better check it out:
sed -n '117897124{p;q}' file_name
Use cut to extract the 1st column:
cut -d ' ' -f 1 < file_name | ...
Note: you may change ' ' to whatever the field separator is. The default is $'\t'.
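For example, the original pipeline with cut substituted for the first awk step (assuming the columns are space-separated) would look roughly like this:
cut -d ' ' -f 1 file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt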
Upvotes: 2
Reputation: 1631
If you have enough free space on disk (Vim creates a temporary .swp file), I suggest using Vim. Vim's regex syntax differs slightly from standard regex, but you can convert a standard regex to a Vim regex with this tool: http://thewebminer.com/regex-to-vim
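As a rough sketch of the idea (using the /ns/ pattern from the question; not an exact equivalent of the original pipeline), the commands inside Vim could look something like:
" delete every line that does not contain the pattern
:v/\/ns\//d
" keep only the first column on each line
:%s/\s.*//e
" sort the remaining lines and drop duplicates
:sort u
" write the result to a new file
:w test1.txt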
Upvotes: 0
Reputation: 5412
It seems that your awk implementation hit one of its internal limits while reading record 117,897,124. The limits can vary according to your implementation and your OS.
Maybe a sane way to approach this problem is to write a script that uses split to break the large file into smaller ones, with no more than 100,000,000 records each, as sketched below.
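A minimal sketch of that idea (the chunk_ prefix, the chunk size and the cleanup step are just illustrative):
# split the file into pieces of at most 100,000,000 lines each
split -l 100000000 file_name chunk_
# run the original extraction on every piece, then dedup across all of them
for f in chunk_*; do
  awk '{print $1}' "$f" | grep -o '/ns/.*'
done | awk '!seen[$0]++' > test1.txt
rm chunk_*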
In case you don't want to split the file, maybe you could look for the limits file corresponding to your awk implementation. Maybe you can define unlimited as that limit's value, although I believe that is not a good idea, as you might end up using a lot of resources...
Upvotes: 2
Reputation: 12514
The 'number of fields' is the number of 'columns' in the input file, so if one of the lines contains a very large number of fields, that could cause this error.
I suspect that the awk and grep steps could be combined into one:
sed -n 's/\(^pattern...\).*/\1/p' some_file | awk '!seen[$0]++' > test1.txt
That might evade the awk problem entirely (that sed command substitutes any leading text which matches the pattern in place of the entire line, and if it matches, prints out the line).
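With the /ns/ pattern from the question, and assuming the matching strings start at the beginning of the first column (adjust the expression to your actual data), that could look something like:
sed -n 's|^\(/ns/[^[:space:]]*\).*|\1|p' file_name | awk '!seen[$0]++' > test1.txt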
Upvotes: 2