Reputation: 23
I need to split a huge file (about 4 million lines) into subfiles based on a pattern.
I always use awk to do that, and it works perfectly on files of up to about a hundred thousand lines. Files bigger than that return the following error:
awk: cannot open "filename" for output (Too many open files)
Here is the command line that I'm using:
awk '{OFS="\t"; print $1,$2,$3,$4,$12 > $10"_"$8.txt"}' mybigfile.txt
In $10 there are about 4 or 5 thousand different patterns that I need to split the file into.
How can I overcome this error? Where should I insert the close command?
(I'm using the awk in the Ubuntu distribution.)
Upvotes: 2
Views: 1518
Reputation: 133610
Every time a line has a new combination of $10 and $8, awk opens a new output file, and it keeps that file open for as long as the program is running. With thousands of different names, the awk process eventually hits its limit on open files, so we have to close the files ourselves.
Please try the following and let me know if it helps.
awk 'BEGIN{OFS="\t";} {if(prev){close(prev)};print $1,$2,$3,$4,$12 >> ($10"_"$8".txt");prev=$10"_"$8".txt"}' mybigfile.txt
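To see the idea on something small, here is a minimal sketch of a close-on-change variant. The file sample.txt and its contents are made up for illustration; only the column positions ($1-$4, $8, $10, $12) come from the question. Unlike the one-liner above, it closes the previous file only when the output name actually changes, which should mean fewer open/close cycles when lines for the same file arrive in runs.

```shell
# Hypothetical 12-column sample; only columns 1-4, 8, 10 and 12 are used.
cat > sample.txt <<'EOF'
a1 a2 a3 a4 c5 c6 c7 X c9 P1 c11 a12
b1 b2 b3 b4 c5 c6 c7 X c9 P1 c11 b12
c1 c2 c3 c4 c5 c6 c7 Y c9 P2 c11 c12
d1 d2 d3 d4 c5 c6 c7 X c9 P1 c11 d12
EOF

# ">>" appends, so remove leftovers from an earlier run first.
rm -f P1_X.txt P2_Y.txt

awk 'BEGIN { OFS = "\t" }
{
    out = $10 "_" $8 ".txt"
    # Close the previous file only when the target name changes.
    if (prev != "" && out != prev) close(prev)
    # Append, so a file that gets reopened later keeps its earlier lines.
    print $1, $2, $3, $4, $12 >> out
    prev = out
}' sample.txt
```

At most one output file is open at any time, so the open-file limit is never reached. Because of the ">>", running it twice without deleting the old output files will duplicate the lines in them.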
Upvotes: 1
Reputation: 203985
Copy/paste exactly this command and it will work:
awk 'BEGIN{OFS="\t"} {out=$10"_"$8".txt"; print $1,$2,$3,$4,$12 >> out; close(out)}' mybigfile.txt
You've been experiencing 2 problems:
1) You're using an awk that is not GNU awk and so doesn't close files for you when needed, and
2) You're re-typing the commands people are suggesting you use instead of copy-pasting them and messing up the quotes when you do so, just like in the script in your question.
If you can use gawk then it'd simply be:
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$12 > ($10"_"$8".txt")}' mybigfile.txt
Unlike with several other awks, you don't technically need to parenthesize the expression on the right side of output redirection with gawk, but it's a good habit to get into for portability, and it helps readability.
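If you're stuck with an awk that doesn't manage open files for you, another option (not from the commands above, just a sketch) is to sort the input on the two key columns first, so all lines for a given output file arrive together and each file is opened and closed only once. The file sample2.txt is made up for illustration. The -s (stable) flag is supported by GNU and BSD sort but is not in POSIX, and the sketch assumes fields are separated by single blanks and that no two different $10/$8 pairs produce the same file name.

```shell
# Hypothetical 12-column sample; only columns 1-4, 8, 10 and 12 are used.
cat > sample2.txt <<'EOF'
a1 a2 a3 a4 c5 c6 c7 X c9 P1 c11 a12
c1 c2 c3 c4 c5 c6 c7 Y c9 P2 c11 c12
d1 d2 d3 d4 c5 c6 c7 X c9 P1 c11 d12
EOF

# Group lines by column 10, then column 8; -s keeps the original order
# of lines that share both keys. LC_ALL=C gives plain byte-wise comparison.
LC_ALL=C sort -s -k10,10 -k8,8 sample2.txt |
awk 'BEGIN { OFS = "\t" }
{
    out = $10 "_" $8 ".txt"
    if (out != prev) {
        # Input is grouped, so the previous file is finished for good.
        if (prev != "") close(prev)
        prev = out
    }
    # ">" truncates on first open, so stale output from old runs is replaced.
    print $1, $2, $3, $4, $12 > (out)
}'
```

The trade-off is the cost of sorting 4 million lines up front, in exchange for one open and one close per output file instead of one per input line.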
Upvotes: 2