Sergey Bushmanov

Reputation: 25199

Awk fails to split a big file (10Gb+)

I am trying to split a big (10Gb+) text file on a prespecified number of empty lines with the following script:

awk 'BEGIN {nParMax = 100000; npar = 0 ;nFile =0}
     /^$/{npar++;if(npar==nParMax){nFile++;npar=0;next}}
     {print $0 > "split_"nFile".out"}'  fname

The script works as expected on files below 1Gb, but when I run it on the bigger file, the end of a split is cut at a seemingly random location. ("Random" means I do not understand why it splits at this place: it might be (i) the end of the first field for one split, (ii) the middle of a field for another, or (iii) the middle of a comment line for another. If I repeat the experiment, though, awk always splits at the same location, as far as I can tell.)

The rest of the "randomly" split paragraph is lost. The next split always starts cleanly, from the line following the split (empty line).

Example of a last paragraph with special characters shown:

# sent_id = 170247_3$
# text = В то же время видеокадры с места событий свидетельствуют о том, что после звука, похожего на выстрел, находившихся на площади людей охватила паника.$
1^IВ^I_^IADP^I_^I_^I4^Icase^I_^IO$
2^Iто^I_^IDET^I_^IAnimacy=Inan|Case=Acc|Gender=Neut|Number=Sing^I4^Idet^I_^IO$
3^Iже^I_^IPART^I_^I_^I2^Iadvmod^I_^IO$
4^Iвремя^I_^INOUN^I_^IAnimacy=Inan|Case=Acc|Gender=Neut|Number=Sing^I9^Iobl^I_^IO$
5^Iвидеокадры^I_^INOUN^I_^IAnimacy=Inan|Case=Nom|Gender=Masc|Number=Plur^I9^Insubj^I_^IO$
6^Iс^I_^IADP^I_^I_^I7^Icase^I_^IO$
7^Iместа^I_^INOUN^I_^IAnimacy=Inan|Case=Gen|Gender=Neut|Number=Sing^I5^Inmod^I_^IO$
8^Iсобытий^I_^INOUN^I_^IAnimacy=Inan|Case=Gen|Gender=Neut|Number=Plur^I7^Inmod^I_^IO$
9^Iсвидетельствуют^I_^IVERB^I_^IAspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act^I0^Iroot^I_^IO$
10^Iо^I_^IADP^I_^I_^I11^Icase^I_^IO$
11^Iтом^I_^IPRON^I_^IAnimacy=Inan|Case=Loc|Gender=Neut|Number=Sing^I9^Iobl^I_^IO$
12^I,^I_^IPUNCT^I_^I_^I25^Ipunct^I_^IO$
13^Iчто^I_^ISCONJ^I_^I_^I25^Imark^I_^IO$
14^Iпосле^I_^IADP^I_^I_^I15^Icase^I_^IO$
15^Iзвука^I_^INOUN^I_^IAnimacy=Inan|Case=Gen|Gender=Masc|Number=Sing^I25^Iobl^I_^IO$
16^I,^I_^IPUNCT^I_^I_^I17^Ipunct^I_^IO$
17^Iпохожего^I_^IADJ^I_^ICase=Gen|Degree=Pos|Gender=Masc|Number=Sing^I15^Iamod^I_^IO$
18^Iна^I_^IADP^I_^I_^I19^Icase^I_^IO$
19^Iвыстрел^I_^INOUN^I_^IAnimacy=Inan|Case=Acc|Gender=Masc|Number=Sing^I17^Iobl^I_^IO$
20^I,^I_^IPUNCT^I_^I_^I15^Ipunct^I_^IO$
21^Iнаходившихся^I_^IVERB^I_^IAnimacy=Anim|Aspect=Imp|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act^I24^Iacl^I_^IO$
22^Iна^I_^IADP^I_^I_^I23^Icase^I_^IO$

The resulting split is on line 3 (this is the tail of the first split):

# sent_id = 170247_3
# text = В то же время видеокадры с места событий свидетельствуют о том, что после звука, похожего на выстрел, находившихся на площади людей охватила паника.
1       В       _       ADP     _       _       4       case    _       O
2       то      _       DET     _       Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing   4       det     _       O
3

The problem disappears if I split a file below 1Gb.

The split is done on an Ubuntu server with 128 GB RAM, in bash via SSH, with GNU Awk 4.1.4, just in case.

What could be a solution to circumvent this problem?

Upvotes: 3

Views: 194

Answers (1)

nav610

Reputation: 791

If your code works fine on a 1Gb file, you can try splitting your input file into smaller files with the split command and then running your awk code on each of the resulting segments.

To split your file into files each with 500 lines:

split -l 500 myfile segment

The output files would be segmentaa, segmentab, segmentac ...
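Before running awk on the pieces, it may be worth confirming that split itself lost nothing. The following is a minimal sketch; check_segments is a hypothetical helper name, and it assumes that no unrelated files in the current directory start with the same prefix.

```shell
# check_segments ORIGINAL PREFIX   (hypothetical helper, not part of split)
# Succeeds only if the files PREFIX* concatenate back to ORIGINAL byte for byte.
check_segments() {
    # The glob expands in sorted order, which matches split's aa, ab, ac ... suffixes.
    cat "$2"* | cmp -s - "$1"
}
```

Usage, with the names from the command above: split -l 500 myfile segment && check_segments myfile segment && echo "segments are complete"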

To split your file into files each of size 1Gb:

split -b 1G myfile segment

For a 10Gb input, this would produce about 10 files named segmentaa, segmentab, segmentac ...
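One caveat: split -b cuts at an exact byte count, so a segment can end in the middle of a line. GNU split also has -C (--line-bytes), which puts at most SIZE bytes of whole lines into each file. Below is a rough sketch of the whole approach under these assumptions: GNU coreutils is available, split_paragraphs and the segment_ / split_ file names are made up for illustration, and the awk logic is the one from the question with the output file named via a variable and closed once it is full. The paragraph counter restarts in every segment, so the cut points will differ from a single pass over the whole file, and a paragraph that straddles a segment boundary ends up divided between two output files.

```shell
# split_paragraphs INPUT SIZE NPAR   (hypothetical helper; assumes GNU split)
# Step 1: cut INPUT into segments of at most SIZE bytes without breaking lines.
# Step 2: run the awk paragraph splitter from the question on each segment.
split_paragraphs() {
    input=$1 size=$2 npar_max=$3
    # -C keeps lines whole: each piece holds at most SIZE bytes of complete lines.
    split -C "$size" -- "$input" segment_ || return 1
    for seg in segment_*; do
        # Output names carry the segment name so that segments do not
        # overwrite each other's files; close() releases each finished file.
        awk -v nParMax="$npar_max" -v prefix="split_$seg" '
            BEGIN { out = prefix "_0.out" }
            /^$/ {
                if (++npar == nParMax) {
                    close(out)
                    nFile++; npar = 0
                    out = prefix "_" nFile ".out"
                    next
                }
            }
            { print > out }
        ' "$seg" || return 1
    done
}
```

Usage: split_paragraphs fname 1G 100000 would write files such as split_segment_aa_0.out, split_segment_aa_1.out, and so on.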

Upvotes: 1
