Reputation: 45
I want to split a large file (>15 GB, several million records) into smaller chunks, each with a fixed number of records. I am using Ubuntu 16.04.
Here are the rules:
I searched for similar questions like this one, but could not find exactly what I was looking for.
Here is an example of the input file syntax.
example.sdf
Item1
Mrv171c009131823372D
2 1 0 0 0 0 999 V2000
-3.7946 2.9241 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.9708 2.9673 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <property_1>
3
$$$$
Element2
Mrv171c009131823372D
2 1 0 0 0 0 999 V2000
-3.6161 1.7634 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.7956 1.8496 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <property_1>
5
$$$$
Something3
Mrv171c009131823372D
2 1 0 0 0 0 999 V2000
-3.0580 0.5134 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-3.5772 1.1545 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <property_1>
10
$$$$
Desired output for n=2:
example.sdf.chunk000001
Item1
Mrv171c009131823372D
2 1 0 0 0 0 999 V2000
-3.7946 2.9241 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.9708 2.9673 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <property_1>
3
$$$$
Element2
Mrv171c009131823372D
2 1 0 0 0 0 999 V2000
-3.6161 1.7634 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.7956 1.8496 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <property_1>
5
$$$$
example.sdf.chunk000002
Something3
Mrv171c009131823372D
2 1 0 0 0 0 999 V2000
-3.0580 0.5134 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-3.5772 1.1545 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <property_1>
10
$$$$
So far I have tried to do this with split and with awk (see below), but both attempts look clumsy. I also looked at csplit, but could not find an option to put a fixed number of records in each chunk.
The split command works fine, but it does not accept '$$$$' as a delimiter because it is more than one character. I can get it to work by replacing the pattern with a single character (@), but that would go wrong if this character also appears elsewhere in the SDF file.
# replace the separator with a dummy
sed -e 's/\$\$\$\$/@/g' example.sdf > example.sdf.tmp
# split the file (3 records) into smaller chunks (xaa, xab, etc.) with max 2 records
split -t @ -l 2 example.sdf.tmp
# replace the dummy with the proper separator
for f in xa*; do tail -n +2 $f |sed 's/@/\$\$\$\$/g' > $f.fixed; done
Unfortunately, editing the input file and then every chunk does not seem very efficient, so I tried to use awk instead.
I'm very new to awk, but I managed to get this:
awk 'NR%2==1 {x=sprintf(".chunk%06d",++i);} END {printf "%s",$0} {print>FILENAME x}' RS="\\$\\$\\$\\$" ORS="\$\$\$\$" example.sdf
The first chunk looks exactly like what I am looking for, but the second has two errors:
example.sdf.chunk000002
[ blank line ]
Something3
Mrv171c009131823372D
2 1 0 0 0 0 999 V2000
-3.0580 0.5134 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-3.5772 1.1545 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <property_1>
10
$$$$
$$$$
As you can see, there is an empty line at the beginning of the file (I could not display it, so I typed [ blank line ] instead) and an extra end pattern at the end of the last chunk. I also tried a file with 9 records: I got the empty line at the beginning of chunks 2-5 and the extra '$$$$' at the end of chunk 5.
How could I fix this behavior so I get the expected output?
Any help would be much appreciated!
Jose Manuel
Upvotes: 2
Views: 1675
Reputation: 203229
With GNU awk for multi-char RS, RT and handling of multiple open files:
$ awk -v RS='\n[$]{4}\n' 'NR%2{out="out"++c} {print $0 RT " > " out}' file
Item1
Mrv171c009131823372D
2 1 0 0 0 0 999 V2000
-3.7946 2.9241 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.9708 2.9673 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <property_1>
3
$$$$
> out1
Element2
Mrv171c009131823372D
2 1 0 0 0 0 999 V2000
-3.6161 1.7634 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.7956 1.8496 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <property_1>
5
$$$$
> out1
Something3
Mrv171c009131823372D
2 1 0 0 0 0 999 V2000
-3.0580 0.5134 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-3.5772 1.1545 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <property_1>
10
$$$$
> out2
Just change " > " to > after you've tested and are happy with the output.
With any awk:
awk '
NR==1 { out="out"++c }
{ print > out }
($0=="$$$$") && (((++nr)%2)==0) { close(out); out="out"++c }
' file
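The same "any awk" idea can be adapted to the file naming asked for in the question. The following is only a sketch under a few assumptions: every record ends with a line that is exactly $$$$, the shortened three-record sample stands in for a real SDF file, and the variable names n, c, nr and out are illustrative choices.

```shell
# Build a tiny 3-record sample (hypothetical, shortened records) in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'Item1' 'M END' '$$$$' 'Element2' 'M END' '$$$$' 'Something3' 'M END' '$$$$' > example.sdf

n=2   # records per chunk
awk -v n="$n" '
    # open the first chunk: example.sdf.chunk000001
    NR == 1 { out = sprintf("%s.chunk%06d", FILENAME, ++c) }
    # every line goes to the current chunk
    { print > out }
    # after every n-th "$$$$" line, close the finished chunk and name the next one
    $0 == "$$$$" && ++nr % n == 0 { close(out); out = sprintf("%s.chunk%06d", FILENAME, ++c) }
' example.sdf

ls example.sdf.chunk*
```

On the three-record sample this leaves example.sdf.chunk000001 with two records and example.sdf.chunk000002 with one. Because each chunk is closed as soon as it is complete, only one output file is open at a time.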
Upvotes: 1
Reputation: 26471
Here is a small update to the solution of Cortenin Limier
original:
awk 'BEGIN{n_records=2; counter=0}
{ print > "file_" int(counter/n_records) ".txt";
if($0 ~ /\$\$\$\$/){counter++}}' example.sdf
update:
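awk 'BEGIN{ n_records=2 }
(NR==1){ file=sprintf("%s.chunk%0.6d",FILENAME,counter) }
{ print > file }
($0=="$$$$"){
  # close only when the chunk name changes: reopening a closed file with ">" would truncate it
  next_file=sprintf("%s.chunk%0.6d",FILENAME,int(++counter/n_records))
  if (next_file!=file) { close(file); file=next_file }
}' example.sdf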
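The differences are:
- counter=0 is no longer set in BEGIN, since an unset awk variable already evaluates to 0
- file holds the filename, so it is not generated at each step
- file is closed when it is not needed anymore
- the output files are named FILENAME.chunknnnnnn, where FILENAME is substituted by the name of the original file, here example.sdf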
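One caveat when pairing close() with print > file: awk truncates a file whenever it opens it with >, and that includes reopening it after close(). Closing is therefore only safe once a chunk is complete. A minimal sketch of the difference, using hypothetical scratch file names (trunc.txt and append.txt):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

awk 'BEGIN {
    # ">" after close() reopens the file from scratch, so "first" is lost
    print "first"  > "trunc.txt";   close("trunc.txt")
    print "second" > "trunc.txt";   close("trunc.txt")
    # ">>" appends instead, so both lines are kept
    print "first"  > "append.txt";  close("append.txt")
    print "second" >> "append.txt"; close("append.txt")
}'

cat trunc.txt    # prints: second
cat append.txt   # prints: first, then second
```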
Upvotes: 0
Reputation: 13249
Using GNU awk:
awk -v RS='\\$\\$\\$\\$\n' -v nb=2 -v c=1 '
{
file=sprintf("%s%s%06d",FILENAME,".chunk",c)
printf "%s%s",$0,RT > file
}
NR%nb==0 {c++}
' example.sdf
Setting the record separator RS to the pattern $$$$ (followed by a newline) lets awk read a full record at once; RT then holds the separator text that was actually matched.
The variable nb holds the number of records per file, and c is the counter used in the filename.
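With an input of this size, it may be worth checking the chunks before deleting anything. The sketch below assumes the FILENAME.chunkNNNNNN naming and records that end in a line that is exactly $$$$. The splitting step is a plain-awk stand-in on a hypothetical three-record sample, included only so that the check has something to inspect; it is not the GNU awk command above.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical 3-record sample; real SDF records are longer.
printf '%s\n' 'Item1' '$$$$' 'Element2' '$$$$' 'Something3' '$$$$' > example.sdf

# Stand-in splitter (2 records per chunk).
awk 'NR == 1 { out = sprintf("%s.chunk%06d", FILENAME, ++c) }
     { print > out }
     $0 == "$$$$" && ++nr % 2 == 0 { close(out); out = sprintf("%s.chunk%06d", FILENAME, ++c) }' example.sdf

# 1. Records per chunk: count the lines that are exactly "$$$$".
for f in example.sdf.chunk*; do
    printf '%s %s\n' "$f" "$(grep -cxF '$$$$' "$f")"
done

# 2. Round trip: the chunks, concatenated in name order, should be byte-identical to the input.
cat example.sdf.chunk* | cmp - example.sdf && echo "round trip OK"
```

The zero-padded counter is what makes the shell glob return the chunks in the right order for the round-trip check.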
Upvotes: 0
Reputation: 5006
This should work:
awk 'BEGIN{n_records=2; counter=0};{print > "file_" int(counter/n_records) ".txt"; if($0 ~ /\$\$\$\$/){counter++}}' example.sdf
Upvotes: 1