Reputation: 45

Splitting a large file with awk into chunks with a defined number of multi line records

I want to split a large file (>15 GB, several million records) into smaller chunks, each containing a defined number of records. I am using Ubuntu 16.04.

Here are the rules:

For portability, I would like to stick to UNIX commands.
A specific pattern ('$$$$') marks the end of each record in the input file.
This pattern should be kept so that it still separates the records within each chunk.
Each chunk should contain n records.
Records can vary in their number of lines.

I searched for similar questions like this one, but could not find exactly what I was looking for.

Here is an example of the input file syntax.

example.sdf

Item1
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.7946    2.9241    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.9708    2.9673    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
3

$$$$
Element2
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.6161    1.7634    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7956    1.8496    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
5

$$$$
Something3
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.0580    0.5134    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5772    1.1545    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
10

$$$$

Desired output for n=2:

example.sdf.chunk000001

Item1
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.7946    2.9241    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.9708    2.9673    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
3

$$$$
Element2
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.6161    1.7634    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7956    1.8496    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
5

$$$$

example.sdf.chunk000002

Something3
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.0580    0.5134    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5772    1.1545    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
10

$$$$

So far, I have tried to achieve this with split and awk (see below), but the result looks clumsy. I also looked at csplit, but I could not find any option to set a defined number of records per chunk.

split

The split command works perfectly well, but it does not accept '$$$$' as a delimiter because it is more than one character long. I can get it to work by replacing this pattern with a single character (@), but things could go wrong if that character also occurs elsewhere in the SDF file.

# replace the separator with a dummy character
sed -e 's/\$\$\$\$/@/g' example.sdf > example.sdf.tmp
# split the file (3 records) into smaller chunks (xaa, xab, etc.) with at most 2 records each
split -t @ -l 2 example.sdf.tmp
# restore the proper separator in each chunk
for f in xa*; do tail -n +2 "$f" | sed 's/@/\$\$\$\$/g' > "$f.fixed"; done

Unfortunately, editing the input file and then every chunk does not look very efficient, so I tried to use awk instead.

awk

I'm very new to awk, but I managed to get this:

awk 'NR%2==1 {x=sprintf(".chunk%06d",++i);} END {printf "%s",$0} {print>FILENAME x}' RS="\\$\\$\\$\\$" ORS="\$\$\$\$" example.sdf

The first chunk looks exactly like what I am looking for, but the second has two errors:

example.sdf.chunk000002

[ blank line ]     
Something3
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.0580    0.5134    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5772    1.1545    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
10

$$$$
$$$$

As you can see, there is an empty line (which I could not display, so I typed [ blank line ] instead) at the beginning of the file and an extra end pattern at the end of the last chunk. I also tried a file with 9 records: I got the empty line at the beginning of chunks 2-5 and the extra '$$$$' at the end of chunk 5.

How could I fix this behavior so I get the expected output?

Any help would be much appreciated!

Jose Manuel

Upvotes: 2

Answers (4)

Ed Morton

Reputation: 203229

With GNU awk for multi-char RS, RT and handling of multiple open files:

$ awk -v RS='\n[$]{4}\n' 'NR%2{out="out"++c} {print $0 RT " > " out}' file
Item1
  Mrv171c009131823372D

  2  1  0  0  0  0            999 V2000
   -3.7946    2.9241    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.9708    2.9673    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
3

$$$$
 > out1
Element2
  Mrv171c009131823372D

  2  1  0  0  0  0            999 V2000
   -3.6161    1.7634    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7956    1.8496    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
5

$$$$
 > out1
Something3
  Mrv171c009131823372D

  2  1  0  0  0  0            999 V2000
   -3.0580    0.5134    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5772    1.1545    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
10

$$$$
 > out2

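Just change " > " to > after you've tested and are happy with the output. Note that RT already ends with a newline and print appends another one, so use printf "%s%s", $0, RT > out if you do not want an empty line after each $$$$.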

With any awk:

awk '
    NR==1 { out="out"++c }
    { print > out }
    ($0=="$$$$") && (((++nr)%2)==0) { close(out); out="out"++c }
' file
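
To get the file names asked for in the question, the same idea could be parameterized roughly as follows. This is only a sketch under a few assumptions: the variable n (records per chunk) and the tiny demo.sdf input are made up for illustration, and every record is assumed to end with a line that is exactly $$$$.

```shell
#!/bin/sh
# Hypothetical demo input: three small records, each ended by a "$$$$" line.
printf '%s\n' 'Item1' 'a' '$$$$' 'Element2' 'b' 'c' '$$$$' 'Something3' 'd' '$$$$' > demo.sdf

# n is an assumed variable name for the number of records per chunk.
awk -v n=2 '
    # Name the first chunk when the first line is read.
    NR == 1 { out = sprintf("%s.chunk%06d", FILENAME, ++c) }
    # Every line, including the terminator, goes to the current chunk.
    { print > out }
    # After every n-th terminator, close the chunk and name the next one.
    ($0 == "$$$$") && (((++nr) % n) == 0) {
        close(out)
        out = sprintf("%s.chunk%06d", FILENAME, ++c)
    }
' demo.sdf

# List the chunks that were written.
ls demo.sdf.chunk*
```

With the demo input this writes demo.sdf.chunk000001 (two records) and demo.sdf.chunk000002 (one record). No empty third chunk appears, because a file is only created once something is printed to it.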

Upvotes: 1

kvantour

Reputation: 26471

Here is a small update to the solution of Corentin Limier.

original:

awk 'BEGIN{n_records=2; counter=0}
    { print > "file_" int(counter/n_records) ".txt";
      if($0 ~ /\$\$\$\$/){counter++}}' example.sdf

update:

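awk 'BEGIN{n_records=2; }
     (NR==1){ file=sprintf(FILENAME ".chunk%0.6d",counter) }
     { print > file }
     ($0=="$$$$") && ((++counter%n_records)==0) {
         # close only at a chunk boundary: reopening a closed file with ">" truncates it
         close(file);
         file=sprintf(FILENAME ".chunk%0.6d",counter/n_records)
     }' example.sdf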

The differences are:

any variable is by default zero or an empty string, so there is no need to set counter=0
the variable file holds the output filename, so it is not rebuilt for every line
the file is closed when it is not needed anymore
we test whether the whole line equals the separator, instead of matching the pattern anywhere in the line
the output files are named FILENAME.chunknnnnnn, where FILENAME is the name of the input file (here example.sdf)
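
One detail about close() that matters for this kind of script: if a file is closed and then written to again with >, awk reopens it and truncates it. The following sketch illustrates this on a made-up three-line input (in.txt, each.out and once.out are hypothetical names used only for the demonstration).

```shell
#!/bin/sh
# Hypothetical three-line input.
printf '%s\n' one two three > in.txt

# Closing after every line: each reopen with ">" truncates the file,
# so only the last line survives.
awk '{ print > "each.out"; close("each.out") }' in.txt

# Closing once, when the file is no longer needed, keeps every line.
awk '{ print > "once.out" } END { close("once.out") }' in.txt

cat each.out    # prints: three
cat once.out    # prints: one, two, three (one per line)
```

So a chunk should only be closed once its last record has been written, or >> has to be used when reopening it.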

Upvotes: 0

oliv

Reputation: 13249

Using GNU awk:

awk -v RS='\\$\\$\\$\\$\n' -v nb=2 -v c=1 '
{
   file=sprintf("%s%s%06d",FILENAME,".chunk",c)
   printf "%s%s",$0,RT > file 
}
NR%nb==0 {c++}
' example.sdf

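Setting the record separator RS to the pattern $$$$ (followed by a newline) makes awk read a full record at once, and RT holds the text that matched RS so it can be written back.

The variable nb holds the number of records per chunk, and c is the counter used in the chunk filename.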

Upvotes: 0

Corentin Limier

Reputation: 5006

This should work:

awk 'BEGIN{n_records=2; counter=0};{print > "file_" int(counter/n_records) ".txt"; if($0 ~ /\$\$\$\$/){counter++}}' example.sdf
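
One thing to be aware of: the test $0 ~ /\$\$\$\$/ matches the pattern anywhere in a line, so a data line that happens to contain $$$$ would also be counted as the end of a record. A small illustration on a made-up two-line file (lines.txt is a hypothetical name):

```shell
#!/bin/sh
# Hypothetical input: one data line that merely contains "$$$$",
# followed by one real terminator line.
printf '%s\n' 'note: $$$$ inside a data line' '$$$$' > lines.txt

# Regex match anywhere in the line: counts both lines.
awk '$0 ~ /\$\$\$\$/ { n++ } END { print n + 0 }' lines.txt    # prints 2

# Exact comparison with the whole line: counts only the terminator.
awk '$0 == "$$$$" { n++ } END { print n + 0 }' lines.txt       # prints 1
```

If such lines can occur in the input, comparing the whole line with $0 == "$$$$" (or anchoring the regex as /^\$\$\$\$$/) is the safer test.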

Upvotes: 1
