Jose Manuel

Reputation: 45

Splitting a large file with awk into chunks with a defined number of multi line records

I want to split a large file (>15 GB, several million records) into smaller chunks, each with a defined number of records. I am using Ubuntu 16.04.

Here are the rules:

  1. For portability, I would like to stick to standard UNIX commands.
  2. A specific pattern ('$$$$') marks the end of each record in the input file.
  3. This pattern must be preserved as the record separator in the chunks.
  4. Each chunk should contain n records.
  5. Records can vary in their number of lines.

I searched for similar questions like this one, but could not find exactly what I was looking for.

Here is an example of the input file syntax.

example.sdf

Item1
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.7946    2.9241    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.9708    2.9673    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
3

$$$$
Element2
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.6161    1.7634    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7956    1.8496    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
5

$$$$
Something3
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.0580    0.5134    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5772    1.1545    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
10

$$$$

Desired output for n=2:

example.sdf.chunk000001

Item1
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.7946    2.9241    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.9708    2.9673    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
3

$$$$
Element2
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.6161    1.7634    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7956    1.8496    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
5

$$$$

example.sdf.chunk000002

Something3
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.0580    0.5134    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5772    1.1545    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
10

$$$$

So far, I have tried to achieve this with split and awk (see below), but both attempts look clumsy. I also looked at csplit, but I could not find an option to set a defined number of records per chunk.

split

The split command works fine, but it does not accept '$$$$' as a delimiter because it is more than one character. I can get it to work by replacing this pattern with a single character (@), but things could go wrong if that character already occurs in the SDF file.

# replace the separator with a dummy
sed -e 's/\$\$\$\$/@/g' example.sdf > example.sdf.tmp
# split the file (3 records) into smaller chunks (xaa, xab, etc.) with max 2 records
split -t @ -l 2 example.sdf.tmp
# replace the dummy with the proper separator
for f in xa*; do tail -n +2 "$f" | sed 's/@/\$\$\$\$/g' > "$f.fixed"; done

Unfortunately, editing the input file and then every chunk does not seem very efficient, so I tried to use awk instead.

awk

I'm very new to awk, but I managed to get this:

awk 'NR%2==1 {x=sprintf(".chunk%06d",++i);} END {printf "%s",$0} {print>FILENAME x}' RS="\\$\\$\\$\\$" ORS="\$\$\$\$" example.sdf

The first chunk looks exactly like what I want, but the second has two errors:

example.sdf.chunk000002

[ blank line ]     
Something3
  Mrv171c009131823372D          

  2  1  0  0  0  0            999 V2000
   -3.0580    0.5134    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5772    1.1545    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
10

$$$$
$$$$

As you can see, there is an empty line at the beginning of the file (I could not display it, so I typed [ blank line ] instead) and an extra end pattern at the end of the last chunk. I also tried a file with 9 records: I got the empty line at the beginning of chunks 2-5 and the extra '$$$$' at the end of chunk 5.

How could I fix this behavior so I get the expected output?

Any help would be much appreciated!


Upvotes: 2

Views: 1675

Answers (4)

Ed Morton

Reputation: 203229

With GNU awk for multi-char RS, RT and handling of multiple open files:

$ awk -v RS='\n[$]{4}\n' 'NR%2{out="out"++c} {print $0 RT " > " out}' file
Item1
  Mrv171c009131823372D

  2  1  0  0  0  0            999 V2000
   -3.7946    2.9241    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.9708    2.9673    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
3

$$$$
 > out1
Element2
  Mrv171c009131823372D

  2  1  0  0  0  0            999 V2000
   -3.6161    1.7634    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7956    1.8496    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
5

$$$$
 > out1
Something3
  Mrv171c009131823372D

  2  1  0  0  0  0            999 V2000
   -3.0580    0.5134    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5772    1.1545    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  END
> <property_1>
10

$$$$
 > out2

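Once you've tested and are happy with the output, change print $0 RT " > " out to printf "%s%s", $0, RT > out. A plain print $0 RT > out would also write the files, but it adds a blank line after each record because RT already ends in a newline.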

With any awk:

awk '
    NR==1 { out="out"++c }
    { print > out }
    ($0=="$$$$") && (((++nr)%2)==0) { close(out); out="out"++c }
' file
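A possible variation on the portable script, sketched under the assumption that you want the chunk size as a variable and the zero-padded chunk names from the question. The names n, nr and sample.sdf are placeholders for illustration, not part of the script above:

```shell
# Build a tiny 3-record test file (hypothetical stand-in for example.sdf).
printf 'A\n1\n$$$$\nB\n$$$$\nC\n2\n3\n$$$$\n' > sample.sdf

# n = records per chunk (assumed variable name).
awk -v n=2 '
    # pick the first chunk name when the first line is read
    NR == 1 { out = sprintf("%s.chunk%06d", FILENAME, ++c) }
    # every line, delimiter included, goes to the current chunk
    { print > out }
    # after every n-th delimiter line, close the chunk and name the next one
    $0 == "$$$$" && ++nr % n == 0 {
        close(out)
        out = sprintf("%s.chunk%06d", FILENAME, ++c)
    }
' sample.sdf
```

On the test file this should leave sample.sdf.chunk000001 with two records and sample.sdf.chunk000002 with one. Because each chunk is closed before the next is opened, only one output file is open at a time, which matters for awks that limit the number of open files.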

Upvotes: 1

kvantour

Reputation: 26471

Here is a small update to Corentin Limier's solution.

original:

awk 'BEGIN{n_records=2; counter=0}
    { print > "file_" int(counter/n_records) ".txt";
      if($0 ~ /\$\$\$\$/){counter++}}' example.sdf

update:

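awk 'BEGIN{n_records=2; }
     (NR==1){ file=sprintf("%s.chunk%06d",FILENAME,counter) }
     { print > file }
     # switch files only after every n_records-th record: reopening a
     # closed file with ">" would truncate what was already written
     ($0=="$$$$") && (++counter%n_records==0){
         close(file);
         file=sprintf("%s.chunk%06d",FILENAME,counter/n_records)
     }' example.sdf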

The differences are:

  • any variable is by default ZERO or an empty string, so no need to define counter=0
  • variable file holds the filename, so it is not generated at each step
  • the file is closed when it is not needed anymore.
  • The separator test is an exact string comparison ($0=="$$$$"), so a line counts as a record end only if it consists of $$$$ and nothing else, not if it merely contains it.
  • The output files will have the form FILENAME.chunknnnnnn where FILENAME is substituted by the original file called here example.sdf
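Two of the points above can be checked directly at the shell. This is only a small illustration; the variable names x, a, b, m and e and the sample lines are made up for the demo:

```shell
# An unset awk variable compares equal to both 0 and the empty string.
awk 'BEGIN { a = (x == 0); b = (x == ""); print a, b }'
# prints: 1 1

# A regex match fires on any line that contains $$$$,
# while string equality fires only on the bare delimiter line.
printf 'abc$$$$def\n$$$$\n' |
    awk '{ m = ($0 ~ /\$\$\$\$/); e = ($0 == "$$$$"); print m, e }'
# prints: 1 0
#         1 1
```

The second command shows why the exact comparison is the safer test: a data line that happens to contain $$$$ would otherwise be counted as the end of a record.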

Upvotes: 0

oliv

Reputation: 13249

Using GNU awk:

awk -v RS='\\$\\$\\$\\$\n' -v nb=2 -v c=1 '
{
   file=sprintf("%s%s%06d",FILENAME,".chunk",c)
   printf "%s%s",$0,RT > file 
}
NR%nb==0 {c++}
' example.sdf

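Setting the record separator RS to the pattern $$$$ lets awk read a full multi-line record at once; RT holds the text that actually matched RS, so printing it restores the delimiter.

The variable nb holds the number of records per file, and c is the counter used in the output file name.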

Upvotes: 0

Corentin Limier

Reputation: 5006

This should work:

awk 'BEGIN{n_records=2; counter=0};{print > ("file_" int(counter/n_records) ".txt"); if($0 ~ /\$\$\$\$/){counter++}}' example.sdf
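A quick way to check the result on a small input, sketched with a hypothetical three-record sample.sdf. The file name expression is parenthesized here because an unparenthesized concatenation after > is not parsed the same way by every awk:

```shell
# Tiny 3-record test file (made-up content).
printf 'A\n1\n$$$$\nB\n$$$$\nC\n2\n$$$$\n' > sample.sdf

awk 'BEGIN{n_records=2; counter=0};{print > ("file_" int(counter/n_records) ".txt"); if($0 ~ /\$\$\$\$/){counter++}}' sample.sdf

# Number of records (delimiter lines) in each chunk.
grep -c '^\$\$\$\$$' file_0.txt file_1.txt

# The chunks, concatenated in order, should reproduce the input byte for byte.
cat file_0.txt file_1.txt | cmp - sample.sdf && echo "chunks match input"
```

With n_records=2 the first chunk should report 2 records and the second 1. On a real multi-gigabyte file this one-liner never closes its output files, so an awk with a low open-file limit may need a close() once a chunk is complete.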

Upvotes: 1
