Reputation: 780
I have an XML file that is composed of numerous XML records delimited by tags like:
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273127</PMID>
...
...
...
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273128</PMID>
...
...
...
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273129</PMID>
...
...
...
</PubmedArticle>
I can generate individual files using
awk '/<PubmedArticle/{f=1; out="mysmallfile_"(++c)".xml"} f{print > out} /<\/PubmedArticle>/{close(out); f=0}' mylargefile
see https://stackoverflow.com/a/56892175/6876770
How can I generate files composed of a specific number of records each? For example, supposing I had a large XML file with 1000 XML records and I wanted to create 2 x 500-record XML files?
I'm thinking that awk should save to one file until it reaches the defined number of tag matches, and then start saving to the next.
Upvotes: 0
Views: 464
Reputation: 204488
The difference between your problem and the one in the question you referenced is that their input contained blocks they did not want to appear in the output, while you want every line of input to appear in an output file. They had to print only what was between the start and end tags, and so had to test for both; you don't have that problem, and only need to test for the start OR end tag to determine when to change output files.
With any awk:
$ awk -v maxRecs=2 '
    /<PubmedArticle>/ && ((++recNr % maxRecs) == 1) {
        close(out); out="mysmallfile_" (++fileNr) ".xml"
    }
    { print > out }
' file
$ head mysmallfile*
==> mysmallfile_1.xml <==
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273127</PMID>
...
...
...
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
==> mysmallfile_2.xml <==
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273129</PMID>
...
...
...
</PubmedArticle>
or with GNU awk for multi-char RS and RT:
$ awk -v maxRecs=2 -v RS='</PubmedArticle>\n' -v ORS= '
    (NR % maxRecs) == 1 {
        close(out); out="mysmallfile_" (++fileNr) ".xml"
    }
    RT { print $0 RT > out }
' file
$ head mysmallfile*
==> mysmallfile_1.xml <==
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273127</PMID>
...
...
...
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
==> mysmallfile_2.xml <==
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273129</PMID>
...
...
...
</PubmedArticle>
I'm calling close() to close every output file as we go, to avoid a possible "too many open files" error from most awks, or a slowdown from gawk.
The above assumes your goal is to create as many files of maxRecs length as possible and then put whatever is left of the input into the last file. So with maxRecs=500, an input file of 800 records would give you one output file of 500 records and another of 300, as opposed to two output files of 400 each.
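If you want to confirm how the records were distributed, a quick sanity check (using grep; not part of the solution above) is to count the opening tags in each output file:

```shell
# Count how many <PubmedArticle> records landed in each output file
grep -c '<PubmedArticle>' mysmallfile_*.xml
```

With multiple files, grep prints one filename:count line per file, so you can see at a glance which file got the leftover records.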
Upvotes: 2
Reputation: 36735
This part
/<PubmedArticle/{f=1; out="mysmallfile_"(++c)".xml"}
might be modified to produce the same out value n times by harnessing integer division; for example, for n=3 it becomes
/<PubmedArticle/{f=1; out="mysmallfile_"int(c++/3)".xml"}
This will give for subsequent lines matching <PubmedArticle
mysmallfile_0.xml
mysmallfile_0.xml
mysmallfile_0.xml
mysmallfile_1.xml
mysmallfile_1.xml
mysmallfile_1.xml
mysmallfile_2.xml
mysmallfile_2.xml
mysmallfile_2.xml
and so on. Note that I used c++ rather than ++c, as using the latter would cause the first filename to repeat only n-1 times, and each subsequent one n times.
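The difference between the two increment styles can be seen directly; this quick demo (not part of the answer) prints the filename index each style produces over six consecutive matches with n=3:

```shell
# c++ yields even groups of n starting at 0: 0 0 0 1 1 1
awk 'BEGIN { for (i = 0; i < 6; i++) printf "%d ", int(c++/3); print "" }'
# ++c makes the first index appear only n-1 times: 0 0 1 1 1 2
awk 'BEGIN { for (i = 0; i < 6; i++) printf "%d ", int(++c/3); print "" }'
```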
This part
/<\/PubmedArticle>/{close(out); f=0}
might be adapted using the remainder of division as follows
/<\/PubmedArticle>/&&c%3==0{close(out); f=0}
The additional condition holds true when c is evenly divisible by 3, which is only the case on the last use of a given out.
(tested in GNU Awk 5.0.1)
Upvotes: 1