Reputation: 13

How to split a huge xml-file to smaller files after nth occurrence of certain tag?

I have a 30 GB xml-file, and I would like to split it to smaller files.

The data in the file is like this:

<film>.....</film>
.
.
.
.
.
.
<film>.....</film>

I could use "split -l" but the problem is that some film-elements contain text-data with line breaks. So one film-element may take more than one line.

What I would like to do is to split it so that each new smaller file would contain for example 3000 film-elements. So it should split it after every 3000th film-tag...

I am using Mac OS X and I would like to have an awk solution.

I tried the approach from "split file on Nth occurrence of delimiter", but without success: it did not split the file after the closing film tags.

Upvotes: 1

Answers (3)

George Vasiliou

Reputation: 6345

When Ed Morton posts an awk solution, it is usually a small tutorial for low-level users like me...

In any case, since I have been working on this exercise for the last hour and a half, I thought I would take the risk and post this solution, which is an adaptation of the answer you already linked:

$ awk '$0 ~/<film.*>/{++delim} {file = sprintf("chunk%s", int(delim/7)); print >file; }' file4

Testing:
I used a small bash loop to create a small film file with 49 records (the loop runs from 1 to 49) and split it into groups of 7 films for testing:

$ for ((i=1;i<50;i++));do echo -e "<film$i>..............</film$i>" >>file4;done
$ head file4
<film1>..............</film1>
<film2>..............</film2>
<film3>..............</film3>
<film4>..............</film4>
<film5>..............</film5>
<film6>..............</film6>
<film7>..............</film7>
<film8>..............</film8>
<film9>..............</film9>
<film10>..............</film10>

$ awk '$0 ~/<film.*>/{++delim} {file = sprintf("chunk%s", int(delim/7)); print >file; }' file4 
$ cat chunk0
<film1>..............</film1>
<film2>..............</film2>
<film3>..............</film3>
<film4>..............</film4>
<film5>..............</film5>
<film6>..............</film6>

Another test in which each film has some newlines:

$ for ((i=1;i<50;i++));do echo -e "<film$i>...\n...\n...\n.....</film$i>" >>file4;done
$ head -n20 file4
<film1>...
...
...
.....</film1>
<film2>...
...
...
.....</film2>
<film3>...
...
...
.....</film3>
<film4>...
...
...
.....</film4>
<film5>...
...
...
.....</film5>


$ awk '$0 ~/<film.*>/{++delim} {file = sprintf("chunk%s", int(delim/7)); print >file; }' file4 

$ ls chunk*
chunk0  chunk1  chunk2  chunk3  chunk4  chunk5  chunk6  chunk7

$ cat chunk1
<film7>...
...
...
.....</film7>
<film8>...
...
...
.....</film8>
<film9>...
...
...
.....</film9>
<film10>...
...
...
.....</film10>
<film11>...
...
...
.....</film11>
<film12>...
...
...
.....</film12>
<film13>...
...
...
.....</film13>

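Well, it seems to work in both cases. Mind that in this configuration the input file is split per 7 films, not per 7 lines. You can change this number to whatever you need. Note that the first chunk holds one film fewer than the others (chunk0 above ends at film6), because delim is incremented before the division; using int((delim-1)/7) instead gives chunks of equal size.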
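
A variant of the one-liner above, as a minimal sketch rather than something tested on the real data: it starts a new file after every n-th closing tag, so every chunk gets the same number of films, and it closes each finished chunk, since with thousands of output files some awk versions otherwise run out of file handles. The sample data, the chunkNNN.xml names and the variable n are made up for the demo, and the script assumes at most one </film> per line.

```shell
# Work in a scratch directory so the demo does not touch real data.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample: 10 films, each spanning three lines.
for i in 1 2 3 4 5 6 7 8 9 10; do
    printf '<film>\ntitle %s\n</film>\n' "$i"
done > films.xml

# n = films per chunk (it would be 3000 for the real file).
awk -v n=4 '
    !out { out = sprintf("chunk%03d.xml", ++part) }            # open the next chunk lazily
    { print > out }                                            # copy every line unchanged
    /<\/film>/ && ++count % n == 0 { close(out); out = "" }    # chunk is full: close it
' films.xml

# Count the films that ended up in each chunk.
grep -c "<film>" chunk*.xml
# chunk001.xml:4
# chunk002.xml:4
# chunk003.xml:2
```

The zero-padded names keep the chunks in order when they are listed with a glob such as chunk*.xml.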

Upvotes: 1

Ed Morton

Reputation: 204488

Chances are something like this is what you're looking for:

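# Stage 1 escapes @ and }, then marks the end of every film with a bare }.
awk '{ gsub(/@/,"@A"); gsub(/}/,"@B"); gsub(/<\/film>/,"}") } 1' file |
awk -v RS='}' -v ORS='</film>\n' '
    !/<film/ { next }    # skip whatever follows the last </film>
    (n++ % 3000) == 0 { close(out); out = "out" (++cnt) }    # new file every 3000 films
    { gsub(/@B/,"}"); gsub(/@A/,"@"); sub(/^\n/,""); print > out }
'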

but without sample input/output it's a guess and, of course, untested.
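
The idea behind the two stages, shown on a single made-up line: the first awk rewrites every existing @ and } so that } is free to mean "end of film", which lets the second awk use it as a one-character record separator and then undo the rewriting. This is only an illustrative sketch; the sample string and the variable names encoded and decoded are assumptions for the demo.

```shell
# A hypothetical film fragment that already contains both special characters.
sample='a}b@c</film>'

# Encode: @ becomes @A, } becomes @B, and only then </film> becomes }.
encoded=$(printf '%s\n' "$sample" |
    awk '{ gsub(/@/,"@A"); gsub(/}/,"@B"); gsub(/<\/film>/,"}") } 1')
printf '%s\n' "$encoded"    # a@Bb@Ac}

# Decode: read }-terminated records, restore } and @, re-append the closing tag.
decoded=$(printf '%s' "$encoded" |
    awk -v RS='}' '{ gsub(/@B/,"}"); gsub(/@A/,"@"); print $0 "</film>" }')
printf '%s\n' "$decoded"    # a}b@c</film>
```

Decoding @B before @A matters: a literal "@B" in the input is encoded as "@AB", which only turns back into "@B" if the @A step runs last.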

Upvotes: 2

Michael Kay

Reputation: 163595

A job for streaming XSLT 3.0:

<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode streamable="true" on-no-match="shallow-copy"/>
  <xsl:template match="/*">
    <xsl:for-each-group select="*" group-adjacent="(position()-1) idiv 3000">
      <xsl:result-document href="chunk{position()}.xml">
        <!-- the context item is the first element of the group, so name(..) is the wrapper's name -->
        <xsl:element name="{name(..)}">
          <!-- copy every element of the group, not just the first one -->
          <xsl:copy-of select="current-group()"/>
        </xsl:element>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>
</xsl:transform>

This is going to be much more robust than an awk solution because it actually parses the XML, so it checks that the input is well-formed and guarantees well-formed output. When you're processing 30 GB, you can't check the output by hand, so there's a grave danger of undetected garbage if you fail to anticipate everything that can arise in the input (e.g. a film with "film" in its title). Working properly on the structure of the markup is much safer.
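
If you do go the text-processing route, a couple of cheap mechanical checks can at least catch the grossest damage. The following is a rough sketch, not a substitute for parsing: the file names and contents are invented stand-ins, and it applies only to a splitter that copies lines unchanged into zero-padded chunk files.

```shell
cd "$(mktemp -d)" || exit 1

# Hypothetical stand-ins for the real input and the chunks produced from it.
printf '<film>a</film>\n<film>b\nb2</film>\n<film>c</film>\n' > films.xml
printf '<film>a</film>\n<film>b\nb2</film>\n' > chunk001.xml
printf '<film>c</film>\n' > chunk002.xml

# 1. Nothing lost, added or reordered: the chunks, concatenated in name
#    order, must be byte-identical to the input.
if cat chunk*.xml | cmp -s - films.xml; then
    echo "content ok"
else
    echo "content differs"
fi

# 2. No film straddles two files: every chunk must end on a closing tag.
for f in chunk*.xml; do
    tail -n 1 "$f" | grep -q '</film>[[:space:]]*$' || echo "bad cut in $f"
done
```

Both checks stream through the data, so they also work on a 30 GB input, but they say nothing about whether each chunk is well-formed XML.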

The other thing is that if your input is well-formed XML, it has a wrapper element around the <film> elements, and if the output is to be processed as XML, it will need a similar wrapper element. The XSLT solution handles this for free.
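
For comparison, with awk the wrapper has to be written by hand. A rough sketch under several assumptions: the wrapper is called films (a made-up name, passed in as root), the partN.xml names and n are hypothetical, each line holds at most one opening and one closing film tag, and anything outside the film elements may be dropped.

```shell
cd "$(mktemp -d)" || exit 1

# Hypothetical input: a <films> wrapper around five one-line film elements.
{
    echo '<films>'
    for i in 1 2 3 4 5; do echo "<film>title $i</film>"; done
    echo '</films>'
} > films.xml

awk -v n=2 -v root=films '
    /<film[ >]/ {                          # an opening film tag
        if (!out) {                        # first film of a chunk: open file, write wrapper
            out = "part" (++part) ".xml"
            printf "<%s>\n", root > out
        }
        infilm = 1
    }
    infilm { print > out }                 # copy only lines that belong to a film
    /<\/film>/ {                           # a closing film tag
        infilm = 0
        if (++count % n == 0) {            # chunk is full: close wrapper and file
            printf "</%s>\n", root > out
            close(out)
            out = ""
        }
    }
    END { if (out) printf "</%s>\n", root > out }   # close a partly filled last chunk
' films.xml

cat part3.xml
# <films>
# <film>title 5</film>
# </films>
```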

As you may have noticed, this stylesheet can split any XML file into chunks, and of course the chunk size could easily be supplied as a parameter.

Upvotes: 3
