Reputation: 1754
I have a very large XML file (1.25 GB) that I need to split into smaller files so that I can process it. The file contains linguistic data, with each section headed and footed by the tags:
< text id="www.example.com>
and
< /text>
I would like to split the large file on these tags, so that, for example,
< text id="www.example.com>
Hello
< /text>
< text id="www.example.com>
This is
< /text>
< text id="www.example.com>
An Example
< /text>
would become three separate files, each beginning and ending with the "text" tags. For example:
File 1
< text id="www.example.com>
Hello
< /text>
File 2
< text id="www.example.com>
This is
< /text>
File 3
< text id="www.example.com>
An Example
< /text>
I suppose this could be scripted in Perl, for instance, but I'm wondering whether there is any kind of "one stop shop" way to split this file using Unix tools.
I know that the split command can break a large file into smaller files by line count or file size. Is there a similar command that splits on an XML tag?
Thanks in advance for any help!
Upvotes: 1
Views: 7207
Reputation: 1754
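The following Perl program, adapted from "Split one file into multiple files based on delimiter", also seems to do the trick, with no cap on the number of output files:
#!/usr/bin/perl
use strict;
use warnings;

# Copy each block ending in </text> from file.txt into its own res.N.txt file.
open(my $in, '<', 'file.txt') or die "Cannot open file.txt: $!";
my $cur = 0;
my $out;
while (my $line = <$in>)
{
    # Open the next output file only when there is a line to write,
    # so no empty file is left behind after the last block.
    if (!defined $out)
    {
        open($out, '>', "res.$cur.txt") or die "Cannot open res.$cur.txt: $!";
    }
    print $out $line;
    if ($line =~ /^<\/text>/) # closing tag ends the current block
    {
        close($out);
        undef $out;
        $cur++;
    }
}
close($out) if defined $out;
close($in);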
Cheers.
Upvotes: 2
Reputation: 5241
It's a lot more complicated than a simple awk command, and I don't know whether the file would be too big, but you could try using an XSLT 2.0 stylesheet with xsl:result-document to produce all of your files.
One advantage of using XSLT over a regex is that it copes better if the file format changes slightly or if there are attributes on the nodes you want to split on.
Upvotes: 1
Reputation: 1754
The following awk command solves the problem, but unfortunately caps out at around 1000 output files:
awk '{print $0 "</text>" > "file" NR}' RS='</text>' input-file
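The cap of roughly 1000 files is probably awk running into the per-process limit on open files, since the one-liner never closes the files it writes. A minimal sketch that closes each output file as soon as its block ends is shown here. It assumes that the tags start at the beginning of a line and are written as <text ...> and </text> without the space after the angle bracket; the function name split_text_blocks and the output prefix are made up for illustration.

```shell
# Hypothetical helper: split_text_blocks INPUT [PREFIX]
# Writes each <text ...> ... </text> block of INPUT to PREFIX1, PREFIX2, ...
split_text_blocks() {
    awk -v prefix="${2:-file}" '
        /^<text[ >]/ { n++; out = prefix n }   # opening tag: pick the next output file
        out != ""    { print > out }           # copy the line into the current file
        /^<\/text>/  { close(out); out = "" }  # closing tag: release the file handle
    ' "$1"
}
```

Running split_text_blocks input-file would then write file1, file2, and so on, holding only one output file open at a time. Lines outside any text block are skipped.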
Upvotes: 1