backtrack

Reputation: 8144

Splitting XML files using a Perl script

Hi, I'm working on a Perl script to split a big XML file into small chunks. I have referred to this link: Split file by XML tag

My code looks like this:

if($line =~ /^</row>/)
{
$count++;
}

but I'm getting this error:

 works\filesplit.pl line 20.
Bareword found where operator expected at E:\Work\perl works\filesplit.pl line 20, near "/^</row"
        (Missing operator before row?)
syntax error at E:\Work\perl works\filesplit.pl line 20, near "/^</row"
Search pattern not terminated at E:\Work\perl works\filesplit.pl line 20.

Can anyone help me?

Update: here is a sample of the input data.

<row>
  <date></date>
  <ForeignpostingId />
  <country>11</country>
  <domain>http://www.xxxx.com</domain>
  <domainid>20813</domainid>
 </row>
 <row>
  <date></date>
  <ForeignpostingId />
  <country>11</country>
  <domain>http://www.xxxx.com</domain>
  <domainid>20813</domainid>
 </row>
 <row>
  <date></date>
  <ForeignpostingId />
  <country>11</country>
  <domain>http://www.xxxx.com</domain>
  <domainid>20813</domainid>
 </row>

Upvotes: 2

Views: 3012

Answers (4)

mirod

Reputation: 16161

Have you tried xml_split? It's a tool that comes with XML::Twig that's specifically designed to split big XML files, based on a variety of criteria (tag name, level, size).

Upvotes: 3

Kenosis

Reputation: 6204

Perhaps the following will be helpful:

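use strict;
use warnings;

my $i = 1;
local $/ = '<row>';    # read the input in chunks that end at each <row>

while (<>) {
    chomp;                 # strip the trailing <row> separator
    s!</row>!! or next;    # drop </row>; skip chunks that have none

    my $name = sprintf 'File_%05d.xml', $i++;
    open my $fh, '>', $name or die "Cannot open $name: $!";
    print $fh $_;
    close $fh or die "Cannot close $name: $!";
}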

Usage: perl script.pl inFile.xml

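This sets Perl's input record separator $/ to <row>, so the XML file is read in 'chunks' delimited by <row> rather than line by line. chomp strips the trailing <row> from each chunk, the substitution removes the </row> (chunks without one, such as anything before the first <row>, are skipped), and what remains is written to a file named "File_nnnnn.xml". Note that the output files therefore contain only the contents of each row, without the <row> and </row> tags themselves.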
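To see the record-separator idea in isolation, here is a minimal sketch that applies it to a string instead of a file. The helper name split_rows and the sample string are made up for illustration; the read loop itself follows the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the body of each <row>...</row> element in $xml.
sub split_rows {
    my ($xml) = @_;
    my @bodies;

    # Opening a filehandle on a scalar reference reads from the string.
    open my $in, '<', \$xml or die "Cannot read from string: $!";

    local $/ = '<row>';    # each read now ends at the next "<row>"
    while ( my $chunk = <$in> ) {
        chomp $chunk;                    # removes the trailing "<row>"
        $chunk =~ s!</row>!! or next;    # skip chunks with no closing tag
        push @bodies, $chunk;
    }
    close $in;
    return @bodies;
}

my @rows = split_rows(
    "<row>\n  <country>11</country>\n </row>\n <row>\n  <country>12</country>\n </row>\n"
);
print scalar(@rows), " chunks\n";    # prints "2 chunks"
```

Because $/ is set with local inside the sub, the separator goes back to its previous value as soon as the sub returns.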

Upvotes: 2

prashant

Reputation: 1484

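#!/bin/perl
use strict;
use warnings;

## splitting xml files using perl script
## Copies the first <tag>...</tag> element of the input file to OutputFile_<tag>.

print "Input File ? ";
chomp( my $XmlFile = <STDIN> );

open my $XmlFileHandle, '<', $XmlFile or die "Cannot open $XmlFile: $!";

print "\nSplit By which Tag ? ";
chomp( my $splitby = <STDIN> );

open my $OutputHandle, '>', 'OutputFile_' . $splitby
    or die "Cannot open OutputFile_${splitby}: $!";

## e.g. to split by <user>...</user>, enter "user"
## skip ahead to the first opening tag
while (<$XmlFileHandle>) {
    if (/<\Q$splitby\E>/) {
        print $OutputHandle "<$splitby>\n";
        last;
    }
}

## copy lines until the closing tag is found
while ( my $line = <$XmlFileHandle> ) {
    if ( $line =~ m/<\/\Q$splitby\E>/ ) {
        print $OutputHandle "</$splitby>";
        last;
    }
    print $OutputHandle $line;
}

close $OutputHandle or die "Cannot close OutputFile_${splitby}: $!";
print "\nOutput File is : OutputFile_$splitby\n";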

Upvotes: 0

slayedbylucifer

Reputation: 23502

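You need ^<\/row>, provided that you are trying to match </row> at the beginning of the line. In /^</row>/ the unescaped slash in </row> ends the pattern early, so Perl sees the pattern /^</ followed by the bareword row, which is what the error messages are complaining about. Escaping the slash fixes this. Here is my test code.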

#!/usr/bin/perl
use strict;
use warnings;

my $line = "</row> something";
if ($line =~ /^<\/row>/)
{
    print "found a match \n";
}

OUTPUT:

# perl test.pl 
found a match 
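As an alternative to escaping the slash, Perl lets you pick a different pattern delimiter with m, so the / becomes an ordinary character. A small sketch of that variant (the helper name starts_with_row_end is made up for this example):

```perl
use strict;
use warnings;

# Returns 1 if the line starts with the closing </row> tag, else 0.
# m{...} uses braces as the pattern delimiters, so "/" needs no escaping.
sub starts_with_row_end {
    my ($line) = @_;
    return $line =~ m{^</row>} ? 1 : 0;
}

print starts_with_row_end("</row> something") ? "found a match\n" : "no match\n";
print starts_with_row_end(" </row>")          ? "found a match\n" : "no match\n";
```

This prints "found a match" for the first line and "no match" for the second, because the second line has a space before the tag.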

Update

Posting this update after the OP provided sample data.

You need ^\s+<\/row> in your regex, because in the sample data the closing tags do not start at the beginning of the line: they have a space before them. \s+ matches one or more whitespace characters at the beginning of the line before the actual tag. If some closing tags may also start in the first column, use \s* (zero or more) instead.

code:

#!/usr/bin/perl -w
use strict;
use warnings;

while (my $line = <DATA>)
{
    if ($line =~ /^\s+<\/row>/)
    {
        print "found a match \n";
    }
}

__DATA__
<row>
  <date></date>
  <ForeignpostingId />
  <country>11</country>
  <domain>http://www.xxxx.com</domain>
  <domainid>20813</domainid>
 </row>
 <row>
  <date></date>
  <ForeignpostingId />
  <country>11</country>
  <domain>http://www.xxxx.com</domain>
  <domainid>20813</domainid>
 </row>
 <row>
  <date></date>
  <ForeignpostingId />
  <country>11</country>
  <domain>http://www.xxxx.com</domain>
  <domainid>20813</domainid>
 </row>

Output:

# perl test.pl 
found a match 
found a match 
found a match 
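Going one step further, and assuming the goal of the $count++ in the question is to start a new chunk every N rows, the match can drive the splitting roughly like this. The helper chunk_rows and the sample lines are hypothetical, and the sketch uses \s* so that a closing tag with no leading whitespace is counted too.

```perl
use strict;
use warnings;

# Hypothetical helper: groups lines into chunks of $rows_per_chunk rows each.
# Assumes $rows_per_chunk is a positive integer.
sub chunk_rows {
    my ( $rows_per_chunk, @lines ) = @_;
    my @chunks = ('');
    my $count  = 0;

    for my $line (@lines) {
        $chunks[-1] .= $line;

        # \s* accepts zero or more whitespace characters before the tag.
        if ( $line =~ /^\s*<\/row>/ ) {
            $count++;
            push @chunks, '' if $count % $rows_per_chunk == 0;
        }
    }
    pop @chunks if $chunks[-1] eq '';    # drop an empty trailing chunk
    return @chunks;
}

my @row    = ( "<row>\n", "  <country>11</country>\n", " </row>\n" );
my @lines  = ( @row, @row, @row );
my @groups = chunk_rows( 2, @lines );
print scalar(@groups), " chunks\n";    # prints "2 chunks"
```

Each returned string could then be written to its own output file. For anything beyond simple line-oriented input like the sample, a real XML parser is the safer choice.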

Upvotes: 2
