Seki

Reputation: 11465

Buggy output from XML::Twig on a Tee

I am trying to split an XML file into multiple well-formed fragments. An old PerlMonks solution does almost exactly what I am looking for, using XML::Twig to write into an IO::Tee... at least with simple input data.

If I complicate the data structure slightly by grouping the nodes to filter under a parent node, the second file is no longer well formed: the parent node is missing its opening tag. I am at a loss to find the cause.

SSCCE (the only difference from the original example is the <thing_list> that contains the <thing> elements):

use XML::Twig;
use IO::Tee;
use feature 'say';

open my $frufile, '>', 'fruit.xml' or die "fruit $!";
open my $vegfile, '>', 'veg.xml' or die "veg $!";

my $tee = IO::Tee->new($frufile, $vegfile);
select $tee;

my $twig=XML::Twig->new(
    twig_handlers => {
        thing  => \&magic,
        _default_  => sub { 
            say STDOUT '_default_ for '.$_->name;
            $_[0]->flush($tee); #default filehandle = tee 
            1; 
        },
    },
    pretty_print => 'indented',
    empty_tags   => 'normal',
);

$twig->parse( *DATA );

sub magic {
    my ($thing, $element) = @_;
    say STDOUT "magic for ". $element->{att}{type};
    for ($element->{att}{type}) {
            if (/fruit/) {
                $thing->flush($frufile);
            } elsif (/vegetable/) {
                $thing->flush($vegfile);
            } else {
                $thing->purge;
            }
    }
    1;
}

__DATA__
<batch>
  <header>
    <foo>1</foo>
    <bar>2</bar>
    <baz>3</baz>
  </header>
  <thing_list>
  <thing type="fruit"     >Im an apple!</thing>
  <thing type="city"      >Toronto</thing>
  <thing type="vegetable" >Im a carrot!</thing>
  <thing type="city"      >Melrose</thing>
  <thing type="vegetable" >Im a potato!</thing>
  <thing type="fruit"     >Im a pear!</thing>
  <thing type="vegetable" >Im a pickle!</thing>
  <thing type="city"      >Patna</thing>
  <thing type="fruit"     >Im a banana!</thing>
  <thing type="vegetable" >Im an eggplant!</thing>
  <thing type="city"      >Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu</thing>
  </thing_list>
  <trailer>
    <chrzaszcz>A</chrzaszcz>
    <zdzblo>B</zdzblo>
  </trailer>
</batch>

The first file, fruit.xml, is fine:

<batch>
  <header>
    <foo>1</foo>
    <bar>2</bar>
    <baz>3</baz>
  </header>
  <thing_list>
    <thing type="fruit">Im an apple!</thing>
    <thing type="fruit">Im a pear!</thing>
    <thing type="fruit">Im a banana!</thing>
  </thing_list>
  <trailer>
    <chrzaszcz>A</chrzaszcz>
    <zdzblo>B</zdzblo>
  </trailer>
</batch>

but veg.xml is missing the opening tag for <thing_list>:

<batch>
  <header>
    <foo>1</foo>
    <bar>2</bar>
    <baz>3</baz>
  </header>
    <thing type="vegetable">Im a carrot!</thing>
    <thing type="vegetable">Im a potato!</thing>
    <thing type="vegetable">Im a pickle!</thing>
    <thing type="vegetable">Im an eggplant!</thing>
  </thing_list>
  <trailer>
    <chrzaszcz>A</chrzaszcz>
    <zdzblo>B</zdzblo>
  </trailer>
</batch>

I have also noticed that if I comment out the <thing_list> tags in the data, the comment corresponding to the opening tag is also missing from veg.xml, but not from fruit.xml...

My understanding is that the first comment is output while processing the first <thing>, and the second should be output by the _default_ handler while processing the rest of the file. But I do not understand whether the same applies when <thing_list> is not commented out.

FWIW, I am using Strawberry Perl 5.20.1 on a Windows 7 box.

Upvotes: 1

Views: 86

Answers (1)

ikegami

Reputation: 386656

Oh wow, I'm surprised that works as well as it does!

The first time you reach $thing->flush($frufile);, it prints everything before it that hasn't been flushed yet. If it weren't for your earlier attempt to fix this (the _default_ handler that flushes to the tee), it would have output:

<batch>
  <header>
    <foo>1</foo>
    <bar>2</bar>
    <baz>3</baz>
  </header>
  <thing_list>
    <thing type="fruit">Im an apple!</thing>

With your attempt, it prints

  <thing_list>
    <thing type="fruit">Im an apple!</thing>

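On subsequent calls to magic, <thing_list> and everything before it have already been flushed, so they are not printed again, not even when the flush goes to a different file handle. That is why veg.xml never receives the <thing_list> opening tag.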

Don't mix and match output handles! If you have two files to generate, process the template twice. (And get rid of that _default_ twig handler.)
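
For illustration, here is a minimal sketch of the "process it twice" approach. It is not taken from the question's code: filter_things is a hypothetical helper name, the sample data is shortened, and it assumes the document is small enough to hold in memory (each pass parses the whole input, deletes the <thing> elements that don't match, and prints what is left).

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: parse $xml from scratch and return the document
# with only the <thing> elements whose type attribute matches $keep.
sub filter_things {
    my ($xml, $keep) = @_;

    my $twig = XML::Twig->new(
        pretty_print => 'indented',
        empty_tags   => 'normal',
    );
    $twig->parse($xml);    # the whole document is held in memory

    for my $thing ($twig->root->descendants('thing')) {
        my $type = $thing->att('type') // '';
        $thing->delete unless $type =~ $keep;
    }
    return $twig->sprint;
}

# Shortened sample input, assumed for the example.
my $xml = <<'XML';
<batch>
  <header><foo>1</foo></header>
  <thing_list>
    <thing type="fruit">Im an apple!</thing>
    <thing type="city">Toronto</thing>
    <thing type="vegetable">Im a carrot!</thing>
  </thing_list>
  <trailer><zdzblo>B</zdzblo></trailer>
</batch>
XML

# One independent pass per output file: no shared handles, no tee.
my %outputs = ('fruit.xml' => qr/fruit/, 'veg.xml' => qr/vegetable/);
for my $file (sort keys %outputs) {
    open my $out, '>', $file or die "$file: $!";
    print {$out} filter_things($xml, $outputs{$file});
    close $out or die "$file: $!";
}
```

Because each output gets its own twig, there is no flushed state shared between the two files, so both keep the <thing_list> opening tag. For documents too large for memory this sketch won't do, and a streaming pass per file would be needed instead.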


That said, switching from twig_handlers to twig_roots (which is better for large documents anyway) appears to work:

my $twig = XML::Twig->new(
    twig_roots => {
        'thing_list/thing' => sub {
            my ($t, $element) = @_;
            for ($element->{att}{type}) {
                if (/fruit/) {
                    $t->flush($frufile);
                } elsif (/vegetable/) {
                    $t->flush($vegfile);
                } else {
                    $t->purge;
                }
            }
        },
    },
    twig_print_outside_roots => 1,
    pretty_print => 'indented',
    empty_tags => 'normal',
);

Use at your own risk :)

Upvotes: 3
