Reputation: 879

Parsing XML file in Perl - Retain sequence

The XML Structure is as below:

<Entities>
    <Entity>
        <EntityName>.... </EntityName>
        <EntityType>.... </EntityType>
        <Tables>
            <DataTables>
                <DataTable>1</DataTable>
                <DataTable>2</DataTable>
                <DataTable>3</DataTable>
                <DataTable>4</DataTable>
            </DataTables>
            <OtherTables>
                <OtherTable>5</OtherTable>
                <OtherTable>6</OtherTable>
            </OtherTables>
        </Tables>
    </Entity>
.
.
.
</Entities>

I need to parse the file based on the Entity name selected and retrieve all the tables specifically in the order mentioned. How do I do this in Perl and which module should be used?

Upvotes: 2

Answers (4)

Susheel Javadi

Reputation: 3084

My favourite module to parse XML in Perl is XML::Twig (tutorial).

Code Sample:

use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        #calls the get_tables method for each Entity element
        Entity    => sub {get_tables($_);},
    },
    pretty_print  => 'indented',                # output will be nicely formatted
    empty_tags    => 'html',                    # outputs <empty_tag />
    keep_encoding => 1,
);

$twig->parsefile($ARGV[0]);    # path to the XML file, taken from the command line
$twig->flush;

sub get_tables {
    my $entity = shift;

    my $tables = $entity->first_child("Tables");

    #Retrieves the DataTable sub-elements of DataTables, in document order
    my @data_tables = $tables->first_child("DataTables")->children("DataTable");
    #Do stuff with the DataTables

    #Retrieves the OtherTable sub-elements of OtherTables, in document order
    my @other_tables = $tables->first_child("OtherTables")->children("OtherTable");
    #Do stuff with the OtherTables

    #Flushes the XML element from memory
    $entity->purge;
}
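
To pick out a single entity by name and keep its tables in document order, the handler above could be narrowed along these lines. This is only a sketch under assumptions: tables_for is a hypothetical helper name, the sample XML is made up, and the name comparison assumes <EntityName> holds exactly the name with no surrounding whitespace.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample input with the same shape as the question's XML.
my $xml = <<'XML';
<Entities>
  <Entity>
    <EntityName>foo</EntityName>
    <Tables>
      <DataTables><DataTable>1</DataTable><DataTable>2</DataTable></DataTables>
      <OtherTables><OtherTable>3</OtherTable></OtherTables>
    </Tables>
  </Entity>
  <Entity>
    <EntityName>bar</EntityName>
    <Tables>
      <DataTables><DataTable>5</DataTable></DataTables>
      <OtherTables><OtherTable>7</OtherTable></OtherTables>
    </Tables>
  </Entity>
</Entities>
XML

# Hypothetical helper: returns the table texts of the named entity,
# in the order they appear in the document.
sub tables_for {
    my ($xml_text, $wanted) = @_;
    my @tables;

    my $twig = XML::Twig->new(
        twig_handlers => {
            Entity => sub {
                my ($t, $entity) = @_;
                my $tables = $entity->first_child('Tables');
                if ($tables && $entity->first_child_text('EntityName') eq $wanted) {
                    # children() walks elements in document order:
                    # first the containers, then the tables inside each one.
                    push @tables,
                        map { $_->text }
                        map { $_->children('#ELT') }
                        $tables->children('#ELT');
                }
                $t->purge;    # release what has been parsed so far
            },
        },
    );
    $twig->parse($xml_text);
    return @tables;
}

print "$_\n" for tables_for($xml, 'foo');
```

Because the containers and their children are walked in the order they were parsed, the DataTable entries come out before the OtherTable entries, as in the file.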

Upvotes: 8

reinierpost

Reputation: 8591

I prefer XML::LibXML, which allows you (and me) to use XPath to select elements.

You may wish to look at a script I wrote with it.
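
As a rough illustration of the XPath approach with XML::LibXML (not the script mentioned above), the sketch below selects one entity by name and returns its tables in document order. The helper name tables_for and the sample data are assumptions, and the entity name is interpolated into the XPath string, so it must not contain a double quote.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample input with the same shape as the question's XML.
my $xml = <<'XML';
<Entities>
  <Entity>
    <EntityName>foo</EntityName>
    <Tables>
      <DataTables><DataTable>1</DataTable><DataTable>2</DataTable></DataTables>
      <OtherTables><OtherTable>3</OtherTable></OtherTables>
    </Tables>
  </Entity>
  <Entity>
    <EntityName>bar</EntityName>
    <Tables>
      <DataTables><DataTable>5</DataTable></DataTables>
      <OtherTables><OtherTable>7</OtherTable></OtherTables>
    </Tables>
  </Entity>
</Entities>
XML

# Hypothetical helper: text of every DataTable/OtherTable under the
# named entity. A single location path yields nodes in document order.
sub tables_for {
    my ($doc, $name) = @_;
    # Assumption: $name contains no double quote character.
    my $path = qq{/Entities/Entity[EntityName="$name"]}
             . q{/Tables/*/*[self::DataTable or self::OtherTable]};
    return map { $_->textContent } $doc->findnodes($path);
}

my $doc = XML::LibXML->load_xml(string => $xml);
print join(",", tables_for($doc, 'foo')), "\n";
```

To read from a file instead of a string, load_xml also accepts location => $path.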

Upvotes: 0

Greg Bacon

Reputation: 139531

Document order is defined as

There is an ordering, document order, defined on all the nodes in the document corresponding to the order in which the first character of the XML representation of each node occurs in the XML representation of the document after expansion of general entities. Thus, the root node will be the first node. Element nodes occur before their children. Thus, document order orders element nodes in order of the occurrence of their start-tag in the XML (after expansion of entities).

In other words, the order in which things occur in the XML document. The XML::XPath module produces results in document order. For example:

#! /usr/bin/perl

use warnings;
use strict;

use XML::XPath;

my $entity_template = "/Entities"
                    . "/Entity"
                    .   "[EntityName='!!NAME!!']"
                    ;

my $tables_path = join "|" =>
                  qw( ./Tables/DataTables/DataTable
                      ./Tables/OtherTables/OtherTable );

my $xp = XML::XPath->new(ioref => *DATA);

foreach my $ename (qw/ foo bar /) {
  print "$ename:\n";
  (my $path = $entity_template) =~ s/!!NAME!!/$ename/g;
  foreach my $n ($xp->findnodes($path)) {
    foreach my $t ($xp->findnodes($tables_path, $n)) {
      print $t->toString, "\n";
    }
  }
}

__DATA__

The first expression searches for <Entity> elements that have an <EntityName> child whose string-value is the selected entity name. From each match, the second expression selects the <DataTable> and <OtherTable> elements.

Given input of

<Entities>
    <Entity>
        <EntityName>foo</EntityName>
        <EntityType>type1</EntityType>
        <Tables>
            <DataTables>
                <DataTable>1</DataTable>
                <DataTable>2</DataTable>
            </DataTables>
            <OtherTables>
                <OtherTable>3</OtherTable>
                <OtherTable>4</OtherTable>
            </OtherTables>
        </Tables>
    </Entity>
    <Entity>
        <EntityName>bar</EntityName>
        <EntityType>type2</EntityType>
        <Tables>
            <DataTables>
                <DataTable>5</DataTable>
                <DataTable>6</DataTable>
            </DataTables>
            <OtherTables>
                <OtherTable>7</OtherTable>
                <OtherTable>8</OtherTable>
            </OtherTables>
        </Tables>
    </Entity>
</Entities>

the output is

foo:
<DataTable>1</DataTable>
<DataTable>2</DataTable>
<OtherTable>3</OtherTable>
<OtherTable>4</OtherTable>
bar:
<DataTable>5</DataTable>
<DataTable>6</DataTable>
<OtherTable>7</OtherTable>
<OtherTable>8</OtherTable>

To extract the string-values (the “inner text”), change $tables_path to

my $tables_path = ". / Tables / DataTables  / DataTable  / text() |
                   . / Tables / OtherTables / OtherTable / text()";

Yes, that's repetitive—because XML::XPath implements XPath 1.0.

Output:

foo:
1
2
3
4
bar:
5
6
7
8

Upvotes: 2

Nikhil Jain

Reputation: 8342

See XML::Simple.

Before using it, keep the following points from its documentation in mind:

XML::Simple is able to present a simple API because it makes some assumptions on your behalf. These include:

You're not interested in text content consisting only of whitespace
You don't mind that when things get slurped into a hash the order is lost
You don't want fine-grained control of the formatting of generated XML
You would never use a hash key that was not a legal XML element name
You don't need help converting between different encodings

For event based parsing, use SAX (do not set out to write any new code for XML::Parser's handler API - it is obsolete).

For tree-based parsing, you could choose between the 'Perlish' approach of XML::Twig and more standards based DOM implementations - preferably one with XPath support.

source: XML-Simple

For more detail about Perl-XML, see Perl-XML

Upvotes: -1
