Ed Wall

Reputation: 1

Parsing XML file into fields

I need to read lines from an XML file and parse them into fields. A line is defined as text starting with a < and ending with />. It may be a single line or multiple lines separated by CR/LF. Here is a typical line:

<Label Name="lblIncidentTypeContent" Increasable="true" Left="140" Top="60"
 Width="146 SpeechField="IncidentType_V" TextAlign="MiddleLeft" WidthPixel="-180"
 WidthPercent="50" />

Once I've read the line, I then need to parse it into fields such as Name, Left, Width, etc. I then want to output a CSV with the data in a particular order. Then read the next line until EOF.

It's been a long time since I did Perl (or any other kind of) programming. Any help is welcome.

Upvotes: 0

Views: 377

Answers (2)

amon

Reputation: 57640

Don't view XML as line-based data, as it isn't. Rather, use a good XML parser, of which Perl has plenty.

Do not use XML::Simple!

Its own documentation discourages its use in new code:

The use of this module in new code is discouraged. Other modules are available which provide more straightforward and consistent interfaces. In particular, XML::LibXML is highly recommended.

The major problems with this module are the large number of options and the arbitrary ways in which these options interact - often with unexpected results.

So we're going to use the XML::LibXML module, which interfaces with the external libxml2 library from the GNOME project. This has the advantage that we can use XPath expressions to query our data. For reading from or writing to CSV, use the Text::CSV module.

use strict; use warnings;
use XML::LibXML;
use Text::CSV;

# load the data
my $data = XML::LibXML->load_xml(IO => \*STDIN) or die "Can't parse the XML";

# prepare CSV output:
my $csv = Text::CSV->new({ binary => 1, escape_char => "\\", eol => "\n" });
# Text::CSV doesn't like bareword filehandles
open my $output, '>&:utf8', STDOUT or die "Can't dup STDOUT: $!";

my @cols  = qw/ name left width /; # the column names in the CSV
my @attrs = qw/ Name Left Width /; # the corresponding attr names in the XML

# print the header
$csv->print($output, \@cols);

# extract data
for my $label ($data->findnodes('//Label')) {
  my @fields = map { $label->getAttribute($_) } @attrs;
  $csv->print($output, \@fields);
}

Test data (I took the liberty of closing the quote on the Width attribute value, which is unterminated in the question):

<foo>
  <Label Name="lblIncidentTypeContent" Increasable="true" Left="140" Top="60"
    Width="146" SpeechField="IncidentType_V" TextAlign="MiddleLeft" WidthPixel="-180"
    WidthPercent="50" />
  <Label Name="Another TypeContent" Increasable="true"
         Width="123"                SpeechField="IncidentType_V"
         Left="41,42"               Top="13"
         TextAlign="TopLeft"        WidthPixel="-180"
         WidthPercent="50"
  />
</foo>

Output:

name,left,width
lblIncidentTypeContent,140,146
"Another TypeContent","41,42",123
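
The reason the Width quote had to be fixed in the test data above is that a real XML parser rejects input that is not well-formed. Here is a minimal sketch of what that looks like, assuming XML::LibXML is installed; the shortened sample and the variable names are made up for illustration. load_xml throws an exception on malformed input, so it is trapped with eval:

```perl
use strict; use warnings;
use XML::LibXML;

# a shortened Label from the question, with the unterminated Width value left as-is
my $broken = <<'XML';
<foo>
  <Label Name="lblIncidentTypeContent" Left="140"
    Width="146 SpeechField="IncidentType_V" />
</foo>
XML

# load_xml dies on malformed XML, so trap the exception
my $doc = eval { XML::LibXML->load_xml(string => $broken) };
if (!defined $doc) {
  # $@ holds the parser's error, which includes the position of the problem
  print "Not well-formed XML: $@";
}
```

If your real file contains lines like that, it has to be repaired before any XML parser will accept it.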

Upvotes: 3

AlwaysLearning

Reputation: 796

Well, this being Perl you have several ways to do it:

  • Brute force. Slurp the file in, and track when you come across an opening < bracket. When you do, start collecting name/value pairs. When you see the closing bracket, stop. Not as easy as it sounds, because you have to handle possibly nested XML elements.
  • Slight force. Load the file using a basic library like XML::Simple and then spit it out in a format of your choosing using Data::Dumper. The former gives you a hash, and then you can play with the keys and values all you like.
  • Use an XML library. There are quite a few on CPAN, ranging from ones that are very close to the underlying libxml semantics to ones that are very abstract.
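
As a rough illustration of the third option, here is a minimal sketch that assumes XML::LibXML is installed; the sample data and variable names are made up for the example. It collects each Label's attributes into a hash, which gives you the name/value pairs from the first option without tracking brackets yourself:

```perl
use strict; use warnings;
use XML::LibXML;

# hypothetical sample data; in practice you would load your own file
my $xml = <<'XML';
<foo>
  <Label Name="lblA" Left="140" Top="60"
         Width="146" />
  <Label Name="lblB" Left="41,42" Width="123" />
</foo>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# build one hash per Label element, keyed by attribute name
my @labels;
for my $label ($doc->findnodes('//Label')) {
  my %attr = map { ($_->nodeName, $_->value) } $label->attributes;
  push @labels, \%attr;
}

# the parser has already dealt with line breaks and quoting
for my $l (@labels) {
  print join(', ', map { "$_=$l->{$_}" } sort keys %$l), "\n";
}
```

From there you can pick the keys you care about in whatever order you need.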

Upvotes: 1
