oljones

Reputation: 135

Parse RSS feed with Perl using XML::LibXML

I am in the unfortunate position of needing to parse an RSS feed, since there is no other way to obtain the data. I have a Perl script that has worked before to parse an XML file, and I figured I could modify it to do the same for the RSS feed so I can get the data into a format that is easier to use. With that in mind I have modified my script, but it doesn't seem to find any data to pull from the feed. Here is the core of the code.

foreach my $channel ($root->findnodes('channel')) {
  foreach my $item ($root->findnodes('item')) {
    my $guid = $item->findvalue('guid');
    my $title = $item->findvalue('title');
    my $link = $item->findvalue('link');
    my $description = $item->findvalue('description');
    my $pubdate = $item->findvalue('pubdate');
    print DATA "INSERT INTO events VALUES ( \"$guid\", \"$title\", \"$link\",\"$description\", \"$pubdate\" ); \n";
  }
}

Any ideas?

Upvotes: 0

Views: 1737

Answers (1)

Grant McLean

Reputation: 6998

Putting aside for one moment the excellent suggestion from Richard Simões to use XML::RSS ...

I think the main problem you're hitting is to do with XML namespaces. Consider this line of your script:

$root->findnodes('channel')

It is looking for an element named 'channel' that is in no namespace, but your source document probably doesn't contain such an element. What you should be looking for is something like: an element named 'channel' in the namespace identified by the URI 'http://purl.org/rss/1.0/'.

Working with namespaces is fiddly. There are two ways a document can declare one: as a default namespace (e.g.: xmlns="http://purl.org/rss/1.0/"), or with a prefix (e.g.: xmlns:rss="http://purl.org/rss/1.0/"). In either case, the only thing that matters is the namespace URI. The prefix declared in the document (e.g.: 'rss:') is irrelevant to your script.

To use namespaces with libxml, you need to declare your own prefix for each namespace URI and then use that prefix in your calls to findnodes. You can choose a prefix that is the same as the one in the document or different - it doesn't matter as long as the URI is the same. You need to use an XML::LibXML::XPathContext object to associate namespace URIs with prefixes and then route your queries through that context object.
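As a small illustration of the point that only the URI matters, here is a sketch using two made-up one-line documents (not your feed): one declares the RSS 1.0 namespace as the default, the other declares it with an 'x:' prefix. The helper name channel_title and the prefix 'feed' are my own choices for the example.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Two hypothetical documents: same namespace URI, different declaration styles.
my $default_ns = '<channel xmlns="http://purl.org/rss/1.0/"><title>A</title></channel>';
my $prefixed   = '<x:channel xmlns:x="http://purl.org/rss/1.0/"><x:title>B</x:title></x:channel>';

sub channel_title {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );
    my $xc  = XML::LibXML::XPathContext->new( $doc );
    # 'feed' is our own prefix; only the URI has to match the document.
    $xc->registerNs( feed => 'http://purl.org/rss/1.0/' );
    return $xc->findvalue('/feed:channel/feed:title');
}

print channel_title($default_ns), "\n";   # prints A
print channel_title($prefixed),   "\n";   # prints B

# An unprefixed query only matches elements in no namespace, so it finds nothing.
my @unprefixed = XML::LibXML->load_xml( string => $default_ns )->findnodes('/channel');
print scalar(@unprefixed), "\n";          # prints 0
```

The same query works against both documents because the lookup is done by URI, and the last query shows the symptom you are seeing.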

This is a version of your script that's probably closer to what you want.

#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;
use XML::LibXML::XPathContext;

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_file('slashdot.rss');
my $root   = $doc->documentElement();

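# Bind our own prefix 'rss' to the RSS 1.0 namespace URI.
my $xc     = XML::LibXML::XPathContext->new( $root );
$xc->registerNs( rss => 'http://purl.org/rss/1.0/' );

# Both queries are evaluated relative to $root, the context node of $xc.
foreach my $channel ($xc->findnodes('rss:channel')) {
    foreach my $item ($xc->findnodes('rss:item')) {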
        my $guid = $xc->findvalue('rss:guid', $item);
        my $title = $xc->findvalue('rss:title', $item);
        my $link = $xc->findvalue('rss:link', $item);
        my $description = $xc->findvalue('rss:description', $item);
        my $pubdate = $xc->findvalue('rss:pubDate', $item);
        print "INSERT INTO events VALUES ( \"$guid\", \"$title\", \"$link\",\"$description\", \"$pubdate\" ); \n";
    }
}

The document you're trying to parse probably uses a different version of RSS and therefore a different RSS namespace URI - that's just one of many reasons to use an RSS module rather than try to do it manually.
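If you want to check which namespace URIs your feed actually declares before deciding what to register, something along these lines may help. It is only a sketch: declared_namespaces is a hypothetical helper, and the two inline documents are cut-down stand-ins for an RSS 1.0 and an RSS 2.0 feed, not real data.

```perl
use strict;
use warnings;
use XML::LibXML;

# Returns the root element's name and a hash of prefix => URI for the
# namespace declarations found on the root element.
sub declared_namespaces {
    my ($xml) = @_;
    my $root = XML::LibXML->load_xml( string => $xml )->documentElement();
    my %uri_for;
    for my $ns ( $root->getNamespaces ) {
        my $prefix = $ns->declaredPrefix;
        # A default namespace declaration has no usable prefix.
        $prefix = '(default)' unless defined $prefix && length $prefix;
        $uri_for{$prefix} = $ns->declaredURI;
    }
    return ( $root->nodeName, \%uri_for );
}

# Hypothetical skeletons of the two common feed shapes.
my $rss10 = '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
          . ' xmlns="http://purl.org/rss/1.0/"><channel/></rdf:RDF>';
my $rss20 = '<rss version="2.0"><channel/></rss>';

for my $xml ( $rss10, $rss20 ) {
    my ( $name, $uri_for ) = declared_namespaces($xml);
    print "root element: $name\n";
    print "  $_ => $uri_for->{$_}\n" for sort keys %$uri_for;
}
```

If the output shows no RSS namespace at all, as with the second skeleton, the elements are in no namespace and unprefixed paths are the right ones to use.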

As ikegami pointed out, interpolating values into SQL is really a poor idea. In your example you are generating SQL with double-quoted string literals (you probably meant to use single quotes). This will fail if any of the values you extract from RSS contain a double quote character. Single and double quote characters are extremely likely to occur in RSS.
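One way to sidestep the quoting problem is to insert through DBI with placeholders instead of writing SQL text. This is a minimal sketch under assumptions of mine: an in-memory SQLite database (via DBD::SQLite) stands in for your real one, the column names are guessed from your INSERT, and the item values are invented.

```perl
use strict;
use warnings;
use DBI;

# Assumption: an in-memory SQLite database stands in for the real one.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '', { RaiseError => 1 } );
$dbh->do('CREATE TABLE events (guid TEXT, title TEXT, link TEXT, description TEXT, pubdate TEXT)');

# One placeholder per column; the driver handles quoting of the values.
my $sth = $dbh->prepare('INSERT INTO events VALUES (?, ?, ?, ?, ?)');

# A hypothetical item whose title contains both kinds of quote character.
$sth->execute( 'id-1', q{"Perl" isn't dead}, 'http://example.com/1', 'An example', '2012-01-01' );

# Read the row back to confirm the title survived intact.
my ($title) = $dbh->selectrow_array( 'SELECT title FROM events WHERE guid = ?', undef, 'id-1' );
print "$title\n";
```

In your loop you would prepare the statement once before the foreach and call execute with the five values for each item.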

Upvotes: 4
