user5934920

Get values from an HTML table with Perl

I am trying to get the values of specific td cells, identified by their id, from an existing HTML table. Can anyone help me with it?

The existing table's code is as below.

<table>
<tr><td class="key">FIRST NAME</td><td id="firstname" class="value">ALEXANDR</td></tr>
<tr><td class="key">SURNAME NAME</td><td id="surname" class="value">PUSHKIN</td></tr>
<tr><td class="key">EMAIL</td><td id="email" class="value">[email protected]</td></tr>
<tr><td class="key">TELEPHONE</td><td id="telephone" class="value">+991122334455</td></tr>
</table> 

I tried the Perl script below, but it does not work.

$pp = get("http://www.domain.com/something_something");
$out[0]="/home/.../public_html/perl_output.txt";
($firstname) = ($str =~ /<td id="firstname" class="value">(.+?)<\/firstname/);
($surname) = ($str =~ /<td id="surname" class="value">(.+?)<\/surname/);
($email) = ($str =~ /<td id="email" class="value">(.+?)<\/email/);
($telephone) = ($str =~ /<td id="telephone" class="value">(.+?)<\/telephone/);

print "First Name: $firstname \n";
print "Last Name: $surname \n";
print "Email: $email \n";
print "Telephone: $telephone \n";

exit;

Can anyone guide me?

Upvotes: 0

Views: 358

Answers (3)

Helmut Wollmersdorfer

Reputation: 451

First, you really should use an XML parser.

Now to some possible reasons why the code does not work:

Your regular expressions expect an ending tag, e.g. </firstname, which does not exist in your HTML.

If the HTML is plain and reliable and you really want a regex, it should look more like this:

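m/<td
  [^>]+    # whitespace and any attributes before the id (anything but '>')
  id="firstname"
  [^>]*    # any remaining attributes, possibly none (anything but '>')
  >
  ([^<]+?) # the cell text (anything but '<')
  <
/xms;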

This does not take into account the case insensitivity of HTML, the decoding of HTML entities, or other allowed quote characters.
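
As a rough, runnable sketch of the regex approach, the snippet below applies a pattern of the same shape to the sample table from the question. It assumes the markup is exactly as posted; the %value hash, the loop over ids, and the inline $html string are illustrative and not part of the original script. The email row is left out. Note that the question's script stores the page in $pp but matches against $str, which would also keep it from finding anything.

```perl
use strict;
use warnings;

# Sample markup copied from the question. In the real script this would be
# the string returned by get().
my $html = <<'HTML';
<table>
<tr><td class="key">FIRST NAME</td><td id="firstname" class="value">ALEXANDR</td></tr>
<tr><td class="key">SURNAME NAME</td><td id="surname" class="value">PUSHKIN</td></tr>
<tr><td class="key">TELEPHONE</td><td id="telephone" class="value">+991122334455</td></tr>
</table>
HTML

my %value;
for my $id (qw(firstname surname telephone)) {
    # Match against the same variable that holds the page; the capture is
    # the cell text up to the next '<'.
    ($value{$id}) = $html =~ m/<td [^>]*? \s id="\Q$id\E" [^>]* > ([^<]+) </xms;
}

printf "%-10s %s\n", $_, $value{$_} // '(not found)' for sort keys %value;
```

The same caveats apply as for the regex above: this only holds as long as the markup stays this simple.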

Upvotes: 0

simbabque

Reputation: 54381

This answer solves the problem described in the question, but not the actual problem OP has revealed in the comments.

Because Web::Scraper is for HTML documents, this is not going to work with the website that OP wants to scrape. It uses XML. See my other answer for a solution that deals with XML.


Don't try to parse HTML with regular expressions! Use an HTML parser instead.

For web scraping I prefer Web::Scraper. It does everything from fetching the page to parsing the content in a very simple DSL.

use strict;
use warnings;
use Web::Scraper;
use URI;
use Data::Dumper;

my $people = scraper {
    # this will parse all tables and put the results into the key people
    process 'table', 'people[]' => scraper {
        process '#firstname', first_name => 'TEXT'; # grab those ids
        process '#surname',   last_name  => 'TEXT'; # and put them into
        process '#email',     email      => 'TEXT'; # a hashref with the
        process '#telephone', phone      => 'TEXT'; # 2nd arg as key
    };
    result 'people'; # only return the people key
};
my $res = $people->scrape( URI->new("http://www.domain.com/something_something") );

print Dumper $res;

__DATA__
$VAR1 = [
  {
    first_name => 'ALEXANDR',
    last_name => 'PUSHKIN',
    email => '[email protected]',
    phone => '+991122334455',
  }
]
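
To get from that structure back to the plain lines the question wanted to print, something like the following sketch could work. It assumes the result is an array reference of hash references with the keys set in the scraper definition (first_name, last_name, phone); the literal $res here is only a stand-in for the value returned by scrape, and @lines is a hypothetical helper.

```perl
use strict;
use warnings;

# Stand-in for the value returned by $people->scrape(...). The shape is
# assumed: an array ref with one hash ref per table.
my $res = [
    { first_name => 'ALEXANDR', last_name => 'PUSHKIN', phone => '+991122334455' },
];

my @lines;
for my $person (@$res) {
    push @lines,
        "First Name: $person->{first_name}",
        "Last Name: $person->{last_name}",
        "Telephone: $person->{phone}";
}

print "$_\n" for @lines;
```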

If one of the fields, like email or firstname, occurs multiple times in one table, you can collect it into an array reference. In that case the document's HTML would not be valid, though, because ids must be unique. Use a different selector and pray it works.

 process '#email', 'email[]' => 'TEXT';

Now you'll get this kind of structure:

{
  email => [
   '[email protected]',
   '[email protected]',
  ],
}

Upvotes: 4

simbabque

Reputation: 54381

Since it turned out that the document is actually XML, here is a solution that uses an XML parser to deal with it and also takes multiple fields into account. XML::Twig is very useful for this, and it even lets us download the document.

use strict;
use warnings;
use XML::Twig;
use Data::Printer;

my @docs; # we will save the docs here
my $twig = XML::Twig->new(
    twig_handlers => {
        'oai_dc:dc' => sub {
            my ($t, $elt) = @_;

            my $foo = {
                # grab all elements of type 'dc:author' inside our
                # element and call text_only on them
                author => [ map { $_->text_only } $elt->descendants('dc:author') ],
                email => [ map { $_->text_only } $elt->descendants('dc:email') ],
            };

            push @docs, $foo;
        }
    }
);

$twig->parseurl("http://ejeps.com/index.php/ejeps/oai?verb=ListRecords&metadataPrefix=oai_dc");

p @docs;

__END__

[
    [0]  {
        author   [
            [0] "Nazila Isgandarova"
        ],
        email    [
            [0] "[email protected]"
        ]
    },
    [1]  {
        author   [
            [0] "Mette Nordahl Grosen",
            [1] "Bezen Balamir Coskun"
        ],
        email    [
            [0] "[email protected]",
            [1] "[email protected]"
        ]
    },
# ...
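
For trying the handler without hitting the network, a sketch along these lines may help. It feeds a small inline document to parse instead of calling parseurl. The sample XML, its records wrapper element, and the names and address in it are made up for illustration; only the oai_dc:dc, dc:author and dc:email element names are taken from the handler above.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical two-record sample; the real feed has more elements.
my $xml = <<'XML';
<records xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <oai_dc:dc>
    <dc:author>Author One</dc:author>
    <dc:email>one@example.org</dc:email>
  </oai_dc:dc>
  <oai_dc:dc>
    <dc:author>Author Two</dc:author>
    <dc:author>Author Three</dc:author>
  </oai_dc:dc>
</records>
XML

my @docs;
my $twig = XML::Twig->new(
    twig_handlers => {
        'oai_dc:dc' => sub {
            my ($t, $elt) = @_;
            # One hash per record; each field is an array ref because a
            # record can hold several authors or emails, or none.
            push @docs, {
                author => [ map { $_->text_only } $elt->descendants('dc:author') ],
                email  => [ map { $_->text_only } $elt->descendants('dc:email') ],
            };
        },
    },
);

$twig->parse($xml);    # parse a string instead of fetching a URL

for my $doc (@docs) {
    printf "%s <%s>\n",
        join(', ', @{ $doc->{author} }),
        join(', ', @{ $doc->{email} });
}
```

A record without any dc:email elements simply ends up with an empty email array, so code that consumes @docs should not assume the two lists have the same length.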

Upvotes: 1
