I am trying to extract the values of specific td cells from an existing HTML table. Can anyone help me with it?
The existing table's code is below.
<table>
<tr><td class="key">FIRST NAME</td><td id="firstname" class="value">ALEXANDR</td></tr>
<tr><td class="key">SURNAME NAME</td><td id="surname" class="value">PUSHKIN</td></tr>
<tr><td class="key">EMAIL</td><td id="email" class="value">[email protected]</td></tr>
<tr><td class="key">TELEPHONE</td><td id="telephone" class="value">+991122334455</td></tr>
</table>
I tried the Perl script below, but it does not work.
$pp = get("http://www.domain.com/something_something");
$out[0]="/home/.../public_html/perl_output.txt";
($firstname) = ($str =~ /<td id="firstname" class="value">(.+?)<\/firstname/);
($surname) = ($str =~ /<td id="surname" class="value">(.+?)<\/surname/);
($email) = ($str =~ /<td id="email" class="value">(.+?)<\/email/);
($telephone) = ($str =~ /<td id="telephone" class="value">(.+?)<\/telephone/);
print "First Name: $firstname \n";
print "Last Name: $surname \n";
print "Email: $email \n";
print "Telephone: $telephone \n";
exit;
Can anyone guide me?
Upvotes: 0
Views: 358
Reputation: 451
First, you really should use an XML parser.
Now to some possible reasons why the code does not work:
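Your regular expressions expect a closing tag such as </firstname, which does not exist in your HTML; the cells end with </td>. Also, the page is fetched into $pp, but the regular expressions are applied to $str.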
If the HTML is plain and reliable and you really want a regex, it should look more like this:
m/<td
    [^>]+       # anything but '>'
    id="firstname"
    [^>]+       # anything but '>'
    >
    ([^<]+?)    # anything but '<'
    <
/xms;
This does not take into account the case insensitivity of HTML, the decoding of HTML entities, or other allowed quote characters.
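To show how such a pattern is bound to a string and how the capture is read back, here is a minimal sketch using only core Perl. The helper name cell_text is hypothetical, the sample row is copied from the question, and two details differ from the pattern above as assumptions of mine: the attribute runs use * instead of + so the id may be the only attribute after the tag name, and /i is added for case insensitivity. It still does not decode HTML entities.

```perl
use strict;
use warnings;

# Sample row copied from the question; in a real script this string
# would be the fetched page content.
my $html = '<tr><td class="key">FIRST NAME</td>'
         . '<td id="firstname" class="value">ALEXANDR</td></tr>';

# Hypothetical helper: return the text of the <td> with the given id,
# or undef if no such cell is found.
sub cell_text {
    my ($doc, $id) = @_;
    my ($text) = $doc =~ m/<td
        [^>]*            # any attributes before the id
        \bid="\Q$id\E"   # the id we are looking for, quoted literally
        [^>]*            # any attributes after the id
        >
        ([^<]+)          # cell text: anything but '<'
        </xmsi;
    return $text;
}

my $firstname = cell_text($html, 'firstname');
print "First Name: $firstname\n";
```

Because the match is in list context, a failed match leaves the variable undefined, so check it with defined before printing in real code.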
Upvotes: 0
Reputation: 54381
Because Web::Scraper is for HTML documents, this is not going to work with the website that OP wants to scrape. It uses XML. See my other answer for a solution that deals with XML.
Don't try to parse HTML with regular expressions! Use an HTML parser instead.
For web scraping I prefer Web::Scraper. It does everything from fetching the page to parsing the content in a very simple DSL.
use strict;
use warnings;
use Web::Scraper;
use URI;
use Data::Dumper;
my $people = scraper {
    # this will parse all tables and put the results into the key people
    process 'table', 'people[]' => scraper {
        process '#firstname', first_name => 'TEXT'; # grab those ids
        process '#surname',   last_name  => 'TEXT'; # and put them into
        process '#email',     email      => 'TEXT'; # a hashref with the
        process '#telephone', phone      => 'TEXT'; # 2nd arg as key
    };
    result 'people'; # only return the people key
};
my $res = $people->scrape( URI->new("http://www.domain.com/something_something") );
print Dumper $res;
__DATA__
$VAR1 = [
    {
        first_name => 'ALEXANDR',
        last_name => 'PUSHKIN',
        email => '[email protected]',
        phone => '+991122334455',
    }
]
If one of the fields, like email or firstname, occurs multiple times in one table, you can use an array reference for it. In that case the document's HTML would not be valid because of the duplicate ids, though. Use a different selector and pray it works.
process '#email', 'email[]' => 'TEXT';
Now you'll get this kind of structure:
{
    email => [
        '[email protected]',
        '[email protected]',
    ],
}
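Reading that structure back takes one dereference. The following is only a sketch with core Perl: the hashref is a hand-written stand-in for one scraped entry, and both addresses are placeholders of mine, not values from the real page.

```perl
use strict;
use warnings;

# Hypothetical result, shaped like the structure above; the addresses
# are placeholders, not values from the real page.
my $person = {
    email => [ 'first@example.com', 'second@example.com' ],
};

# The 'email[]' key holds an array reference, so dereference it before
# use. The "// []" guard (Perl 5.10+) covers a missing key.
my @emails = @{ $person->{email} // [] };

print "Email: $_\n" for @emails;
print "Found ", scalar(@emails), " addresses\n";
```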
Upvotes: 4
Reputation: 54381
Since it turned out that the document is actually XML, here is a solution that uses an XML parser to deal with it and also takes multiple fields into account. XML::Twig is very useful for this, and it even lets us download the document.
use strict;
use warnings;
use XML::Twig;
use Data::Printer;
my @docs; # we will save the docs here
my $twig = XML::Twig->new(
    twig_handlers => {
        'oai_dc:dc' => sub {
            my ($t, $elt) = @_;
            my $foo = {
                # grab all elements of type 'dc:author' inside our
                # element and call text_only on them
                author => [ map { $_->text_only } $elt->descendants('dc:author') ],
                email  => [ map { $_->text_only } $elt->descendants('dc:email') ],
            };
            push @docs, $foo;
        }
    }
);
$twig->parseurl("http://ejeps.com/index.php/ejeps/oai?verb=ListRecords&metadataPrefix=oai_dc");
p @docs;
__END__
[
    [0] {
        author [
            [0] "Nazila Isgandarova"
        ],
        email [
            [0] "[email protected]"
        ]
    },
    [1] {
        author [
            [0] "Mette Nordahl Grosen",
            [1] "Bezen Balamir Coskun"
        ],
        email [
            [0] "[email protected]",
            [1] "[email protected]"
        ]
    },
    # ...
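Once @docs is filled, it is an ordinary array of hashrefs. Here is a rough sketch of walking it with core Perl only. The entry below is a hand-written copy of one record from the output above with placeholder email addresses, and matching the n-th email to the n-th author is an assumption of mine that the feed may not guarantee.

```perl
use strict;
use warnings;

# Hypothetical copy of one @docs entry; the names are from the output
# above, the email addresses are placeholders.
my @docs = (
    {
        author => [ 'Mette Nordahl Grosen', 'Bezen Balamir Coskun' ],
        email  => [ 'first@example.com', 'second@example.com' ],
    },
);

my @lines;
for my $doc (@docs) {
    my @authors = @{ $doc->{author} };
    my @emails  = @{ $doc->{email} };

    # Assumption: the n-th email belongs to the n-th author.
    for my $i (0 .. $#authors) {
        my $email = defined $emails[$i] ? $emails[$i] : '(no email)';
        push @lines, "$authors[$i] <$email>";
    }
}

print "$_\n" for @lines;
```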
Upvotes: 1