Martin

Reputation: 40573

What language/tool should I use for HTML parsing?

I have a couple of websites that I want to extract data from and, based on previous experience, this isn't as easy as it sounds. Why? Simply because the HTML pages I have to parse aren't properly formatted (missing closing tags, etc.).

Considering that I have no constraints regarding the technology, language or tool that I can use, what are your suggestions for easily parsing and extracting data from HTML pages? I have tried HTML Agility Pack and BeautifulSoup, and even these tools aren't perfect (HTML Agility Pack is buggy, and BeautifulSoup's parsing engine doesn't work with the pages I am passing to it).

Upvotes: 6

Views: 4753

Answers (6)

Boris Ivanov

Reputation: 4254

Any language that can work with HTML at the DOM level is a good choice.

For Perl, that is the HTML::TreeBuilder module.

Upvotes: 0

cuneytykaya

Reputation: 581

Java, with the open-source library jsoup, would be a good solution for you.

Upvotes: 2

Stewart Robinson

Reputation: 3539

I think hpricot (linked by Colin Pickard) is ace. Add scrubyt to the mix and you get a great HTML scraping and browsing interface with the text-matching power of Ruby: http://scrubyt.org/

Here is some example code from http://github.com/scrubber/scrubyt_examples/blob/7a219b58a67138da046aa7c1e221988a9e96c30e/twitter.rb:

require 'rubygems'
require 'scrubyt'

# Simple example for scraping basic
# information from a public Twitter
# account.

# Scrubyt.logger = Scrubyt::Logger.new

twitter_data = Scrubyt::Extractor.define do
  fetch 'http://www.twitter.com/scobleizer'

  profile_info '//ul[@class="about vcard entry-author"]' do
    full_name "//li//span[@class='fn']"
    location "//li//span[@class='adr']"
    website "//li//a[@class='url']/@href"
    bio "//li//span[@class='bio']"
  end
end

puts twitter_data.to_xml

Upvotes: 2

Ionuț G. Stan

Reputation: 179119

You could try PHP's DOMDocument class, which has a couple of methods for loading HTML content; it is what I usually use. My advice is to prepend a DOCTYPE declaration to the HTML if it doesn't have one, and to inspect in Firebug the HTML that results after parsing. In some cases, where invalid markup is encountered, DOMDocument rearranges the HTML elements a bit. Also, be aware that if the source contains a meta tag specifying the charset, libxml will use it internally when parsing the markup. Here's a small example:

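// Fetch the raw markup (example.com is a placeholder URL).
$html = file_get_contents('http://example.com');

$dom = new DOMDocument;
// Collect libxml warnings about invalid markup internally instead of emitting them.
$oldValue = libxml_use_internal_errors(true);
$dom->loadHTML($html);
// Restore the previous error-handling setting.
libxml_use_internal_errors($oldValue);

// Print the markup as DOMDocument rearranged it.
echo $dom->saveHTML();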

Upvotes: 0

Colin Pickard

Reputation: 46653

hpricot may be what you are looking for.

Upvotes: 0

cletus

Reputation: 625087

You can use pretty much any language you like; just don't try to parse HTML with regular expressions.

So let me rephrase that and say: you can use any language you like that has an HTML parser, which is pretty much everything invented in the last 15-20 years.

If you're having issues with particular pages I suggest you look into repairing them with HTML Tidy.
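To illustrate the parser-over-regex point, here is a minimal Python sketch that uses only the standard library's html.parser module. The class name LinkCollector, the helper extract_links, and the sample markup are all invented for this example, and the "an unclosed link ends at the next link" rule is just one assumption about how to recover from missing closing tags. Treat it as a rough starting point, not a replacement for a full parser or for HTML Tidy.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs from <a> tags, tolerating missing end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Assumption: a new <a> implicitly closes an unclosed one.
            self._flush()
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def _flush(self):
        # Store the anchor being tracked (if any) and reset the state.
        if self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
        self._href = None
        self._text = []

    def close(self):
        super().close()
        self._flush()        # keep an anchor left open at end of input


def extract_links(html):
    """Return a list of (href, text) pairs found in the given markup."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical malformed input: the first <a> and both <li> are never closed.
broken = '<ul><li><a href="/one">First<li><a href="/two">Second</a></ul><p>tail'
print(extract_links(broken))  # prints [('/one', 'First'), ('/two', 'Second')]
```

The parser here never raises on the unclosed tags; it simply reports the start tags, end tags and text it sees, which leaves the recovery policy up to you. A regular expression would have to encode that policy in the pattern itself, which is where such approaches usually fall apart.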

Upvotes: 5
