Find repeating patterns in web pages in Ruby

I am trying to find a way to detect repeating patterns in web pages so that I can extract the content into my database.

EDIT: I don't know what the repeating pattern is beforehand, so I can't just search for a given pattern with a regex or something similar.

For example, say you have 10 sites selling cars, and the sites are all different. On any one site, the cars are listed in the HTML in a repeated way down the page.

The other sites list their cars in a different way, but each uses its own repeated pattern.

Does anyone know how, or have any experience of this sort of thing?

I love Ruby, so I was hoping to do it in Ruby. Does anyone have or know of any libs / gems that may help me out?

Upvotes: 5

Answers (2)

Lee Hambley

Reputation: 6370

Rick, machine pattern matching is a complicated topic, and not something you'll find a good out-of-the-box library for in Ruby.

Kyle's answer is a start. Once you have fetched the page with Ruby, the typical technology for this is XPath, "The XML Path Language".

Using XPath you can write simple selectors that extract every item matching a pattern. For instance, every link in an HTML document is //a, every h1 is //h1, and every image directly inside a div, where the image has the class "car", is something like //div/img[@class="car"].
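As a minimal sketch of those three selectors in action: the HTML below is a made-up sample, and the code uses REXML (shipped with standard Ruby installs) only so that it runs without installing a gem. Note that in standard XPath the HTML image element is img and attributes are tested with @, so the last selector is written //div/img[@class="car"]. REXML needs well-formed markup, which real pages often are not; with Nokogiri the same expressions would be passed to doc.xpath.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample page (invented for this example).
html = <<~HTML
  <html>
    <body>
      <h1>Cars for sale</h1>
      <div><img class="car" src="/img/1.jpg"/></div>
      <div><img class="logo" src="/img/logo.png"/></div>
      <a href="/cars/1">First car</a>
      <a href="/cars/2">Second car</a>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# Each call returns an Array of the matching elements.
links    = REXML::XPath.match(doc, '//a')
headings = REXML::XPath.match(doc, '//h1')
car_imgs = REXML::XPath.match(doc, '//div/img[@class="car"]')

puts links.map { |a| a.attributes['href'] }.inspect
puts headings.map(&:text).inspect
puts car_imgs.map { |img| img.attributes['src'] }.inspect
```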

The result of an XPath query is an enumerable list of the matching items. You can then query each item for sub-elements, get the content of the elements, and build relationships to extract the data you need.

The go-to library for Ruby is called Nokogiri, and it is available as a gem - the direct documentation is a little weak, but it's all covered there if you know what to look for.

Some Ruby libraries combine crawling with an easy way to access the underlying HTML/XML as a Nokogiri document. One such example is Anemone, a "framework for building web spiders in Ruby", and I can recommend it very highly.
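A rough sketch of that second step, walking the matched items and querying each one for sub-elements: the markup, the class names "car" and "price", and the hash keys are all invented for illustration. It uses REXML from the standard Ruby distribution so it runs as-is; with Nokogiri the equivalent would be calling xpath or at_xpath on each node.

```ruby
require 'rexml/document'

# Hypothetical listing markup (invented for this example).
html = <<~HTML
  <ul>
    <li class="car"><h2>Ford Focus</h2><span class="price">4500</span></li>
    <li class="car"><h2>Mini Cooper</h2><span class="price">6200</span></li>
  </ul>
HTML

doc = REXML::Document.new(html)

# For every repeated item, run relative XPath queries to pull out its fields.
cars = REXML::XPath.match(doc, '//li[@class="car"]').map do |item|
  {
    name:  REXML::XPath.first(item, 'h2').text,
    price: REXML::XPath.first(item, 'span[@class="price"]').text.to_i
  }
end

cars.each { |car| puts "#{car[:name]}: #{car[:price]}" }
```

Each hash could then be written to a database row.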
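For the case in the question, where the pattern is not known in advance, one crude heuristic (my own illustration, not something a library gives you) is to give every element a "signature" built from its ancestor tag names and class attributes, then count how often each signature occurs. Signatures that repeat are candidates for the listing items. The helper names signature and repeated_signatures and the sample markup are hypothetical, and REXML is used only so the sketch runs without extra gems. Real pages would need a tolerant parser such as Nokogiri and a smarter comparison.

```ruby
require 'rexml/document'

# Build a signature such as "html/body/ul/li.car" from an element and its ancestors.
# The full class attribute is used as-is, so "car featured" differs from "car".
def signature(element)
  parts = []
  node = element
  until node.nil? || node.is_a?(REXML::Document)
    css_class = node.attributes['class']
    parts.unshift(css_class ? "#{node.name}.#{css_class}" : node.name)
    node = node.parent
  end
  parts.join('/')
end

# Count each signature and keep those that occur at least min_count times.
def repeated_signatures(doc, min_count: 2)
  counts = Hash.new(0)
  REXML::XPath.each(doc, '//*') { |el| counts[signature(el)] += 1 }
  counts.select { |_, n| n >= min_count }
end

# Hypothetical sample page (invented for this example).
html = <<~HTML
  <html><body>
    <div class="nav"><a href="/">Home</a></div>
    <ul>
      <li class="car"><h2>Ford Focus</h2></li>
      <li class="car"><h2>Mini Cooper</h2></li>
      <li class="car"><h2>Fiat 500</h2></li>
    </ul>
  </body></html>
HTML

doc = REXML::Document.new(html)
repeats = repeated_signatures(doc)
repeats.each { |sig, n| puts "#{n} x #{sig}" }
```

The shortest repeated signature is a reasonable first guess at the item container, and its children are the fields to extract.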

Upvotes: 2

Kyle Sletten

Reputation: 5413

In Ruby, if you want to get the text of a webpage, all you have to do is use the Net::HTTP class from the standard library. Its get method returns the body of the page as a string.

require 'net/http'
Net::HTTP.get('www.target-site.com', '/target-page.html') # host name without "http://", then the path

You're probably going to want to use some sort of HTML/XML parser after that to build a model of the page and navigate over it. I've heard good things about Hpricot.

Upvotes: -1
