RaySF

Reputation: 1319

How to make a parser for a web crawler maintainable

I wrote a Ruby web-crawler that retrieves data from a third-party website. I am using Nokogiri to extract information based on a specific CSS div and specific fields (accessing children and elements of the nodes I extract).

From time to time, the structure of the third-party website changes which breaks the crawler (element[1].children[2] might need to be changed to element[2].children[0]).

So far, I have a utility that prints the structure of the nodes I extract, which lets me fix the parser quickly when the structure changes. I also have an automated check that verifies the crawler can still extract "some" values.
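For reference, a minimal sketch of that kind of debugging utility (the helper name, file path and selector are illustrative assumptions, not the actual code):

require 'nokogiri'

# Hypothetical helper: dump the tag, class and id of a node and its
# descendants so index-based lookups can be re-mapped after a layout change.
def print_structure(node, depth = 0)
  puts "#{'  ' * depth}#{node.name} class=#{node['class'].inspect} id=#{node['id'].inspect}"
  node.element_children.each { |child| print_structure(child, depth + 1) }
end

doc = Nokogiri::HTML(File.read('page.html'))  # placeholder input
print_structure(doc.at_css('div.listing'))    # placeholder selector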

I would like to know if there is a more elegant way to deal with this issue. How would one write a crawler that is easy to maintain?

Upvotes: 0

Views: 99

Answers (2)

David Grayson

Reputation: 87486

You should rely on the data and metadata of the web page to find the elements you care about, rather than on element index numbers the way you are doing now.

The "class" and "id" attributes are a good way to do it. Nokogiri has XPath features that should make it easy to select elements based on those. If that is not possible, you could try looking at the content of the page around the element, e.g. if you are looking for a weight and you know it is in a table, you could search for strings ending with "kg". It's hard to give super-specific tips without seeing the document you are parsing.

I also recommend that your crawler check the data it is retrieving and raise an exception (or show a warning) if the data looks wrong.
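For example, a simple sanity check might look like this (the selector and the expected price format are assumptions):

require 'nokogiri'

doc = Nokogiri::HTML(File.read('page.html'))  # placeholder input

price_text = doc.at_css('.price')&.text.to_s.strip

# Fail loudly if the extracted value does not look like a price, so a
# silent layout change doesn't produce bad data.
unless price_text.match?(/\A\$?\d+(\.\d{2})?\z/)
  raise "Price looks wrong: #{price_text.inspect} -- page structure may have changed"
end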

Upvotes: 1

pguardiario

Reputation: 55002

Use CSS. For example, the price of a product will almost always be:

page.at('#price, .price').text

The site can change layout (theme) and this will still work.
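In context, a usage sketch (the URL is purely illustrative, and Open-URI is just one way to fetch the page):

require 'nokogiri'
require 'open-uri'

page  = Nokogiri::HTML(URI.open('https://example.com/product/123'))  # placeholder URL
price = page.at('#price, .price')&.text&.strip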

Upvotes: 1
