RaySF

Reputation: 1319

How to make a parser for a web crawler maintainable

I wrote a Ruby web-crawler that retrieves data from a third-party website. I am using Nokogiri to extract information based on a specific CSS div and specific fields (accessing children and elements of the nodes I extract).

From time to time, the structure of the third-party website changes which breaks the crawler (element[1].children[2] might need to be changed to element[2].children[0]).

So far, I have a utility that prints the structure of the nodes I extract, which lets me fix the parser quickly when the structure changes. I also have an automated check that verifies the crawler can still extract "some" values.
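For reference, a minimal sketch of that kind of debugging utility (the helper name, file path and selector are illustrative assumptions, not the actual code):

require 'nokogiri'

# Hypothetical helper: dump the tag, class and id of a node and its
# descendants so index-based lookups can be re-mapped after a layout change.
def print_structure(node, depth = 0)
  puts "#{'  ' * depth}#{node.name} class=#{node['class'].inspect} id=#{node['id'].inspect}"
  node.element_children.each { |child| print_structure(child, depth + 1) }
end

doc = Nokogiri::HTML(File.read('page.html'))  # placeholder input
print_structure(doc.at_css('div.listing'))    # placeholder selector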

I would like to know if there is a more elegant way to deal with this issue. How would one write a crawler that is easy to maintain?

Upvotes: 0

Views: 99

Answers (2)

David Grayson

Reputation: 87486

You should rely on the data and metadata of the web page to find the elements you care about, rather than on element index numbers the way you are doing now.

The "class" and "id" attributes are a good way to do it. Nokogiri has XPath features that should make it easy to select elements based on those. If that is not possible, you could try looking at the content of the page around the element, e.g. if you are looking for a weight and you know it is in a table, you could search for strings ending with "kg". It's hard to give super-specific tips without seeing the document you are parsing.

I also recommend that your crawler check the data it is retrieving and raise an exception (or show a warning) if the data looks wrong.
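For example, a simple sanity check might look like this (the selector and the expected price format are assumptions):

require 'nokogiri'

doc = Nokogiri::HTML(File.read('page.html'))  # placeholder input

price_text = doc.at_css('.price')&.text.to_s.strip

# Fail loudly if the extracted value does not look like a price, so a
# silent layout change doesn't produce bad data.
unless price_text.match?(/\A\$?\d+(\.\d{2})?\z/)
  raise "Price looks wrong: #{price_text.inspect} -- page structure may have changed"
end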

Upvotes: 1

pguardiario

Reputation: 55002

Use CSS. For example, the price of a product will almost always be:

page.at('#price, .price').text

The site can change layout (theme) and this will still work.
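In context, a usage sketch (the URL is purely illustrative, and Open-URI is just one way to fetch the page):

require 'nokogiri'
require 'open-uri'

page  = Nokogiri::HTML(URI.open('https://example.com/product/123'))  # placeholder URL
price = page.at('#price, .price')&.text&.strip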

Upvotes: 1
