Kevin Eder

Reputation: 309

Using Natural Language Processing to parse websites

I'm generally interested in data mining by crawling websites, but I've never been able to find much documentation on the process I'd really like to implement. I'm very keen on the idea of writing a base set of rules that define how to parse a page, then training the tool when it makes mistakes.

Let's say I want to parse menus from restaurant websites. I'd like to create a tool that would allow me to write a set of rules that show generally where the menu items + prices are. Then, I could run the tool and tell it which menu items it parsed out correctly, and which ones were wrong. The tool would then "learn" from these corrections, and the next time I run it, I'd get better results.
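To make the "base set of rules" idea concrete, here is a minimal sketch of what a hand-written rule could look like: a regex that treats any text node containing a price as a candidate menu item. It assumes the `requests` and `BeautifulSoup` libraries are installed; the rule itself and the function names are hypothetical, just an illustration of the first (pre-learning) stage.

```python
import re
import requests
from bs4 import BeautifulSoup

# Hypothetical base rule: a menu line is any text snippet that ends in a
# price token such as "$12.95".
PRICE_RULE = re.compile(r"(?P<item>[A-Za-z][A-Za-z '&-]+)\s+\$(?P<price>\d+(?:\.\d{2})?)")

def extract_candidates(url):
    """Apply the hand-written rule to every text snippet on the page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    candidates = []
    for text in soup.stripped_strings:
        match = PRICE_RULE.search(text)
        if match:
            candidates.append((match.group("item").strip(), match.group("price")))
    return candidates

# A human would then mark each candidate as correct or incorrect, and those
# labels become the training data for the "learning" step described above.
```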

I've looked a bit at the NLTK toolkit, and it has me wondering whether an NLP tool like NLTK is the best way to solve this problem. Can anyone point me in the right direction for finding books and (ideally) libraries that can help me get started? Is NLP the way to go? Thanks!

Upvotes: 3

Views: 983

Answers (1)

Fred Foo

Reputation: 363817

I'm very keen on the idea of writing a base set of rules that define how to parse a page

What exactly do you mean by "parsing a page"? Parsing the sentences in a page? Doing structured information extraction?

The tool would then "learn" from these corrections, and the next time I run it, I'd get better results.

This is the problem of active learning, which is pretty advanced stuff. You'll need a machine learning toolkit; which one depends on what exactly you want to do: build parse trees or extract salient information. NLTK has some stochastic parser support, I believe.
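The answer doesn't name a specific toolkit, but a minimal sketch of one active-learning round might look like the following, here using scikit-learn (an assumption, not something the answer prescribes). The features and data are invented: each candidate extraction is represented by a few toy features, the model is trained on whatever the human has labeled so far, and it then asks about the candidate it is least sure of (uncertainty sampling).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature vectors for candidate extractions (e.g., "contains a price token",
# "sits inside a list element", "word count") -- purely illustrative.
X_pool = np.array([[1, 1, 3], [1, 0, 8], [0, 1, 2], [1, 1, 4], [0, 0, 12]], dtype=float)
y_known = {0: 1, 4: 0}  # indices a human has already labeled: 1 = real menu item

def active_learning_round(X_pool, y_known):
    """Train on the labels we have, then pick the unlabeled candidate the
    model is least certain about to show the human next."""
    labeled = list(y_known)
    clf = LogisticRegression().fit(X_pool[labeled], [y_known[i] for i in labeled])
    unlabeled = [i for i in range(len(X_pool)) if i not in y_known]
    probs = clf.predict_proba(X_pool[unlabeled])[:, 1]
    # The candidate with probability closest to 0.5 is the most informative query.
    query = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    return clf, query

clf, query_idx = active_learning_round(X_pool, y_known)
print("Ask the human to label candidate", query_idx)
```

Repeating this loop (label the queried candidate, retrain, query again) is what gives the "gets better each run" behavior the question describes.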

Upvotes: 2
