Kevin Eder

Reputation: 309

Using Natural Language Processing to parse websites

I'm generally interested in data mining by crawling websites, but I've never been able to find much documentation on the process I'd really like to implement. I'm very keen on the idea of writing a base set of rules that define how to parse a page, then training the tool when it makes mistakes.

Let's say I want to parse menus from restaurant websites. I'd like to create a tool that would allow me to write a set of rules that show generally where the menu items + prices are. Then, I could run the tool and tell it which menu items it parsed out correctly, and which ones were wrong. The tool would then "learn" from these corrections, and the next time I run it, I'd get better results.
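To make the "base set of rules" idea concrete, here is a minimal sketch of what a hand-written rule could look like: a regex that treats any text node containing a price as a candidate menu item. It assumes the `requests` and `BeautifulSoup` libraries are installed; the rule itself and the function names are hypothetical, just an illustration of the first (pre-learning) stage.

```python
import re
import requests
from bs4 import BeautifulSoup

# Hypothetical base rule: a menu line is any text snippet that ends in a
# price token such as "$12.95".
PRICE_RULE = re.compile(r"(?P<item>[A-Za-z][A-Za-z '&-]+)\s+\$(?P<price>\d+(?:\.\d{2})?)")

def extract_candidates(url):
    """Apply the hand-written rule to every text snippet on the page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    candidates = []
    for text in soup.stripped_strings:
        match = PRICE_RULE.search(text)
        if match:
            candidates.append((match.group("item").strip(), match.group("price")))
    return candidates

# A human would then mark each candidate as correct or incorrect, and those
# labels become the training data for the "learning" step described above.
```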

I've looked a bit at the NLTK toolkit, and it has me wondering whether an NLP tool like NLTK is the best way to solve this problem. Can anyone point me in the right direction for finding books and (ideally) libraries that can help me get started? Is NLP the way to go? Thanks!

Upvotes: 3

Views: 983

Answers (1)

Fred Foo

Reputation: 363817

I'm very keen on the idea of writing a base set of rules that define how to parse a page

What exactly do you mean by "parsing a page"? Parsing the sentences in a page? Doing structured information extraction?

The tool would then "learn" from these corrections, and the next time I run it, I'd get better results.

This is the problem of active learning, which is pretty advanced stuff. You'll need a machine learning toolkit; which one depends on what exactly you want to do: build parse trees or extract salient information. NLTK has some stochastic parser support, I believe.
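The answer doesn't name a specific toolkit, but a minimal sketch of one active-learning round might look like the following, here using scikit-learn (an assumption, not something the answer prescribes). The features and data are invented: each candidate extraction is represented by a few toy features, the model is trained on whatever the human has labeled so far, and it then asks about the candidate it is least sure of (uncertainty sampling).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature vectors for candidate extractions (e.g., "contains a price token",
# "sits inside a list element", "word count") -- purely illustrative.
X_pool = np.array([[1, 1, 3], [1, 0, 8], [0, 1, 2], [1, 1, 4], [0, 0, 12]], dtype=float)
y_known = {0: 1, 4: 0}  # indices a human has already labeled: 1 = real menu item

def active_learning_round(X_pool, y_known):
    """Train on the labels we have, then pick the unlabeled candidate the
    model is least certain about to show the human next."""
    labeled = list(y_known)
    clf = LogisticRegression().fit(X_pool[labeled], [y_known[i] for i in labeled])
    unlabeled = [i for i in range(len(X_pool)) if i not in y_known]
    probs = clf.predict_proba(X_pool[unlabeled])[:, 1]
    # The candidate with probability closest to 0.5 is the most informative query.
    query = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    return clf, query

clf, query_idx = active_learning_round(X_pool, y_known)
print("Ask the human to label candidate", query_idx)
```

Repeating this loop (label the queried candidate, retrain, query again) is what gives the "gets better each run" behavior the question describes.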

Upvotes: 2
