Reputation: 614
I need to scrape some webpages and extract content from them. I'm planning to select some specific keywords and map the data that has some relationship between them. But I have no idea how I could do that. Could anyone suggest some algorithms for doing it?
For example, I need to download some webpages about apples, map the relevant data about apples to that keyword, and store it in a database so that, if someone needs specific information about apples, I can provide it quickly and accurately.
It would also be helpful if you could point out useful libraries. I'm planning to do this in Python.
Upvotes: 2
Views: 1172
Reputation: 17134
Have a look at the NLTK, Pattern, or Orange modules.
As a start, "Programming Collective Intelligence: Building Smart Web 2.0 Applications" by Toby Segaran is a good book to read.
Upvotes: 1
Reputation: 16525
You could try algorithms based on term frequency–inverse document frequency (TF-IDF). In Java I would recommend Solr. Actually, you could use Solr anyway and access it from Python; see here
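To give a rough idea of what TF-IDF does, here is a minimal sketch in plain Python using only the standard library. It is not how Solr implements scoring; it just uses the basic textbook weighting (term frequency times the log of inverse document frequency). The `tokenize` and `tf_idf` helpers and the `pages` texts are made up for illustration and stand in for whatever text you extract from your scraped pages.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Count in how many documents each term appears at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty pages
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical page texts standing in for scraped content.
pages = [
    "Apples are a sweet fruit. Apples grow on trees.",
    "Oranges are a citrus fruit.",
    "Trees need water and sunlight.",
]

scores = tf_idf(pages)

# Terms of the first page, highest score first.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)[:3]
print(top_terms)
```

Terms that are frequent in one page but rare across the others score highest, so they are reasonable candidates for the keywords you store in the database for that page. For real data you would probably also want stop-word removal and stemming, which libraries such as NLTK provide.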
Upvotes: 1