Reputation: 1070
I need to recognize the main content of a page, similar to what http://www.alchemyapi.com/api/text/ does. (I need the result as HTML, so I can't use this API.)
What logic can I use to accomplish this? (The programming language does not matter.)
Here is what I did (with good results), though it still needs a lot more fixes...
Upvotes: 0
Views: 304
Reputation: 187
You need a parser to navigate the DOM. Among the NuGet packages you can find some helpful parser tools, like this
Upvotes: 0
Reputation: 100050
Look into the Boilerpipe library. It is a comprehensive solution.
With Boilerpipe, you can specify HTML as the output format, so you get the main content (the article) while still preserving its HTML tags.
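If you want to see the general idea before pulling in a library, here is a rough sketch in Python using only the standard library. To be clear, this is not Boilerpipe's algorithm or API; it is a much simpler heuristic that I am assuming for illustration: build a small DOM tree, score each element by the amount of non-link text in its direct `<p>` children, and return the best-scoring element as HTML. All names here (`Node`, `TreeBuilder`, `extract_main_html`, and so on) are made up for this example.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose contents are never article text.
SKIP_TAGS = {"script", "style"}


class Node:
    """Hypothetical minimal DOM node: children are Node objects or strings."""
    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag          # None marks the synthetic root
        self.attrs = list(attrs)
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML, tolerating unclosed or stray tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node(None)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def text_length(node, skip_links=False):
    """Count visible characters under node, optionally ignoring <a> text."""
    total = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child.strip())
        elif child.tag not in SKIP_TAGS and not (skip_links and child.tag == "a"):
            total += text_length(child, skip_links)
    return total


def walk(node):
    """Yield node and all of its descendant elements."""
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def score(node):
    """Assumed heuristic: non-link text held in the node's direct <p> children."""
    return sum(text_length(c, skip_links=True)
               for c in node.children
               if isinstance(c, Node) and c.tag == "p")


def to_html(node):
    """Serialize a node back to HTML, dropping script and style elements."""
    if isinstance(node, str):
        return escape(node, quote=False)
    if node.tag in SKIP_TAGS:
        return ""
    inner = "".join(to_html(c) for c in node.children)
    if node.tag is None:
        return inner
    attrs = "".join(f" {k}" if v is None else f' {k}="{escape(v)}"'
                    for k, v in node.attrs)
    if node.tag in VOID_TAGS:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def extract_main_html(html):
    """Return the HTML of the best-scoring element, or '' if none has text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(walk(builder.root), key=score)
    return to_html(best) if score(best) > 0 else ""
```

Calling `extract_main_html(page_source)` gives you the chosen container with its tags intact. Real pages will need more than this (class/id hints, nested containers, pages that do not use `<p>` at all), which is the kind of thing a dedicated library handles for you.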
Upvotes: 3