Yacov

Reputation: 1070

Extract content from a page

I need to extract the main content from a page - something like what http://www.alchemyapi.com/api/text/ does (but I need to keep the HTML, so I can't use this API).

What logic can I use to accomplish this? (The programming language doesn't matter.)

Here is what I did (with good results) - it still needs a lot of fixes:

  1. Find the largest block of text in the page that is not split by breaking tags - ignore inline tags (span, b, etc.)
  2. Go up one level and count the breaking tags (br, p, div, etc.)
  3. Go up another level and count the tags again
  4. Compare the tag count from step 2 with the count from step 3
  5. If the counts differ by a lot, stop here - otherwise go back to step 3
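The steps above can be sketched roughly as follows. This is a minimal illustration, not a robust implementation: it assumes well-formed markup (Python's stdlib `ElementTree` cannot parse real-world HTML), the `INLINE` and `BREAKING` tag sets are incomplete examples, and the stopping threshold of 3 is an arbitrary guess you would have to tune.

```python
# Sketch of the heuristic: find the element with the most "unbroken" text,
# then climb the tree until the parent adds many more breaking tags.
# Assumes well-formed markup; tag sets and the threshold are illustrative.
import xml.etree.ElementTree as ET

INLINE = {"span", "b", "i", "em", "strong", "a", "u", "small"}
BREAKING = {"br", "p", "div", "h1", "h2", "h3", "ul", "li", "table"}

def text_len(el):
    # Step 1: length of text directly inside el, looking through inline tags only.
    total = len(el.text or "")
    for child in el:
        if child.tag in INLINE:
            total += text_len(child)
        total += len(child.tail or "")
    return total

def breaking_count(el):
    # Steps 2-3: number of breaking tags at or below el.
    return sum(1 for d in el.iter() if d.tag in BREAKING)

def main_content(root):
    parents = {c: p for p in root.iter() for c in p}
    # Start at the element holding the most text not split by breaking tags.
    node = max(root.iter(), key=text_len)
    while node in parents:
        parent = parents[node]
        # Steps 4-5: stop climbing once the parent adds many more breaking tags.
        if breaking_count(parent) > breaking_count(node) + 3:  # threshold is a guess
            break
        node = parent
    return node

html = """<html><body><div id="menu"><p>a</p><p>b</p><p>c</p><p>d</p><p>e</p></div>
<div id="article"><p>This is the long main text of the article, the block
with the most characters that are not split by breaking tags.</p></div></body></html>"""
article = main_content(ET.fromstring(html))
print(article.get("id"))
```

On this toy page the climb stops at `div#article` because its parent (`body`) also contains the menu's five extra `<p>` tags, which pushes the breaking-tag count past the threshold.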

Upvotes: 0

Views: 304

Answers (3)

Alexander.It

Reputation: 187

You need a parser to navigate the DOM; among the NuGet packages you can find some helpful parsing tools like this

Upvotes: 0

bmargulies

Reputation: 100050

Look for the Boilerpipe library. It is a comprehensive solution.

Using the Boilerpipe library, you can specify the output as HTML, so you get the main content (the article) while still preserving its HTML tags.

Upvotes: 3

Vinay

Reputation: 759

Another good alternative would be to use Goose.

It extracts more fields (published date, author, main image in the article, and a few more) than Boilerpipe does (title, content).

Upvotes: 2
