Reputation: 61
Looking for a way to detect the actual content part of a web page and remove its header, footer, and navigation, similar to how Amazon's "Send to Kindle" add-on for Firefox works. The solution can be either client-side (JavaScript) or server-side. I understand that it can't be 100% reliable, but I was wondering if there's a library or algorithm somebody has already used for this type of problem.
Upvotes: 0
Views: 325
Reputation: 2415
Either check which <div> tag has the most content (really unreliable), or make a list of all the class names and ids that major sites use to mark their main content and save them in a database. A couple thousand rows should be enough. Then parse each page using the DOM and check which of those class names is present.
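A rough server-side sketch of both ideas, using only Python's standard `html.parser`. The names in `KNOWN_CONTENT_NAMES` are made-up placeholders, not a surveyed list, and `DivCollector` and `extract_main_content` are hypothetical helpers. Text is credited only to the innermost enclosing <div>, so an outer wrapper does not automatically win the "most content" check; the flip side is that text in nested divs is not returned, which a real implementation would have to handle.

```python
from html.parser import HTMLParser

# Hypothetical sample; a real list would come from surveying actual sites.
KNOWN_CONTENT_NAMES = {"content", "main", "article", "post", "entry-content"}


class DivCollector(HTMLParser):
    """Collect every <div> with its id, classes, and the text it directly owns."""

    def __init__(self):
        super().__init__()
        self.open_divs = []  # stack of divs that are currently open
        self.divs = []       # every div seen, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attr = dict(attrs)
        div = {
            "id": attr.get("id") or "",
            "classes": (attr.get("class") or "").split(),
            "text": [],
        }
        self.open_divs.append(div)
        self.divs.append(div)

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()

    def handle_data(self, data):
        # Credit text to the innermost open div only.
        if self.open_divs and data.strip():
            self.open_divs[-1]["text"].append(data.strip())


def extract_main_content(html, known_names=KNOWN_CONTENT_NAMES):
    parser = DivCollector()
    parser.feed(html)
    parser.close()
    if not parser.divs:
        return ""
    # First choice: a div whose id or class is on the known list.
    for div in parser.divs:
        if div["id"] in known_names or known_names.intersection(div["classes"]):
            return " ".join(div["text"])
    # Fallback (unreliable): the div that directly owns the most text.
    best = max(parser.divs, key=lambda d: sum(len(t) for t in d["text"]))
    return " ".join(best["text"])
```

Passing an empty set as `known_names` forces the text-length fallback, which is handy for comparing the two heuristics on the same page.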
This might not be the fastest solution, but you could speed it up by mapping certain sites to the class names you know they use.
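The per-site mapping could look something like the sketch below. Both tables are invented for illustration (the hostnames and class names are not taken from real sites), and `candidate_names` is a hypothetical helper. It returns the names to try for a URL, with any site-specific names first and the generic list as a fallback.

```python
from urllib.parse import urlparse

# Hypothetical data: neither table is surveyed from real sites.
GENERIC_NAMES = ["content", "main", "article", "post"]
SITE_NAMES = {
    "news.example.com": ["story-body"],
    "blog.example.org": ["entry-content"],
}


def candidate_names(url):
    """Return class/id names to try for this URL, site-specific ones first."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return SITE_NAMES.get(host, []) + GENERIC_NAMES
```

For a known site the first lookup usually hits, so the longer generic list is only scanned for pages from unmapped sites.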
EDIT: You will still have to refine your algorithm. For example:
Upvotes: 1