Vladimir

Reputation: 61

Detect the actual content in a web page (ignore header, footer, navigation etc.)

I'm looking for a way to detect the actual content part of a web page and remove its header, footer and navigation, similar to the way Amazon's "Send to Kindle" add-on for Firefox works. The solution can be either client-side (JavaScript) or server-side. I understand that no solution can be 100% reliable, but I was wondering whether there is a library or algorithm that somebody has already used for this type of problem.

Upvotes: 0

Views: 325

Answers (1)

LuigiEdlCarno

Reputation: 2415

Either check which <div> tag has the most content (really unreliable), or make a list of all the class names and ids that major sites use to mark up their main content and save them in a database. You should be able to manage with a couple of thousand rows. Then parse each page using the DOM and check which of those class names is present.

This might not be the fastest solution, but you can speed it up by mapping certain sites to the class names you know they use.
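
A minimal server-side sketch of the class-name lookup, using only Python's standard library. Everything named here is an assumption made for illustration: `KNOWN_NAMES` stands in for the database table, `SITE_HINTS` for the per-site mapping, and the class names, the host name and the helpers `BlockCollector` and `extract_main_text` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative values only; a real table would hold far more entries.
KNOWN_NAMES = {"content", "main-content", "article-body", "post-body"}
SITE_HINTS = {"blog.example.com": {"entry"}}  # hypothetical per-site mapping

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
CONTAINER_TAGS = {"div", "article", "main", "section"}


class BlockCollector(HTMLParser):
    """Collects every container element with its class/id names and text."""

    def __init__(self):
        super().__init__()
        self.open_nodes = []  # stack of elements that are currently open
        self.blocks = []      # container elements, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        attributes = dict(attrs)
        names = set((attributes.get("class") or "").split())
        if attributes.get("id"):
            names.add(attributes["id"])
        node = {"tag": tag, "names": names, "chunks": []}
        self.open_nodes.append(node)
        if tag in CONTAINER_TAGS:
            self.blocks.append(node)

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise unwind to the matching element.
        if any(node["tag"] == tag for node in self.open_nodes):
            while self.open_nodes.pop()["tag"] != tag:
                pass

    def handle_data(self, data):
        if any(node["tag"] in ("script", "style") for node in self.open_nodes):
            return  # skip code and stylesheets
        for node in self.open_nodes:
            node["chunks"].append(data)


def extract_main_text(html, url=""):
    """Return the text of the first block with a known name, else None."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    # Names mapped to this site are tried before the general list.
    site_names = SITE_HINTS.get(urlparse(url).hostname or "", set())
    for wanted in (site_names, KNOWN_NAMES):
        for block in collector.blocks:
            if block["names"] & wanted:
                return " ".join("".join(block["chunks"]).split())
    return None
```

Calling `extract_main_text(page_html, page_url)` returns the whitespace-normalized text of the first matching block in document order, or `None` when no stored name appears on the page.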

EDIT: You will still have to refine your algorithm. For example:

  • How do you handle several of those stored class names being present?
  • What do you do if none is present (show the whole page? show only the biggest div?)
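
One possible policy for those two cases, sketched in Python under stated assumptions: the page has already been parsed into blocks, each carrying its class/id names and its text. The `Block` type, the `choose_content` helper and the `min_chars` threshold of 200 are all hypothetical choices, not an established rule.

```python
from typing import NamedTuple, Optional


class Block(NamedTuple):
    names: frozenset  # class names and id found on the element
    text: str         # text contained in the element


def choose_content(blocks, known_names, min_chars=200) -> Optional[Block]:
    """Pick one block; None means 'show the whole page'.

    `known_names` must be a set (or frozenset) of stored class names and ids.
    """
    matches = [block for block in blocks if block.names & known_names]
    if matches:
        # Several stored names present: keep the match holding the most text.
        return max(matches, key=lambda block: len(block.text))
    if blocks:
        # No stored name present: fall back to the biggest block, but only
        # if it is long enough to plausibly be the content.
        biggest = max(blocks, key=lambda block: len(block.text))
        if len(biggest.text) >= min_chars:
            return biggest
    return None
```

The biggest-block fallback inherits the unreliability mentioned above: an outer wrapper that also contains the navigation will usually hold the most text, so the blocks passed in may need to be limited to leaf-level containers first.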

Upvotes: 1
