Intelligently Detect Duplicate Content Using PHP

Question

I have built a web scraper that takes a website or RSS feed, parses its contents, extracts the relevant information and saves it to a database. This is a personal experiment to see whether I can build an intelligent, anonymous web scraper. It has no real purpose beyond seeing how far I can take it, and I plan to open source the code afterwards so others can learn from it.

The problem is that I am currently scraping 3 news websites. When news breaks, there is a high chance (especially with a big story) that all 3 websites will publish their own interpretation of it, but ultimately it's the same news.

I have been trying to come up with a solution that detects, as best it can, when an incoming article covers a story that has already been imported from another news website, and perhaps associates the links with that story (other sites also wrote about this: link1, link2).

Is there a tried and tested way of detecting whether two or more pieces of content are effectively the same? I've written some pseudo-code, but unfortunately I'm not a skilled enough developer to turn it into something that works.

Here is my thinking:

A link to a website is parsed
Generic words are stripped out and keywords left in (company names, countries, etc)
The remaining words are then counted and a score is calculated

That's where my thinking hits a roadblock. How do I efficiently create a snapshot of a page and then compare it to the content I've already imported into my database? The steps above are how I think it needs to be done.
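One possible shape for the steps above, as a rough sketch rather than a tested solution: strip generic words, count what is left, and compare two articles by the cosine similarity of their keyword counts. The stop-word list, the function names `keywordCounts` and `cosineSimilarity`, and the idea of storing the counts as the page "snapshot" are all assumptions made for illustration.

```php
<?php
// Hypothetical stop-word list; a real one would be much longer.
const STOP_WORDS = ['the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'on', 'for',
    'is', 'are', 'was', 'has', 'have', 'its', 'it', 'by', 'with', 'at', 'that', 'this'];

/** Lowercase the text, split it into words, drop stop words, count the rest. */
function keywordCounts(string $text): array
{
    // strtolower() only lowercases ASCII; mb_strtolower() would suit other languages better.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $keywords = array_diff($words, STOP_WORDS);
    return array_count_values($keywords);
}

/** Cosine similarity between two keyword-count arrays, from 0.0 to 1.0. */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0;
    foreach ($a as $term => $count) {
        $dot += $count * ($b[$term] ?? 0);
    }
    $normA = sqrt(array_sum(array_map(fn($c) => $c * $c, $a)));
    $normB = sqrt(array_sum(array_map(fn($c) => $c * $c, $b)));
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // an empty article has nothing to compare
    }
    return $dot / ($normA * $normB);
}
```

The array returned by `keywordCounts()` could be stored next to each imported article (for example as JSON) and compared against new arrivals. Any cut-off score for "same story" would have to be tuned against real articles.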

Perhaps I am over-thinking this and I merely need to check if articles have similar titles?
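A title check would be cheap to try first. The sketch below uses PHP's built-in `similar_text()` on normalised titles. The helper names `normaliseTitle` and `titleSimilarity` are made up for this example, and titles written independently by different sites may still score low, so this is probably only a first filter.

```php
<?php
/** Normalise a title: lowercase, replace punctuation with spaces, collapse whitespace. */
function normaliseTitle(string $title): string
{
    $title = strtolower($title);
    $title = preg_replace('/[^\p{L}\p{N}\s]+/u', ' ', $title);
    return trim(preg_replace('/\s+/u', ' ', $title));
}

/** Percentage similarity (0 to 100) between two titles, via similar_text(). */
function titleSimilarity(string $a, string $b): float
{
    // similar_text() writes the percentage into its third argument.
    similar_text(normaliseTitle($a), normaliseTitle($b), $percent);
    return $percent;
}
```

Two titles that differ only in case and punctuation come out at 100 here. The score at which two titles count as "the same story" is a guess that needs testing on real headlines.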

Anathema.Imbued · Accepted Answer

My approach would be to analyse the individual scrape results from a single website and strip out credentials and other items that are common across its stories.

Now, from what is left, build a profile of each news story. How? By giving weighted priority to the terms found in each individual story. For example, give extra weight to non-dictionary terms (which would be company names and individual names) and to city and region names. Then match these non-dictionary terms against each other across stories, and do the same with technical terms.
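To make the profiling idea concrete, here is a minimal sketch under some loud assumptions: `DICTIONARY` is a tiny stand-in for a real word list, the weight of 3.0 for non-dictionary terms is arbitrary, and `buildProfile` and `profileOverlap` are hypothetical names, not part of any library.

```php
<?php
// Hypothetical stand-in for a real dictionary word list (for example one loaded from a file).
const DICTIONARY = ['company', 'buys', 'rival', 'deal', 'in', 'a', 'the', 'of',
    'record', 'takeover', 'announces', 'city'];

/** Build a term => weight profile; terms missing from the dictionary weigh more. */
function buildProfile(string $text, float $nameWeight = 3.0): array
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $profile = [];
    foreach (array_unique($words) as $word) {
        // Non-dictionary terms are treated as likely names (companies, people, places).
        $profile[$word] = in_array($word, DICTIONARY, true) ? 1.0 : $nameWeight;
    }
    return $profile;
}

/** Weighted overlap: weight of shared terms divided by weight of all terms. */
function profileOverlap(array $a, array $b): float
{
    $shared = array_sum(array_intersect_key($a, $b));
    $total = array_sum($a + $b); // array union keeps one weight per term
    return $total > 0 ? $shared / $total : 0.0;
}
```

Because shared names count three times as much as shared ordinary words, two stories mentioning the same company score higher than two stories that merely share common vocabulary.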

In my experience, matching non-dictionary terms like these would solve at least 50% of your problem. Beyond that, it's all about building the profile.
