Berny

Reputation: 155

Comparison of texts

I have a database with 500+ articles, and every 5 minutes a PHP script checks XML files for news. I need to ignore articles that I already have. I also need to check the similarity of news items, because some people just rewrite them. For example:

One will write: "Hello, my name is John! How are you?"
Second will write: "Hello! How are you? My name is John!"

It isn't a good example, but it illustrates my problem. For comparing texts I plan to use the shingles algorithm. But what is the best way to do this? I don't think checking every article from the XML against the whole database every time is a good idea.

Upvotes: 2

Views: 148

Answers (1)

Olaf Dietsche

Reputation: 74078

Since you have only 500+ articles, checking every 5 minutes shouldn't be a problem.
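
To make the brute-force check concrete, here is a minimal sketch of word shingling with Jaccard similarity, written in Python rather than PHP for brevity. The function names, the shingle size `k=2`, and the `0.5` threshold are assumptions chosen for illustration, not something from the question; tune them against your own articles.

```python
import re


def shingles(text, k=2):
    """Return the set of k-word shingles, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_similar(new_text, stored_texts, threshold=0.5, k=2):
    """Return indexes of stored texts whose similarity reaches the threshold."""
    new_set = shingles(new_text, k)
    return [
        i for i, stored in enumerate(stored_texts)
        if jaccard(new_set, shingles(stored, k)) >= threshold
    ]


first = "Hello, my name is John! How are you?"
second = "Hello! How are you? My name is John!"
score = jaccard(shingles(first), shingles(second))
matches = find_similar(first, [second, "Completely different story about weather"])
```

With two-word shingles, the two sentences from the question share 5 of 9 distinct shingles, so `score` is 5/9 and only the first stored text is reported as similar. Comparing one new article against 500 stored ones this way is only 500 set intersections per article.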

If you want to improve this regardless, you could add another table with two columns (an md5 or sha1 hash, and the text source) and store the source where you retrieved the text plus a hash of the text. When you check new articles, you can compare against the stored hashes instead to see whether you have already seen an article.
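
A rough sketch of that lookup, again in Python, with an in-memory dict standing in for the extra table. The class and method names are hypothetical, and in a real setup the hash would be a uniquely indexed column in the database. Note that a hash of the text only matches articles that are byte-for-byte identical, so it covers the "already have it" case, not rewritten articles.

```python
import hashlib


def text_hash(text):
    """SHA-1 hex digest of the article text (md5 would work the same way)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class SeenArticles:
    """Stand-in for the (hash, source) table; assumed name, not a real API."""

    def __init__(self):
        self._source_by_hash = {}

    def add_if_new(self, text, source):
        """Record the article and return True, or return False if already seen."""
        digest = text_hash(text)
        if digest in self._source_by_hash:
            return False
        self._source_by_hash[digest] = source
        return True


seen = SeenArticles()
first_time = seen.add_if_new("Some news text", "feed-a.xml")
second_time = seen.add_if_new("Some news text", "feed-b.xml")
```

Here `first_time` is `True` and `second_time` is `False`, because the second feed delivered text with the same hash.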

Upvotes: 2
