Teoman shipahi

Reputation: 23122

Best practice to keep RSS feeds unique in sql database

I am working on a project that shows RSS feeds from different sites. Every 3 hours my program fetches the feeds and inserts the items into a SQL database. I want the records for each provider to be unique so that duplicate content is not shown.

But the problem is that some providers do not supply a GUID field, others supply a GUID but no pubDate, and some supply neither GUID nor pubDate, just a title and link.

So what would be the best way to keep RSS feed items unique in SQL Server?

Should I check the GUID first, then the pubDate, then the link, then the title? Is it good practice to compare link fields in SQL to check uniqueness?

Thanks.

Upvotes: 3

Views: 537

Answers (2)

PirateApp

Reputation: 6230

Currently, this is what I am doing:

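import hashlib

# If we have a GUID in the feed item, use it as the feed_item_id else use link
# http://www.詹姆斯.com/blog/2006/08/rss-dup-detection
def build_feed_item_id(entry, encoding='utf-8'):  # UTF-8 assumed as the default encoding
    # Strip surrounding whitespace before testing for an empty GUID
    guid = entry.get('id', '').strip()
    if guid:
        feed_item_id = guid
    else:
        feed_item_id = entry.get('link', '').strip()
    # Hash so every ID has the same fixed length, whatever its source
    return hashlib.md5(feed_item_id.encode(encoding)).hexdigest()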

It is based on the reasoning in the blog post linked in the snippet, which I'll quote here in case the post gets taken down:

RSS 2.0 has a guid element that fits the bill perfectly, but it’s not a required element and many feeds don’t use it.

I can’t say for sure what algorithms applications are using, but after running 150 tests on more than 20 different aggregators, I think I have a fair idea how many of them work.

As you would expect, for most the guid is considered the key element for determining duplicates. This is pretty straightforward. If two items have the same guid they are considered duplicates; if their guids differ then they are considered different.

If a feed doesn’t contain guids, though, aggregators will most likely resort to one of three general strategies – all of which involve the link element in some way.

Technique 1

  • Guid must be unique
  • If a post doesn't have a guid, use link, title, description, or any combination of them to get a unique hash

Technique 2

  • Link must be unique
  • If both link and guid are missing, check other elements such as title or description

Technique 3

  • Combination of link + title or link + description must be unique

The most obvious recommendation is that you should always include guids in your feeds.

In addition, I would recommend you also include a unique link element for each item in your feed, to allow for aggregators that don’t handle guids very well. No two items should ever have the same link element, and ideally a link should never change (if you do update a link, be aware that it could show up as a new item for some aggregators).

Finally, although this is not essential, it is advisable that you refrain from updating your article titles if at all possible. There are at least two aggregators that will consider an entry with an altered title to be a completely new post – somewhat annoying to readers when all you’ve done is make a spelling correction in your title.
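
For illustration, the fallback order in the techniques above could be sketched in Python roughly like this. It assumes each entry is a plain dict with optional 'id', 'link', 'title' and 'description' keys (as in the snippet earlier in this answer). The dedup_key name and the 'guid:' / 'link:' / 'text:' prefixes are made up for the example, and SHA-256 is used where the snippet used MD5, though either would do here.

```python
import hashlib


def dedup_key(entry):
    """Return a stable hex key for a feed entry (a plain dict in this sketch)."""

    def clean(field):
        # Treat a missing or None field as empty and ignore stray whitespace.
        return (entry.get(field) or "").strip()

    guid, link = clean("id"), clean("link")
    if guid:
        # Preferred: the guid alone identifies the item.
        basis = "guid:" + guid
    elif link:
        # No guid: fall back to the link.
        basis = "link:" + link
    else:
        # Last resort (hypothetical choice): title plus description.
        basis = "text:" + clean("title") + "\n" + clean("description")
    # The prefix keeps a guid from colliding with an identical link string.
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()
```

With this ordering, a corrected title only changes the key when the item has neither a guid nor a link.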

Upvotes: 0

Menefee

Reputation: 1495

I would develop a routine that takes certain key parameters like the title, source and body and then combines them to create a CRC hash. Then store the hash as an attribute of the feed item and check for a matching hash before adding a new item.

I'm not sure what your environment constraints are, but here is an example of calculating CRC-32 in C#: http://damieng.com/blog/2006/08/08/calculating_crc32_in_c_and_net
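
As a rough Python sketch of the same idea (not the linked C# code): zlib.crc32 from the standard library computes the CRC-32, and a UNIQUE column lets the database reject a repeated hash. The feed_items table, the content_crc and add_if_new names, and the in-memory SQLite database standing in for SQL Server are all assumptions for the example.

```python
import sqlite3
import zlib


def content_crc(title, source, body):
    """CRC-32 of the key fields, joined with a separator unlikely to occur in them."""
    joined = "\x1f".join(part.strip() for part in (title, source, body))
    return zlib.crc32(joined.encode("utf-8"))


def add_if_new(conn, title, source, body):
    """Insert the item unless its hash is already stored; return True if inserted."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO feed_items (crc, title, source, body) "
        "VALUES (?, ?, ?, ?)",
        (content_crc(title, source, body), title, source, body),
    )
    # rowcount is 0 when the UNIQUE constraint caused the insert to be skipped.
    return cur.rowcount == 1


# Hypothetical schema: the UNIQUE constraint on crc does the duplicate check.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feed_items (crc INTEGER UNIQUE, title TEXT, source TEXT, body TEXT)"
)
```

Keep in mind that CRC-32 is only 32 bits wide, so two different items can occasionally produce the same value and the second one would be skipped. A longer hash makes that much less likely.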

Upvotes: 2
