Reputation: 171218
My application reads many public RSS feeds which are not under my control. Unfortunately I have run into various problems (XML entities inside CDATA sections that should just be literal characters, missing required elements, ...). I was able to work around all of those by adding detection routines. Now I have found a feed which sends guids, but always the same 10 guids for different articles! How am I supposed to detect new feed items now?
And this is what I mean by Internet-safe: I need an RSS lib which can shield me from malformed feeds, works with feeds of 1500 entries (I have seen that too...), and does reliable new-item detection. Can anyone share a recommendation for .NET?
Upvotes: 0
Views: 113
Reputation: 524
New item detection is a pain, but hashing can help out a lot. Personally I prefer to hash the entire file and store that hash, so an unchanged feed can be recognized on the next fetch. Then, as you hit each item, hash its InnerXml and check whether you already have that hash. Hashing each item also helps you catch updates when the GUID stays the same. I used to try to use the GUID but it's just not worth the pain. Here's an MD5 function I used in an RSS engine under .NET 2.0; not sure if there is a better way under 4.0.
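Imports System.Security.Cryptography
Imports System.Text

' Returns the MD5 hash of a string as 32 lowercase hex characters.
Function getMD5Hash(ByVal strToHash As String) As String
    ' UTF-8 rather than ASCII: the ASCII encoder turns every non-ASCII
    ' character into "?", so items that differ only in such characters
    ' would produce the same hash.
    Dim bytesToHash() As Byte = Encoding.UTF8.GetBytes(strToHash)
    Using md5Obj As MD5 = MD5.Create()
        bytesToHash = md5Obj.ComputeHash(bytesToHash)
    End Using
    ' StringBuilder avoids allocating a new string on every iteration.
    Dim sb As New StringBuilder(bytesToHash.Length * 2)
    For Each b As Byte In bytesToHash
        sb.Append(b.ToString("x2"))
    Next
    Return sb.ToString()
End Function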
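A minimal sketch of the per-item check described above, not a definitive implementation. It assumes RSS 2.0 style `<item>` elements without a namespace, .NET 3.5 or later for `HashSet`, and an in-memory set standing in for whatever store the application really uses. `HashText`, `FindNewItems` and `seenHashes` are hypothetical names.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Security.Cryptography
Imports System.Text
Imports System.Xml.XPath

Module NewItemDetection
    ' Hex-encoded MD5 of a string, hashed as UTF-8 bytes.
    Function HashText(ByVal text As String) As String
        Using md5Obj As MD5 = MD5.Create()
            Dim hash() As Byte = md5Obj.ComputeHash(Encoding.UTF8.GetBytes(text))
            Dim sb As New StringBuilder(hash.Length * 2)
            For Each b As Byte In hash
                sb.Append(b.ToString("x2"))
            Next
            Return sb.ToString()
        End Using
    End Function

    ' Returns the inner XML of every <item> whose hash is not yet in
    ' seenHashes, and records each new hash as it goes.
    ' Assumption: seenHashes is a stand-in for a persistent store.
    Function FindNewItems(ByVal feedXml As String, ByVal seenHashes As HashSet(Of String)) As List(Of String)
        Dim newItems As New List(Of String)
        Dim doc As New XPathDocument(New StringReader(feedXml))
        Dim items As XPathNodeIterator = doc.CreateNavigator().Select("//item")
        While items.MoveNext()
            Dim itemXml As String = items.Current.InnerXml
            ' HashSet.Add returns False when the hash was already present.
            If seenHashes.Add(HashText(itemXml)) Then
                newItems.Add(itemXml)
            End If
        End While
        Return newItems
    End Function

    Sub Main()
        Dim seen As New HashSet(Of String)
        Dim first As String = "<rss><channel><item><guid>1</guid><title>A</title></item></channel></rss>"
        ' Same guid reused for a different article, as in the question.
        Dim second As String = "<rss><channel><item><guid>1</guid><title>A</title></item>" &
                               "<item><guid>1</guid><title>B</title></item></channel></rss>"
        Console.WriteLine(FindNewItems(first, seen).Count)
        Console.WriteLine(FindNewItems(second, seen).Count)
    End Sub
End Module
```

Because the hash covers the whole item rather than the guid, the second call reports only the article titled B even though both items share guid 1. The trade-off is that any edit to an existing item also shows up as new, which is what lets the same check double as update detection.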
Can't help with the malformed feeds; that's just a fact of parsing RSS. I've seen XML cleaners as desktop apps but not as a library. Generally I log a parse error and alert if the same feed errors more than once within 24 hours. I've seen a number of feeds have issues for a few hours, I'm sure due to a code change that later got fixed.
Google seems to take this approach too. If the feed is borked they keep retrying until it gets fixed; I'm not sure how often they actually retry, but it seems to be somewhere between a few hours and a day. I found that out by watching a broken feed through Google's Atom URL to see when the newest item finally showed up: it was hours after I noticed the feed was fixed.
Here's the URL I used to check when the items appeared on Google's side: http://www.google.com/reader/atom/feed/[feedurl]?n=20
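The log-then-alert rule could look roughly like this. It is only a sketch: `RecordFailure` and the in-memory dictionary are hypothetical, the example URL is a placeholder, and "more than once" is interpreted here as a second failure within 24 hours of the previous one.

```vbnet
Imports System
Imports System.Collections.Generic

Module FeedErrorTracking
    ' Time of the most recent parse failure per feed URL. In-memory only;
    ' a real application would persist this (hypothetical store).
    Private ReadOnly lastFailure As New Dictionary(Of String, DateTime)

    ' Records a parse failure. Returns True when the same feed already
    ' failed within the previous 24 hours, i.e. when an alert is warranted;
    ' otherwise the caller just logs the error.
    Function RecordFailure(ByVal feedUrl As String, ByVal failedAt As DateTime) As Boolean
        Dim previous As DateTime
        Dim repeated As Boolean = False
        If lastFailure.TryGetValue(feedUrl, previous) Then
            repeated = (failedAt - previous) <= TimeSpan.FromHours(24)
        End If
        lastFailure(feedUrl) = failedAt
        Return repeated
    End Function

    Sub Main()
        Dim url As String = "http://example.com/feed"
        Dim t0 As New DateTime(2024, 1, 1, 8, 0, 0)
        Console.WriteLine(RecordFailure(url, t0))              ' first failure: log only
        Console.WriteLine(RecordFailure(url, t0.AddHours(2)))  ' repeat within 24 hours: alert
        Console.WriteLine(RecordFailure(url, t0.AddHours(30))) ' 28 hours after the last one: log only
    End Sub
End Module
```

In practice `RecordFailure` would be called from the `Catch` block around the feed parse, so a feed that breaks for a few hours and then recovers produces at most a handful of alerts.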
Don't use XmlDocument for RSS apps; stick with XmlReader or XPathDocument. XPathDocument plus an XPathNavigator is nice for detecting new nodes you haven't coded for.
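As a rough illustration of the XmlReader route, the sketch below streams through a feed and hands each item to a callback without building the whole document in memory, which matters for feeds with very many entries. `ForEachItem` and `CountItem` are hypothetical names, and the code assumes items are elements whose local name is `item`.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module StreamingItems
    ' Walks the feed forward-only and passes the inner XML of each <item>
    ' to handleItem. Malformed XML surfaces as an XmlException.
    Sub ForEachItem(ByVal source As TextReader, ByVal handleItem As Action(Of String))
        Using reader As XmlReader = XmlReader.Create(source)
            reader.MoveToContent()
            While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso reader.LocalName = "item" Then
                    ' ReadInnerXml leaves the reader on the node after </item>,
                    ' so there is no extra Read() here; calling one would skip
                    ' an item that follows immediately.
                    handleItem(reader.ReadInnerXml())
                Else
                    reader.Read()
                End If
            End While
        End Using
    End Sub

    Private itemCount As Integer = 0

    Sub CountItem(ByVal itemXml As String)
        itemCount += 1
    End Sub

    Sub Main()
        Dim feed As String = "<rss><channel><title>t</title>" &
                             "<item><title>A</title></item><item><title>B</title></item>" &
                             "</channel></rss>"
        ForEachItem(New StringReader(feed), AddressOf CountItem)
        Console.WriteLine(itemCount)
    End Sub
End Module
```

The callback is where per-item hashing or storage would go; only one item's XML is held as a string at any time.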
Upvotes: 1
Reputation: 138970
RSS feeds must be well-formed XML; otherwise they're not valid and will probably be discarded by standard RSS readers.
Are you reading these feeds with the .NET XmlDocument or XmlReader? If so, you shouldn't need any workarounds.
Upvotes: 0