Reputation: 6835
I am trying to parse an RSS channel with the simple-rss library.
Unfortunately, I get a lot of garbage in the description node:
<description><p>
some description
</p>
<a href="http://url.com/trac/xxx/wiki/foo?action=diff&amp;version=28">(diff)</a></description>
I need to retrieve the text ("some description") and, optionally, the URL.
What is the best way to do it? A regexp? (If that is the answer, could you give me an example, please?)
Upvotes: 0
Views: 742
Reputation: 15634
That's not garbage. It is just an HTML-escaped string of characters. I am assuming that by the URL you mean the link together with its HTML tags (<a></a>). The following code should work.
require 'cgi'
description = "</p> <a href=\"http://url.com/trac/xxx/wiki/foo?action=diff&amp;version=28\">(diff)</a>"
CGI.unescapeHTML(description) # => </p> <a href="http://url.com/trac/xxx/wiki/foo?action=diff&version=28">(diff)</a>
If you don't want the HTML tags, there are various ways to obtain just the URL. A simple regex for the URL should work, which I leave to you to figure out. (Hint: Google)
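For illustration, here is a minimal sketch of the regex approach, assuming the description contains at most one link and the markup is as simple as in the question. The helper name extract_text_and_url is hypothetical, not part of simple-rss or CGI; for messier feeds a real HTML parser such as Nokogiri is likely a safer choice.

```ruby
require 'cgi'

# Hypothetical helper (not part of simple-rss): returns the plain text
# and the first link target found in an HTML fragment.
def extract_text_and_url(html)
  # Capture the value of the first href attribute, then decode entities
  # such as &amp; so the URL is usable as-is.
  href = html[/<a\s[^>]*href="([^"]*)"/i, 1]
  url = href && CGI.unescapeHTML(href)

  # Remove whole <a>...</a> elements, replace the remaining tags with
  # spaces, decode entities, and collapse runs of whitespace.
  text = html.gsub(/<a\s.*?<\/a>/mi, '').gsub(/<[^>]+>/, ' ')
  text = CGI.unescapeHTML(text).split.join(' ')

  { text: text, url: url }
end

# Sample input modeled on the description from the question.
html = "<p>\nsome description\n</p>\n" \
       "<a href=\"http://url.com/trac/xxx/wiki/foo?action=diff&amp;version=28\">(diff)</a>"

result = extract_text_and_url(html)
puts result[:text]
puts result[:url]
```

The link is removed before the other tags are stripped so that its label, "(diff)", does not end up in the extracted text.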
Upvotes: 3