Removing XML entities from string in Ruby

Question

I am trying to parse an RSS channel with the simple-rss library.

Unfortunately, I get a lot of garbage in the node:

 <p>
some description

</p>
 <a href="http://url.com/trac/xxx/wiki/foo?action=diff&amp;version=28">(diff)</a>

I need to retrieve the text ("some description") and, optionally, the URL.

What is the best way to do it? A regexp? If so, could you give me an example, please?

Chirantan · Accepted Answer

That's not garbage. It is just an HTML-escaped string. By "the url" I assume you mean the URL together with its HTML tags. The following code should work.

require 'cgi'
description = "</p> <a href=\"http://url.com/trac/xxx/wiki/foo?action=diff&amp;version=28\">(diff)</a>"
CGI.unescapeHTML(description) # => "</p> <a href=\"http://url.com/trac/xxx/wiki/foo?action=diff&version=28\">(diff)</a>"

If you don't want the HTML tags, there are various ways to obtain just the URL. A simple regex for the URL should work, which I leave to you to figure out. (Hint: Google)
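For what it's worth, here is a rough sketch of the regex approach mentioned above, using only the standard library. The helper name extract_text_and_url is made up for illustration and is not part of simple-rss. The sample string assumes the feed delivers the markup fully entity-escaped, which may not match your feed exactly. Regexes are fragile on real-world HTML, so treat this as a starting point rather than a robust parser.

```ruby
require 'cgi'

# Hypothetical helper (not part of simple-rss): returns [text, url]
# for an entity-escaped HTML fragment such as an RSS description node.
def extract_text_and_url(escaped)
  html = CGI.unescapeHTML(escaped)             # turn &lt;p&gt; back into <p>

  # Grab the href of the first <a> tag, if there is one.
  href = html[/<a\s[^>]*href="([^"]*)"/i, 1]
  url  = href && CGI.unescapeHTML(href)        # attribute values may still carry &amp;

  text = html.gsub(/<a\b.*?<\/a>/mi, ' ')      # drop the whole link, including "(diff)"
             .gsub(/<[^>]+>/, ' ')             # drop any remaining tags
             .split.join(' ')                  # collapse whitespace
  [text, url]
end

# Assumed shape of the description node (entity-escaped markup).
sample = "&lt;p&gt;\nsome description\n\n&lt;/p&gt;\n " \
         "&lt;a href=\"http://url.com/trac/xxx/wiki/foo?action=diff&amp;amp;version=28\"&gt;(diff)&lt;/a&gt;"

text, url = extract_text_and_url(sample)
puts text
puts url
```

The helper returns nil for the URL when the fragment contains no link, so check for that before using it.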
