Nokogiri - parsing multi-line `<link>` tag as link and text

Question

I am using Nokogiri to parse the XML response of a podcast's RSS feed, and I am trying to grab a particular piece of data: the link to each episode.

The relevant bit is below:

<item>
  <title>An awesome title!</title>
  ...
  <link>
    http://www.foobar.com/episodes/1
  </link>
</item>

Nokogiri appears to be having a hard time grabbing the <link> tag, though. I am able to get the <item> tag as a Nokogiri::XML::Element object, and I can grab the title just fine with node.css('title').text, but when I try the same with node.css('link').text, I get a blank string.

I tried calling node.children.to_a to examine all of the children of this node, and I noticed something odd: the text inside the <link> tag is being parsed as a separate child:

[0] = {Nokogiri::XML::Element} An awesome title!

[1] = {Nokogiri::XML::Element} 
[2] = {Nokogiri::XML::Text} http://www.foobar.com/episodes/1

Is there a way I can help Nokogiri properly parse this multi-line tag so that I can grab the text inside?

UPDATE: Here is the exact code I'm executing when I run into the issue.

require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('https://rss.acast.com/abroadinjapan')) # Returns Nokogiri::HTML::Document
node = doc.css('//item').first # Returns Nokogiri::XML::Element
node.css('title').text # Returns "Abroad in Japan: Two weeks more in Japan!"
node.css('link').text # Returns ""
node.css('link').inner_text # Also returns "" - saw this elsewhere and thought I'd try it
node.children.to_a # Result, parsed by RubyMine for readability:

result = Array (14 elements)
 [0] = {Nokogiri::XML::Element} Abroad in Japan: Two weeks more in Japan!

 [1] = {Nokogiri::XML::Element} Chris and Pete return and they've planned out a very different route through Northern Japan.&nbsp;


Our Google Map can be found here:&nbsp;
goo.gl/3t4t3q&nbsp;


Get in touch:&nbsp;abroadinjapanpodcast@gmail.com&nbsp;
More Abr...
 [2] = {Nokogiri::XML::Element} 
 [3] = {Nokogiri::XML::Element} 
 [4] = {Nokogiri::XML::Element} Wed, 16 May 2018 21:00:00 GMT
 [5] = {Nokogiri::XML::Element} 01:00:00
 [6] = {Nokogiri::XML::Element} 
 [7] = {Nokogiri::XML::Element} no
 [8] = {Nokogiri::XML::Element} full
 [9] = {Nokogiri::XML::Element} 
 [10] = {Nokogiri::XML::Element} Chris and Pete return and they've planned out a very different route through Northern Japan. 

Our Google Map can be found here: 
goo.gl/3t4t3q 


Get in touch: abroadinjapanpodcast@gmail.com 
More Abroad In Japan shows available below, do subscribe, rate and review us on iTunes, and please tell your friends! 


http://www.radiostakhanov.com/abroadinjapan/
]]>
 [11] = {Nokogiri::XML::Element} 
 [12] = {Nokogiri::XML::Text} https://www.acast.com/abroadinjapan/abroadinjapan-twoweeksmoreinjapan-
                
 [13] = {Nokogiri::XML::Element}

NOTE: One of the URLs above uses a URL shortener, which SO doesn't like, so I replaced it with foobar.com.

Casper · Accepted Answer

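The fix is a lot simpler than you would think. An RSS feed is XML, not HTML, so parse it with Nokogiri::XML instead of Nokogiri::HTML. The HTML parser treats <link> as an empty (void) element, which is why the URL ends up in a separate text node next to it: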

doc = Nokogiri::XML(URI.open('...')) # URI.open comes from open-uri; plain open() no longer accepts URLs on Ruby 3.0+

Ruby also ships with an RSS library, which might be better suited for something like this:

require 'rss'
require 'open-uri'
doc = RSS::Parser.parse(URI.open('...')) # URI.open comes from open-uri
doc.items.first.link
=> "https://...."
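
The same idea can be tried offline by passing an XML string to RSS::Parser.parse. This is only a sketch: the feed below is a hypothetical stand-in for the real one, with the channel title, link and description filled in because the parser validates RSS 2.0 by default:

```ruby
require 'rss'

# Hypothetical RSS 2.0 feed; the channel-level title, link and
# description are included so the default validation passes.
feed_xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example podcast</title>
      <link>http://www.foobar.com/</link>
      <description>Hypothetical feed for illustration</description>
      <item>
        <title>An awesome title!</title>
        <link>
          http://www.foobar.com/episodes/1
        </link>
      </item>
    </channel>
  </rss>
XML

feed = RSS::Parser.parse(feed_xml)
item = feed.items.first

# strip guards against whitespace kept from the multi-line tag.
link = item.link.strip

puts item.title
puts link
```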
