Firebringer

Reputation: 13

How to correctly parse this bad HTML in Nokogiri?

I'm trying to parse this HTML with Nokogiri:

<div class="times">
<span style="color:"><span style="padding:0 ">&lrm;</span><!--  -->16:45&lrm;</span>
<span style="color:"><span style="padding:0 "> &nbsp;&lrm;</span><!--  -->19:30&lrm;</span> 
<span style="color:"><span style="padding:0 "> &nbsp;&lrm;</span><!--  -->22:10&lrm;</span>
</div>

I only want to get the times, inserted in an array.

I set up a gsub like this:

 block.css('div.times span').text.gsub(" ","").gsub("&nbsp","")

But then I end up with a single string and I'm kind of stuck. Is there an efficient way to do this?

Upvotes: 1

Views: 472

Answers (2)

matt

Reputation: 79723

One thing you could do is to leave the whitespace in the string, and then use String#split to convert it to an array:

block.css('div.times span').text.gsub("&nbsp","").split(' ')

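In this case you will need to strip out the left-to-right marks as well. Nokogiri decodes entities, so the text contains the actual non-breaking space character (\u00a0) rather than the string "&nbsp", and String#split(' ') does not treat that character as whitespace, so strip it too. Selecting div.times > span also avoids picking up the text of the nested spans a second time:

block.css('div.times > span').text.gsub(/[\u200e\u00a0]/, '').split(' ')

(\u200e is the left-to-right mark, \u00a0 is the non-breaking space.)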
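To check the string handling on its own, here is a minimal sketch that works on a plain string standing in for the decoded text of the three outer spans. The sample string is an assumption about what Nokogiri returns for this markup (entities decoded to U+200E and U+00A0), and raw and times are just illustrative names. It strips the non-breaking spaces as well, since split(' ') appears to split on ASCII whitespace only:

```ruby
# Stand-in for the decoded text of the three outer spans (an assumption):
# U+200E is the left-to-right mark, U+00A0 is the non-breaking space.
raw = "\u200e16:45\u200e \u00a0\u200e19:30\u200e \u00a0\u200e22:10\u200e"

# Remove both invisible characters, then split on the remaining ASCII spaces.
times = raw.gsub(/[\u200e\u00a0]/, '').split(' ')

p times
```

If the real text contains other invisible characters, they would need to be added to the character class.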

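An alternative with Nokogiri is to use XPath instead of CSS, which lets you select just the text nodes you want directly, then use map to convert them to an array of strings. The leading .// keeps the search relative to block, and each text node still ends with a left-to-right mark, so it is stripped here:

block.xpath('.//div[@class="times"]/span/text()').map { |node| node.text.delete("\u200e") }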

Upvotes: 1

pguardiario

Reputation: 54984

Easiest is probably:

block.at('div.times').text.scan(/\d{2}:\d{2}/)
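A small sketch of why scan is convenient here: it returns every match as an array and simply skips whatever sits between the matches, so the invisible characters never need to be removed. The sample string is an assumed stand-in for the text of div.times, not output captured from Nokogiri:

```ruby
# Assumed stand-in for the decoded text of div.times, including the
# left-to-right marks (U+200E), non-breaking spaces (U+00A0) and newlines.
raw = "\n\u200e16:45\u200e\n \u00a0\u200e19:30\u200e \n \u00a0\u200e22:10\u200e\n"

# scan collects every HH:MM match and ignores everything else.
times = raw.scan(/\d{2}:\d{2}/)

p times
```

The pattern assumes two-digit hours; if single-digit hours can occur, something like /\d{1,2}:\d{2}/ would be needed instead.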

Upvotes: 2
