How to correctly parse this bad HTML in Nokogiri?

Question

I'm trying to parse this HTML with Nokogiri:


‎16:45‎
  ‎19:30‎ 
  ‎22:10‎

I only want the times, collected into an array.

I set up a gsub like this:

block.css('div.times span').text.gsub("\u00a0", "").gsub(" ", "")

But then I end up with a single string and I'm kind of stuck. Is there an efficient way to do this?

matt · Accepted Answer

One thing you could do is to leave the whitespace in the string, and then use String#split to convert it to an array:

block.css('div.times span').text.gsub("\u00a0", "").split(' ')

In this case you might need to strip out the left-to-right marks as well, and I don't think you need to replace the non-breaking spaces, so you could try this:

block.css('div.times span').text.gsub("\u200e", '').split(' ')

(\u200e is the Unicode left-to-right mark, U+200E, an invisible formatting character.)
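
To see the cleanup step in isolation, here is a minimal sketch that applies the same gsub and split to a plain string. The raw value is an assumed stand-in for what .text returns here (times wrapped in left-to-right marks and separated by whitespace), not the actual page content:

```ruby
# Hypothetical stand-in for block.css('div.times span').text:
# each time is wrapped in U+200E marks and separated by spaces/newlines.
raw = "\u200e16:45\u200e\n  \u200e19:30\u200e \n  \u200e22:10\u200e"

# Remove the invisible left-to-right marks, then split on runs of whitespace.
times = raw.gsub("\u200e", '').split(' ')

p times
```

Because split(' ') ignores leading whitespace and treats runs of whitespace as a single separator, the uneven spacing around each time does not produce empty entries.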

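An alternative with Nokogiri is to use XPath instead of CSS, which lets you select just the text nodes you want directly, then use map to convert them to an array of strings:

block.xpath('.//div[@class="times"]/span/text()').map(&:text)

(The leading . keeps the search relative to block; a bare // would search the whole document. The resulting strings may still contain the left-to-right marks and surrounding whitespace, so you may want to strip those too.)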
