Ronan Lopes

Reputation: 3398

How to split by HTML tags using a regex

I have a string like this:

"Energia Elétrica kWh<span class=\"_ _3\"> </span>  10.942 <span class=\"_ _4\"> </span> 0,74999294 <span class=\"_ _5\"> </span>     8.206,39"

and I want to split it by its HTML tags, which are always <span>. I tried something like:

my_string.split(/<span(.*)span>/) 

but it didn't work; it only matched the first element correctly.

Does anyone know what is wrong with my regex? In this example, I expected the returned value to be:

["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]

I would like something like strip_tags, but instead of returning the sanitized string, it should return an array of the pieces left after splitting at the removed tags.

Upvotes: 1

Views: 1859

Answers (2)

the Tin Man

Reputation: 160551

Don't use a pattern to manipulate HTML. It's a path destined to make you insane.

Instead, use an HTML parser. The standard for Ruby is Nokogiri:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse("Energia Elétrica kWh<span class=\"_ _3\"> </span>  10.942 <span class=\"_ _4\"> </span> 0,74999294 <span class=\"_ _5\"> </span>     8.206,39")

You could use text to extract all the text, but if it's structured data you're after, that often makes it difficult to extract the fields: the text nodes are concatenated, which can result in run-on words, so be careful there:

doc.text # => "Energia Elétrica kWh   10.942   0,74999294       8.206,39"

Instead we typically extract the data from individual nodes:

doc.search('span')[1].next_sibling.text # => " 0,74999294 "
doc.search('span').last.next_sibling.text # => "     8.206,39"

Or, we iterate over the nodes, then use map to grab the node's text:

doc.search('span').map{ |span| span.next_sibling.text.strip }
# => ["10.942", "0,74999294", "8.206,39"]

I'd go about the problem like this:

data = [doc.at('span').previous_sibling.text.strip] # => ["Energia Elétrica kWh"]
data += doc.search('span').map{ |span| span.next_sibling.text.strip } 
# => ["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]

Or:

spans = doc.search('span')
data = [
  spans.first.previous_sibling.text,
  *spans.map{ |span| span.next_sibling.text }
].map(&:strip)
# => ["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]

While a regular expression can often work on an initial attempt, a change in the format of the HTML can break the pattern, forcing a fix, then another, and then another, until the pattern is too convoluted to maintain. A properly written parser approach will typically be far more resilient to such changes.
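To make that failure mode concrete, here is a small sketch using only Ruby's standard library. The nested markup is a hypothetical format change, not something taken from the question, and the lazy pattern is just one plausible regex a first attempt might settle on.

```ruby
# One plausible "works on the first attempt" pattern (illustrative only).
pattern = /<span.+?span>/

# Format seen so far: flat, non-nested spans.
flat = "kWh<span class=\"_ _3\"> </span>10.942"
flat_fields = flat.split(pattern)
p flat_fields   # => ["kWh", "10.942"]

# Hypothetical format change: a span nested inside another span.
# The lazy match now stops at the inner opening tag, so the closing
# tags leak into the data.
nested = "kWh<span class=\"_ _3\"><span> </span></span>10.942"
nested_fields = nested.split(pattern)
p nested_fields # => ["kWh", " </span></span>10.942"]
```

The pattern could be patched to cope with this case, but each such patch is exactly the kind of incremental change described above.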

Upvotes: 4

Steven B

Reputation: 61

If you really need to use regex to do this, you pretty much had it already.

irb(main):010:0> string.split(/<span.+?span>/)
=> ["Energia Elétrica kWh", "  10.942 ", " 0,74999294 ", "     8.206,39"]

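You just needed the ? after the quantifier to tell it to match as little as possible, instead of running on to the last span> in the string. Dropping the parentheses matters too, because split includes the text of any capture group in the returned array. Note that the pieces still carry their surrounding whitespace.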
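As a rough sketch of the difference, the snippet below runs the greedy pattern from the question next to a lazy one on the same input, then strips each piece to get the array the question asked for. The stricter `span_element` pattern and the variable names are assumptions made for this illustration, and it only handles flat, non-nested spans like the ones in the example.

```ruby
# Input copied from the question.
my_string = "Energia Elétrica kWh<span class=\"_ _3\"> </span>  10.942 " \
            "<span class=\"_ _4\"> </span> 0,74999294 " \
            "<span class=\"_ _5\"> </span>     8.206,39"

# Greedy attempt from the question: .* runs to the LAST "span>", and the
# capture group makes split return the captured text as an extra element.
greedy = my_string.split(/<span(.*)span>/)
p greedy.length # => 3 (first field, captured markup, last field)

# Lazy, slightly stricter pattern (an assumption for this sketch): one
# opening <span ...> tag, as little content as possible, then </span>.
span_element = /<span\b[^>]*>.*?<\/span>/m

# Split, trim the whitespace around each piece, and drop empty pieces.
fields = my_string.split(span_element).map(&:strip).reject(&:empty?)
p fields
# => ["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]
```

The `reject(&:empty?)` step only matters when two spans sit next to each other or the string starts with a span; for this input it removes nothing.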

Upvotes: 1
