Reputation: 295
I have a string in Rails that contains HTML. For example,
<p>01/28/2016 Green RED Horse!!123 456</p>
<a href="http://greenredhorse.com" style="margin-left:283px;margin-
top:50px;margin-bottom:150px;overflow:auto;position:absolute;">
<img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png"
style="width:266px" /> </a>
<p>01/28/2017 RED Horse!!123 456</p>
How would I go about removing the link tag and everything between its beginning and end from the string?
The end result should look like this.
<p>01/28/2016 Green RED Horse!!123 456</p>
<p>01/28/2017 RED Horse!!123 456</p>
In short: how can I delete everything from <a through </a>, inclusive, without changing the rest of the string?
Upvotes: 1
Views: 1346
Reputation: 1716
string = <<HTML
<a-tag atr="attr">hi<a>atag</a></a-tag>
<a sdf="</a>"> hola</ a>
HTML
pattern = /<a(?:\s*>|\s+(?:(?:[^=\s]*?(?:=(?:(?:"[^"]*?")|(?:'[^']*?')))?)\s*)*>).*?<\/\s*a>/mi
string.gsub!(pattern, '')
puts string #=> <a-tag atr="attr">hi</a-tag>
Something like this, assuming that html is the string you want to parse:
html.gsub!(/<a[\s>].*?<\/a>/m, '')
You can use this if you have small sets of data similar to the one you posted. If you want a more robust and bug-free solution, use Nokogiri; take a look at the Tin Man's answer.
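To make the regex approach easier to reuse, here is a minimal sketch that wraps it in a method. The names ANCHOR_PATTERN and strip_anchor_tags are made up for illustration, and the sample input is a shortened version of the question's string. The lookahead keeps the pattern from touching tags that merely start with "a", such as <a-tag> or <abbr>. It still assumes no "</a>" appears inside an attribute value.

```ruby
# Hypothetical helper (not from the original answers): removes every
# <a ...>...</a> element from an HTML string with a regex.
# Only suitable for small, predictable inputs.
ANCHOR_PATTERN = %r{<a(?=[\s>]).*?</\s*a\s*>}mi

def strip_anchor_tags(html)
  # gsub returns a new string; the caller's string is left unchanged
  html.gsub(ANCHOR_PATTERN, '')
end

html = <<~HTML
  <p>01/28/2016 Green RED Horse!!123 456</p>
  <a href="http://greenredhorse.com" style="margin-left:283px">
  <img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png" /> </a>
  <p>01/28/2017 RED Horse!!123 456</p>
HTML

puts strip_anchor_tags(html)
```

The newlines that surrounded the link stay in place, so the two paragraphs end up separated by an empty line.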
Upvotes: 4
Reputation: 160551
I wouldn't use regex. Regular expressions might work, but the odds of them breaking when the HTML layout changes are very high.
Instead I'd use:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>01/28/2016 Green RED Horse!!123 456</p>
<a href="http://greenredhorse.com" style="margin-left:283px;margin-
top:50px;margin-bottom:150px;overflow:auto;position:absolute;">
<img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png"
style="width:266px" /> </a>
<p>01/28/2017 RED Horse!!123 456</p>
EOT
doc.at('a').remove
puts doc.to_html
# >> <p>01/28/2016 Green RED Horse!!123 456</p>
# >>
# >> <p>01/28/2017 RED Horse!!123 456</p>
This uses at, which means "find the first occurrence of the desired selector"; 'a' is a CSS selector.
Nokogiri is the de facto standard for HTML/XML parsing in Ruby. If you're going to be doing regular work with XML/HTML, it is well worth learning to use it.
Upvotes: 3
Reputation: 18762
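You could use XPath to look up the elements of interest. The example below selects the <p> elements to keep rather than deleting the <a>, and REXML requires the input to be well-formed XML.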
require 'rexml/document'
include REXML
snippet = <<-eos
<p>01/28/2016 Green RED Horse!!123 456</p>
<a href="http://greenredhorse.com" style="margin-left:283px;margin-
top:50px;margin-bottom:150px;overflow:auto;position:absolute;">
<img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png"
style="width:266px" /> </a>
<p>01/28/2017 RED Horse!!123 456</p>
eos
well_formed_snippet = "<html>#{snippet}</html>"
xmldoc = Document.new(well_formed_snippet)
p XPath.match(xmldoc, "//p").map(&:to_s)
#=> ["<p>01/28/2016 Green RED Horse!!123 456</p>", "<p>01/28/2017 RED Horse!!123 456</p>"]
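If the goal is to delete the link rather than select the paragraphs, the same REXML approach can be turned around. This is a rough sketch, with remove_links_rexml as a hypothetical method name. It assumes the snippet is well-formed XML and that rexml is available (it ships as a bundled gem in recent Ruby versions).

```ruby
require 'rexml/document'

# Hypothetical helper: deletes every <a> element from a snippet
# that is well-formed XML and returns what is left.
def remove_links_rexml(snippet)
  # Wrap in a single root so REXML accepts the fragment, as in the answer above
  doc = REXML::Document.new("<html>#{snippet}</html>")
  # Collect the matches first, then detach each one from its parent
  REXML::XPath.match(doc, '//a').each(&:remove)
  # Serialize the root's remaining children (elements and text nodes)
  doc.root.children.map(&:to_s).join
end

puts remove_links_rexml(%(<p>one</p>\n<a href="x"> <img src="y" /> </a>\n<p>two</p>))
```

Unlike Nokogiri's HTML parser, REXML raises a parse error on markup such as an unclosed <img>, so this only suits input you know to be valid XML.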
Upvotes: 2