Reputation: 295

Remove everything from a string between two sequences

I have a string in Rails that contains HTML. For example,

<p>01/28/2016 Green RED Horse!!123 456</p>
<a href="http://greenredhorse.com" style="margin-left:283px;margin-
top:50px;margin-bottom:150px;overflow:auto;position:absolute;">
<img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png" 
style="width:266px" /> </a>
<p>01/28/2017 RED Horse!!123 456</p>

How would I go about removing the link tag and everything between its beginning and end from the string?

The end result should look like this.

<p>01/28/2016 Green RED Horse!!123 456</p>
<p>01/28/2017 RED Horse!!123 456</p>

In short: how can I delete everything between <a and </a>, inclusive, without changing the rest of the string?

Upvotes: 1

Answers (3)

mtkcs

Reputation: 1716

Update: Better regex than the older one below.

string = <<HTML
<a-tag atr="attr">hi<a>atag</a></a-tag>
<a sdf="</a>"> hola</ a>
HTML
pattern = /<a(?:\s*>|\s+(?:(?:[^=\s]*?(?:=(?:(?:"[^"]*?")|(?:'[^']*?')))?)\s*)*>).*?<\/\s*a>/mi

string.gsub!(pattern, '')
puts string #=> <a-tag atr="attr">hi</a-tag>

Older answer

Something like this, assuming that html is the string you want to parse:

# Parentheses avoid Ruby's "ambiguous first argument" warning for a regex literal.
html.gsub!(/<a\s?.+?a>/m, '')

You can use this if you have small sets of data similar to the one you posted. If you want a more robust and bug-free solution, use Nokogiri; see the Tin Man's answer.
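As a rough sketch of the regex approach applied to the markup from the question: the helper name strip_anchor_tags is hypothetical, the style attribute is abbreviated, and the pattern is a simplified variant rather than either pattern above. It assumes no attribute value contains ">" or "</a>" and that anchors are not nested.

```ruby
# Hypothetical helper: removes every <a ...>...</a> span plus trailing whitespace.
# (?=[\s>]) requires whitespace or ">" right after "<a", so tags such as
# <a-tag> or <abbr> are left alone.
def strip_anchor_tags(html)
  html.gsub(/<a(?=[\s>]).*?<\/a>\s*/mi, '')
end

html = <<~HTML
  <p>01/28/2016 Green RED Horse!!123 456</p>
  <a href="http://greenredhorse.com" style="margin-left:283px;">
  <img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png"
  style="width:266px" /> </a>
  <p>01/28/2017 RED Horse!!123 456</p>
HTML

cleaned = strip_anchor_tags(html)
puts cleaned
# <p>01/28/2016 Green RED Horse!!123 456</p>
# <p>01/28/2017 RED Horse!!123 456</p>
```

The trailing \s* in the pattern swallows the newline after </a>, which is why no blank line is left between the two paragraphs.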

Upvotes: 4

the Tin Man

Reputation: 160551

I wouldn't use regex. Regular expressions might work, but the odds of them breaking when the HTML layout changes are very high.

Instead I'd use:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>01/28/2016 Green RED Horse!!123 456</p>
<a href="http://greenredhorse.com" style="margin-left:283px;margin-
top:50px;margin-bottom:150px;overflow:auto;position:absolute;">
<img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png" 
style="width:266px" /> </a>
<p>01/28/2017 RED Horse!!123 456</p>
EOT

doc.at('a').remove

puts doc.to_html
# >> <p>01/28/2016 Green RED Horse!!123 456</p>
# >> 
# >> <p>01/28/2017 RED Horse!!123 456</p>

This uses the at method, which means "find the first occurrence of the desired selector." 'a' is a CSS selector.

Nokogiri is the de facto standard for HTML/XML parsing in Ruby. If you're going to be doing regular work with XML/HTML, it is well worth learning to use it.

Upvotes: 3

Wand Maker

Reputation: 18762

You could use XPath to look up the elements of interest. Here that means selecting the p elements you want to keep rather than deleting the link.

require 'rexml/document'
include REXML

snippet = <<-eos
<p>01/28/2016 Green RED Horse!!123 456</p>
<a href="http://greenredhorse.com" style="margin-left:283px;margin-
top:50px;margin-bottom:150px;overflow:auto;position:absolute;">
<img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png" 
style="width:266px" /> </a>
<p>01/28/2017 RED Horse!!123 456</p>
eos

well_formed_snippet = "<html>#{snippet}</html>"

xmldoc = Document.new(well_formed_snippet)
p XPath.match(xmldoc, "//p").map(&:to_s)
#=> ["<p>01/28/2016 Green RED Horse!!123 456</p>", "<p>01/28/2017 RED Horse!!123 456</p>"]
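The XPath lookup above selects the paragraphs to keep. If you would rather delete the link itself, a minimal sketch along the same lines might look like this. It assumes the fragment is well-formed XML once wrapped in a root element, uses an abbreviated style attribute, and the variable names are illustrative only.

```ruby
require 'rexml/document'

snippet = <<~HTML
  <p>01/28/2016 Green RED Horse!!123 456</p>
  <a href="http://greenredhorse.com" style="margin-left:283px;">
  <img alt="Logo" src="http://greenredhorse.com/images/icons/logo.png"
  style="width:266px" /> </a>
  <p>01/28/2017 RED Horse!!123 456</p>
HTML

# REXML needs a single root element, so wrap the fragment first.
doc = REXML::Document.new("<html>#{snippet}</html>")

# Collect all <a> elements, then detach each one (and its children) from the tree.
REXML::XPath.match(doc, '//a').each(&:remove)

# Serialize the element children that remain under the wrapper.
remaining = doc.root.elements.map(&:to_s)
puts remaining
```

REXML is a strict XML parser, so this only works for markup that is already well-formed; for real-world HTML, an HTML parser such as Nokogiri is the safer choice.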

Upvotes: 2
