Satchel

Reputation: 16734

How do I remove a substring from a string in Ruby?

I have the following string, and I want to remove everything between the <EMAIL> tag including the tag itself:

"Great, I will send you something at [email protected].\n    <EMAIL><ADDRESS>[email protected]</ADDRESS><SUBJECT>Quick note on [email protected]</SUBJECT>\n      <BODY>Hi, just dropping you a quick note.</BODY></EMAIL>" 

I use the following to remove it:

string =  string.gsub(/<EMAIL>(.*)<\/EMAIL>/, '').strip

It does not work.

When I remove the \n from the string (I'd prefer not to, because it makes formatting and inputting more limiting), I get the following:

=> "Great, I will send you something at [email protected]."

In other words, it works when I remove that.

How do I change my gsub call to handle \n, and why does the \n cause the failure?

Upvotes: 0

Views: 262

Answers (2)

the Tin Man

Reputation: 160621

What you're doing can work, but it's very fragile, and as a result is not recommended. Instead, use a parser like Nokogiri:

require 'nokogiri'

str = "Great, I will send you something at [email protected].\n    <EMAIL><ADDRESS>[email protected]</ADDRESS><SUBJECT>Quick note on [email protected]</SUBJECT>\n      <BODY>Hi, just dropping you a quick note.</BODY></EMAIL>"

Here's how to parse the document:

doc = Nokogiri::XML::DocumentFragment.parse(str)

If the string were a valid XML document, I could use a shorter method to parse it:

doc = Nokogiri::XML(str)

Now find and remove the tag and its contents:

doc.at('EMAIL').remove
puts doc.to_xml
# >> Great, I will send you something at [email protected].

at finds the first node matching a selector, in this case the <EMAIL> tag via a CSS selector. Related methods find all matching nodes (search), or are specific to CSS selectors (at_css, css) or XPath selectors (at_xpath, xpath).

XML/HTML parsers break the text down into nodes, making it easy to find things and manipulate them. The text can change, and as long as it's valid HTML or XML, properly written code will continue to work.

See the obligatory "RegEx match open tags except XHTML self-contained tags".

Regular expressions break down badly if there are embedded duplicate tags, something like:

<b>bold <i>italic <b>another bold</b></i></b>

Trying to strip the <b> tags with patterns only would be painful. It's more easily done with a parser.
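To make the nesting problem concrete, here is a minimal sketch using only core Ruby. It removes whole <b> elements rather than just the tags, and the variable names and the second sample string are mine, not part of the question:

```ruby
html = "<b>bold <i>italic <b>another bold</b></i></b>"

# Non-greedy: the match ends at the first </b>, which closes the inner <b>,
# so the closing </i></b> of the outer elements is left behind.
lazy = html.gsub(%r{<b>.*?</b>}m, '')
p lazy    # "</i></b>"

# Greedy: the match runs to the last </b>, which happens to work here...
greedy = html.gsub(%r{<b>.*</b>}m, '')
p greedy  # ""

# ...but the same greedy pattern also swallows the text between two siblings.
siblings = "<b>x</b> keep <b>y</b>"
greedy_siblings = siblings.gsub(%r{<b>.*</b>}m, '')
p greedy_siblings  # ""
```

Neither quantifier handles both inputs, because a pattern like this has no notion of which closing tag belongs to which opening tag. A parser does.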

If I were absolutely bound and determined to do it without using a parser, this would work:

foo = "Great, I will send you something at [email protected].\n <EMAIL><ADDRESS>asdf</ADDRESS><SUBJECT>sdfg</SUBJECT>\n <BODY>dfgh</BODY></EMAIL>" 
foo.gsub(%r#<EMAIL>.*?</EMAIL>#im, '').strip
# => "Great, I will send you something at [email protected]."

Or:

foo.gsub(%r#\s*<EMAIL>.*?</EMAIL>\s*#im, '')
# => "Great, I will send you something at [email protected]."

I prefer the first of these two because it's visually clearer.

Use the i flag to make the pattern case-insensitive: it'll match both <email> and <EMAIL>. Use the m flag to let . match line-ends as if they were normal characters. By default . does not match a line-end, so a pattern built on .* cannot span the lines of a string with embedded line-ends.
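A small sketch of what each flag changes, using a shortened sample string of my own rather than the one from the question:

```ruby
text = "a <EMAIL>one\ntwo</EMAIL> b"

# No flags: `.` stops at "\n", so the pattern cannot reach </EMAIL>.
no_flag = text.gsub(/<EMAIL>.*?<\/EMAIL>/, '')

# m flag: `.` also matches "\n", so the whole element is removed.
with_m = text.gsub(/<EMAIL>.*?<\/EMAIL>/m, '')

# i flag: case is ignored, so a lowercase tag pair is matched as well.
lower = "a <email>one\ntwo</email> b".gsub(/<EMAIL>.*?<\/EMAIL>/im, '')

p no_flag == text  # true, nothing was removed
p with_m           # "a  b"
p lower            # "a  b"
```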

I'd prefer not to because it makes formatting and inputting more limiting

Sometimes it's easier to strip something like a trailing newline in the pattern, then re-add it later. If the choice is between maintaining a little Ruby code or a complicated pattern, I'd take the Ruby code. Patterns are powerful and I use them, but they're not the answer to everything.
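As one possible reading of that trade-off, the sketch below keeps the pattern simple and does the whitespace cleanup in plain Ruby afterwards. The sample text and the cleanup rule (drop lines that end up blank) are assumptions of mine, and that rule would also drop blank lines that were intentional:

```ruby
text = "Great.\n    <EMAIL><BODY>Hi</BODY></EMAIL>\nThanks."

# Step 1: a simple pattern that only removes the element itself.
cleaned = text.gsub(%r{<EMAIL>.*?</EMAIL>}m, '')

# Step 2: tidy the leftovers in Ruby instead of teaching the pattern about
# indentation and line-ends. Trailing whitespace is trimmed and lines that
# are now empty are dropped.
tidy = cleaned.lines.map(&:rstrip).reject(&:empty?).join("\n")

p cleaned  # "Great.\n    \nThanks."
p tidy     # "Great.\nThanks."
```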

Upvotes: 2

SirDarius

Reputation: 42969

Your string spans multiple lines, but by default .* in a Ruby regexp cannot cross a line break, so with <EMAIL> and </EMAIL> on two different lines, the regexp will never match.

This is because, in default mode, the metacharacter . stands for any character except a newline.

You need to use the m (multiline) flag:

s = "Great, I will send you something at [email protected].\n    <EMAIL><ADDRESS>[email protected]</ADDRESS><SUBJECT>Quick note on [email protected]</SUBJECT>\n      <BODY>Hi, just dropping you a quick note.</BODY></EMAIL>"
s.gsub(/<EMAIL>(.*)<\/EMAIL>/m, '').strip

This returns:

"Great, I will send you something at [email protected]."
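One caveat worth illustrating: with the m flag, the greedy (.*) runs to the last </EMAIL> in the string. That is harmless with a single block, but if the input ever held two blocks, the text between them would be removed as well. A sketch with a made-up two-block string:

```ruby
s = "A <EMAIL>x\ny</EMAIL> B <EMAIL>z</EMAIL> C"

# Greedy: one match from the first <EMAIL> to the last </EMAIL>, so " B " is lost.
greedy = s.gsub(/<EMAIL>(.*)<\/EMAIL>/m, '').strip

# Non-greedy: each block is matched separately and " B " survives.
lazy = s.gsub(/<EMAIL>(.*?)<\/EMAIL>/m, '').strip

p greedy  # "A  C"
p lazy    # "A  B  C"
```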

Upvotes: 7
