Michael Durrant

Reputation: 96484

How can I remove non word characters from a text?

I want 'This Is A 101 Test' to be 'This Is A Test', but I can't get the syntax right.

src = 'This Is A 101 Test'
puts "A) " + src                       # base => "This Is A 101 Test"
puts "B) " + src[/([a-z]+)/]           # only does first word => "his"
puts "C) " + src.gsub!(/\D/, "")       # Does digits, I want alphabetic => "101"
puts "D) " + src.gsub!(/\W///g)        # Nothing. => ""
puts "E) " + src.gsub(/(\W|\d)/, "")   # Nothing. => ""

Upvotes: 15

Views: 21658

Answers (5)

brymck

Reputation: 7663

First off, you need to be careful with gsub and gsub!. The latter is "dangerous" (hence the !) and modifies src in place. If you're executing these statements in order, be aware that a.gsub!(/a/, "b") and a = a.gsub(/a/, "b") both leave a changed. Part of the issue with your code is that src is being modified along the way.
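As a minimal sketch of that difference (the variable names here are illustrative, not from the question), the following compares what each method returns and what it does to the receiver:

```ruby
original = 'This Is A 101 Test'

# gsub returns a new string and leaves the receiver untouched.
copy = original.dup
result = copy.gsub(/\d/, '')

# gsub! edits the receiver in place and returns that same object...
mutated = original.dup
returned = mutated.gsub!(/\d/, '')

# ...but returns nil when the pattern matched nothing.
no_match = mutated.gsub!(/\d/, '')

p result    # digits removed
p copy      # unchanged
p mutated   # digits removed in place
p no_match  # nil
```

The nil return is the reason chaining calls after a bang method can fail unexpectedly.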

The B method returns "his" but makes no changes to src:

src[/([a-z]+)/]     # => "his"
src                 # => "This Is A 101 Test"

The C method removes all characters that aren't numbers:

src.gsub!(/\D/, "") # => "101"
src                 # => "101"

The D method doesn't work because the syntax is wrong. The gsub method accepts a regular expression or string to search for, followed by a string to use as the replacement. If you try it in IRB, it will act as though you need another / somewhere.

The E method replaces all non-word characters and all numbers:

src.gsub(/(\W|\d)/, "") # => "This Is A  Test" (note the two spaces)
src                     # => "This Is A 101 Test"

You point out that it's returning "". What's actually happening is that C and D as listed (with syntax issues fixed) are destructive changes. (Also, if run on "101", D will actually return nil, as no substitutions were performed.) So E is just being run on "101", and since you're replacing all non-word characters and all digits with "", the result is "".
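To make that sequence concrete, here is a small trace of C, D, and E run in order. It assumes D was meant to be gsub!(/\W/, ""), which is a guess at the intended syntax:

```ruby
src = 'This Is A 101 Test'.dup

c = src.gsub!(/\D/, '')      # C: strips every non-digit, in place
d = src.gsub!(/\W/, '')      # D (assumed fix): nothing left to match, so nil
e = src.gsub(/(\W|\d)/, '')  # E: strips the remaining digits from "101"

p c    # "101"
p d    # nil
p e    # ""
p src  # "101"
```

Note that the original puts "D) " + ... line would raise a TypeError at this point, because nil cannot be concatenated to a String.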


The answer you're looking for would be something like:

src.gsub!(/\d\s?/, "") # => "This Is A Test"
src                    # => "This Is A Test"

And my favorite for dealing with all scenarios of double spaces (squeeze is quite efficient at combining repeated characters, strip is quite efficient at removing leading and trailing whitespace, and I use the non-bang versions because the ! versions return nil if they make no replacements):

src = src.gsub(/\d+/, "").squeeze(" ").strip
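A quick way to check that chain against a few edge cases is to wrap it in a method. The name strip_digits is hypothetical and used only for illustration:

```ruby
# Hypothetical helper wrapping the chain above; the name is an assumption.
def strip_digits(text)
  text.gsub(/\d+/, '').squeeze(' ').strip
end

p strip_digits('This Is A 101 Test')  # number in the middle
p strip_digits('101 Test')            # number at the start
p strip_digits('Test 101')            # number at the end
p strip_digits('Test')                # nothing to remove
```

One side effect to be aware of: squeeze(" ") also collapses runs of spaces that were already in the input, not just the ones left behind by the removed digits.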

Upvotes: 28

steenslag

Reputation: 80065

No regexp:

src = 'This Is A 101 Test'
src.delete('^a-zA-Z ') # the leading ^ negates the set => "This Is A  Test" (two spaces)
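Because delete keeps the spaces on both sides of the removed digits, the result contains a double space. A possible follow-up, still without a regexp, is to chain squeeze (a sketch, with illustrative variable names):

```ruby
src = 'This Is A 101 Test'

# Keep only letters and spaces; the spaces around "101" both survive.
letters_and_spaces = src.delete('^a-zA-Z ')

# Collapse the resulting run of spaces into one.
tidy = letters_and_spaces.squeeze(' ')

p letters_and_spaces
p tidy
```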

Upvotes: 8

Jonas Elfström

Reputation: 31428

To remove everything that isn't a letter or a space, you can turn the problem around and keep only the characters you want, then collapse the leftover spaces.

src = 'This Is A 101 Test'
src.gsub(/[^a-zA-Z ]/,'').gsub(/ +/,' ')
=> "This Is A Test"

I recommend Rubular for trying out Ruby regular expressions.

Upvotes: 8

Retief

Reputation: 3217

Do you just want to delete numbers? If so, src.gsub(/\d/,"") should work. The reason it doesn't work above is that gsub! modifies the string it is called on, so after C, src = "101" and eliminating all digits leaves an empty string.

If you want to keep only alphabetic characters and spaces (i.e., remove digits and punctuation), src.gsub(/(?=\S)(\d|\W)/,"") should work.

If you want to eliminate everything but alphabetic characters (eliminating spaces as well as digits and punctuation), src.gsub(/\d|\W/,"") should work.
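The three patterns can be compared side by side. This sketch uses an input with a trailing "!" (an addition for illustration, not part of the question) so that the punctuation case is visible:

```ruby
src = 'This Is A 101 Test!'

digits_removed = src.gsub(/\d/, '')             # digits only
spaces_kept    = src.gsub(/(?=\S)(\d|\W)/, '')  # digits and punctuation, spaces kept
letters_only   = src.gsub(/\d|\W/, '')          # digits, punctuation, and spaces

p digits_removed
p spaces_kept
p letters_only
```

Two caveats: the first two leave a double space where the number was, and \W does not match an underscore, so underscores survive all three patterns.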

Upvotes: 2

Sergio Tulentsev

Reputation: 230306

Do you want to cut ' 101' from the string? Here's your regex:

src = 'This Is A 101 Test'

puts src.gsub(/ \d+/, '') # parentheses avoid Ruby's "ambiguous first argument" warning
# => This Is A Test
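One thing worth checking with this pattern is that it requires a space before the digits. A small sketch (the extra input is illustrative) shows what that means for a number at the start of the string:

```ruby
pattern = / \d+/

middle  = 'This Is A 101 Test'.gsub(pattern, '')  # space + digits removed
leading = '101 Test'.gsub(pattern, '')            # no preceding space, so no match

p middle
p leading
```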

Also, I don't understand why you are using the bang version of gsub. gsub! modifies the original string; gsub copies it and modifies the copy.

Upvotes: 4
