Reputation: 23
I would like to match instances of a word in a string, as long as the word is not inside a URL.
An example would be to find the instances of 'hello' in the following:
hello this is a regex problem http://geocities.com/hello/index.html?hello! Hello how are you!
The simplest regex for this problem is:
/\bhello\b/i
However this returns all four instances of 'hello', including the two contained within the URL string.
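For reference, a quick Ruby check of that behaviour might look like the following. The variable names `s` and `naive_matches` are only for illustration.

```ruby
s = "hello this is a regex problem http://geocities.com/hello/index.html?hello! Hello how are you!"

# The naive pattern matches every whole-word "hello", whether or not it sits in the URL.
naive_matches = s.scan(/\bhello\b/i)
p naive_matches        # => ["hello", "hello", "hello", "Hello"]
p naive_matches.count  # => 4
```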
I have experimented with negative look-behinds for 'http' but so far nothing has worked. Any ideas?
Upvotes: 2
Views: 201
Reputation: 627607
Here are several solutions based on The Best Regex Trick Ever for 1) counting matches outside of a URL, 2) removing matches not in a URL, and 3) wrapping the matches with a tag outside of a URL:
s = "hello this is a regex problem http:"+"//geocities.com/hello/index.html?hello! Hello how are you!"
# Counting
p s.scan(/https?:\/\/\S*|(hello)/i).flatten.compact.count
## => 2
# Removing
p s.gsub(/(https?:\/\/\S*)|hello/i, '\1')
## => " this is a regex problem http://geocities.com/hello/index.html?hello!  how are you!"
# Wrapping with a tag
p s.gsub(/(https?:\/\/\S*)|(hello)/i) { $1 || "<span>#{$2}</span>" }
## => "<span>hello</span> this is a regex problem http://geocities.com/hello/index.html?hello! <span>Hello</span> how are you!"
You may wrap the hello pattern with word boundaries if you need to match a whole word: \bhello\b.
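As a rough sketch of the difference word boundaries make, the snippet below runs both variants of the counting pattern on a made-up sample string. The string and the example.com URL are assumptions for illustration, not part of the original question.

```ruby
# Hypothetical sample text: "hellos" and "Othello" contain "hello" as a substring.
sample = "hello, hellos and Othello: http://example.com/hello?hello Hello!"

# Without word boundaries, substrings inside longer words are captured too.
loose  = sample.scan(/https?:\/\/\S*|(hello)/i).flatten.compact
# With word boundaries, only standalone occurrences outside the URL are captured.
strict = sample.scan(/https?:\/\/\S*|\b(hello)\b/i).flatten.compact

p loose   # => ["hello", "hello", "hello", "Hello"]
p strict  # => ["hello", "Hello"]
```

In both cases the URL is consumed by the first alternative, so the occurrences inside it never reach the capturing group.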
See the online Ruby demo
Notes
- .scan(/https?:\/\/\S*|(hello)/i).flatten.compact.count matches a URL starting with http or https, or matches and captures hello in Group 1. .scan only returns captured substrings, but it also returns nil once the URL is matched, so .compact is required to remove nil items from the flattened array, and .count returns the number of items in the array.
- .gsub(/(https?:\/\/\S*)|hello/i, '\1') matches and captures URLs into Group 1 and just matches all hellos outside of URLs. The matches are replaced with \1, the backreference to Group 1, which is an empty string when just hello is found.
- s.gsub(/(https?:\/\/\S*)|(hello)/i) { $1 || "<span>#{$2}</span>" } matches and captures URLs into Group 1 and hellos into Group 2. If Group 1 was matched, $1 puts this value back into the string; otherwise, Group 2 is wrapped with tags and inserted back into the string.
Upvotes: 1
Reputation: 27763
Here, we can first match the URLs and then, as an alternative, capture the desired words in a capturing group, with an expression similar to:
http[^\s]+|(hello|you)
The fourth bird advises that:
I would go for the word boundaries and only hello in the group:
\bhttp\S+|\b(hello)\b
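A minimal sketch of how that pattern could be used from Ruby follows. The helper name `words_outside_urls` is hypothetical, and `Regexp.escape` is added on the assumption that the word might contain regex metacharacters.

```ruby
# Hypothetical helper: returns occurrences of `word` that are not part of an http(s) URL.
def words_outside_urls(text, word)
  # The first alternative consumes a whole URL; the second captures the standalone word.
  pattern = /\bhttp\S+|\b(#{Regexp.escape(word)})\b/i
  # URL matches yield nil for the group, so compact drops them.
  text.scan(pattern).flatten.compact
end

s = "hello this is a regex problem http://geocities.com/hello/index.html?hello! Hello how are you!"
p words_outside_urls(s, "hello")  # => ["hello", "Hello"]
```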
Upvotes: 0
Reputation: 76
If I understand correctly, you need to get the words after the URL. You can just use whitespace (\s) as the delimiter in your string:
"http://geocities.com/hello/index.html?hello! Hello how are you!".scan(/\s(\w+)/i)
=> [["Hello"], ["how"], ["are"], ["you"]]
Or
"http://geocities.com/hello/index.html?hello! Hello how are you!".scan(/\s(hello)/i)
=> [["Hello"]]
Upvotes: 0