Reputation: 23

Regex to find instances of a word where it is not in the path of a URL

I would like to match instances of a word in a string, as long as the word is not part of a URL.

For example, find the instances of 'hello' in the following:

hello this is a regex problem http://geocities.com/hello/index.html?hello! Hello how are you!

The simplest regex for this problem is:

/\bhello\b/i

However, this returns all four instances of 'hello', including the two contained within the URL.
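
To make the problem concrete, here is a minimal sketch that reproduces it in Ruby (Ruby is an assumption, chosen because the answers use it; the variable name naive_matches is only illustrative):

```ruby
s = "hello this is a regex problem http://geocities.com/hello/index.html?hello! Hello how are you!"

# The naive whole-word pattern also matches the two occurrences inside the URL.
naive_matches = s.scan(/\bhello\b/i)

p naive_matches        # => ["hello", "hello", "hello", "Hello"]
p naive_matches.count  # => 4
```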

I have experimented with negative look-behinds for 'http' but so far nothing has worked. Any ideas?

Upvotes: 2

Answers (3)

Wiktor Stribiżew

Reputation: 627607

Here are several solutions based on The Best Regex Trick Ever for 1) counting matches outside of a URL, 2) removing matches not in a URL, and 3) wrapping the matches with a tag outside of a URL:

s = "hello this is a regex problem http://geocities.com/hello/index.html?hello! Hello how are you!"

# Counting: a URL match leaves Group 1 nil, so only hellos outside URLs remain
p s.scan(/https?:\/\/\S*|(hello)/i).flatten.compact.count
# => 2

# Removing: URLs are restored from Group 1, bare hellos become empty strings
p s.gsub(/(https?:\/\/\S*)|hello/i, '\1')
# => " this is a regex problem http://geocities.com/hello/index.html?hello!  how are you!"

# Wrapping with a tag: URLs are kept as-is, hellos outside URLs are wrapped
p s.gsub(/(https?:\/\/\S*)|(hello)/i) { $1 || "<span>#{$2}</span>" }
# => "<span>hello</span> this is a regex problem http://geocities.com/hello/index.html?hello! <span>Hello</span> how are you!"

You may wrap the hello pattern in word boundaries if you need to match it only as a whole word: \bhello\b.
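
As a rough illustration of what the word boundaries change, the sketch below runs the counting pattern with \b added. The input string (with "Othello", "hellos" and example.com) is made up for this example and is not part of the original question:

```ruby
# Hypothetical input: "Othello" and "hellos" contain "hello" but are not the whole word.
text = "hello Othello http://example.com/hello hellos Hello"

# URLs are consumed by the first alternative; only whole-word hellos are captured.
pattern = /https?:\/\/\S*|\b(hello)\b/i
words = text.scan(pattern).flatten.compact

p words  # => ["hello", "Hello"]
```

Without the \b anchors, the same scan would also pick up the "hello" inside "Othello" and "hellos".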


Notes

.scan(/https?:\/\/\S*|(hello)/i).flatten.compact.count matches a URL starting with http or https, or matches and captures hello into Group 1. Because the pattern contains a capturing group, .scan returns only the captured substrings, so every URL match yields nil. .flatten and .compact remove those nil items, and .count returns the number of items left in the array.
.gsub(/(https?:\/\/\S*)|hello/i, '\1') matches and captures URLs into Group 1, and matches (without capturing) every hello outside a URL. Each match is replaced with \1, the backreference to Group 1: a URL is put back unchanged, while a bare hello is replaced with an empty string because Group 1 did not take part in the match.
s.gsub(/(https?:\/\/\S*)|(hello)/i) { $1 || "<span>#{$2}</span>" } matches and captures URLs into Group 1 and hellos into Group 2. If Group 1 matched, $1 puts the URL back into the string; otherwise, the Group 2 value is wrapped with tags and inserted back into the string.

Upvotes: 1

Emma

Reputation: 27763

Here, we can first match the URLs, alternated with the desired words in a capturing group, using an expression similar to:

http[^\s]+|(hello|you)
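
A short Ruby sketch of how this expression might be applied to the sample string from the question. The case-insensitive flag and the use of scan with flatten and compact are assumptions added here, not part of the expression above:

```ruby
s = "hello this is a regex problem http://geocities.com/hello/index.html?hello! Hello how are you!"

# URLs match the first alternative and capture nothing (nil);
# the desired words match the second alternative and are captured.
words = s.scan(/http[^\s]+|(hello|you)/i).flatten.compact

p words  # => ["hello", "Hello", "you"]
```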


Advice

The fourth bird advises that:

I would go for the word boundaries and only hello in the group: \bhttp\S+|\b(hello)\b

Upvotes: 0

Alex Dmitriev

Reputation: 76

If I understand correctly, you need to get the words after the URL. You can simply use whitespace (\s) as the delimiter in your string:

"http://geocities.com/hello/index.html?hello! Hello how are you!".scan(/\s(\w+)/i)

=> [["Hello"], ["how"], ["are"], ["you"]]

 "http://geocities.com/hello/index.html?hello! Hello how are you!".scan(/\s(hello)/i)

=> [["Hello"]]
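
One caveat worth checking: on the full string from the question, a word at the very start has no preceding whitespace, so \s(hello) skips it. The sketch below shows this, along with one possible variation using a negative lookbehind, (?<!\S). The lookbehind is a swapped-in assumption and not part of the answer above:

```ruby
s = "hello this is a regex problem http://geocities.com/hello/index.html?hello! Hello how are you!"

# Requires a whitespace character before the word, so the leading "hello" is missed.
after_space = s.scan(/\s(hello)/i)
p after_space   # => [["Hello"]]

# Assumed variation: only require that the word is NOT preceded by a non-space character.
not_after_nonspace = s.scan(/(?<!\S)(hello)/i)
p not_after_nonspace  # => [["hello"], ["Hello"]]
```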