Reputation: 3080
Since this is a regex question, it may well be a duplicate.
Consider the following strings:
test_str = [
"bla bla google.com bla bla", #0
"bla bla www.google.com bla bla", #1
"bla bla api.google.com bla bla", #2
"google.com", #3
"www.google.com", #4
"api.google.com", #5
"http://google.com", #6
"http://www.google.com", #7
"http://api.google.com", #8
"bla bla http://www.google.com bla bla", #9
"bla bla https://www.api.google.com bla bla" #10
]
I want to match google.* or www.google.* but not api.google.*. That is, in the list above, strings 2, 5, 8 and 10 should not return any match.
I have tried several regexes, but I cannot find a one-line regex that does this. Here is what I tried:
re.compile("((http[s]?://)?www\.google[a-z.]*)") # match 1,4,7,9
re.compile("((http[s]?://)?google[a-z.]*)") # match all
re.compile("((http[s]?://)?.+\.google[a-z.]*)") # match except 0,3,6
re.compile("((http[s]?://)?!.+\.google[a-z.]*)") # match nothing
Here, I am seeking a way to ignore *.google.* except www.google.* and google.*. But I got stuck while finding a way to get *.google.*.
PS: I have found an O(n**2) way to solve this with split():
r = re.compile("^((http[s]?://)?www.google[a-z.]*)|^((http[s]?://)?google[a-z.]*)")
for s in test_str:
    for seg in s.split():
        r.findall(seg)
Upvotes: 3
Views: 85
Reputation: 1061
Had my keyboard been working properly I would have answered a half hour before.
Anyway, I would recommend not overcomplicating the regex. You can let the host language manage blacklists (and even whitelists) and use the re module only as an auxiliary. Below is what I did, packed into a single script. Obviously you may need some restructuring if you have to integrate this code into a class or function:
import re


def main():
    input_urls = [
        "bla bla google.com bla bla",
        "bla bla www.google.com bla bla",
        # ...
    ]
    filtered_urls = set()
    # Raw string so that \w and \. reach the regex engine unchanged
    google_re = re.compile(r"(\w+\.)?google\.com")
    blacklist = set(["api."])  # I didn't research enough to remove the dot
    for url in input_urls:
        # Beware of the difference between match() and search()
        # See https://docs.python.org/3/library/re.html#search-vs-match
        match = google_re.search(url)
        # The second condition will not be evaluated if the first fails
        if match is not None and match.group(1) not in blacklist:
            filtered_urls.add(url)
    print("Accepted URLs:", *filtered_urls, sep="\n\t", end="\n\n")
    print("Blacklisted URLs:", *(set(input_urls).difference(filtered_urls)), sep="\n\t")


if __name__ == "__main__":
    main()
Unfortunately, with my a and h keyboard keys not working, I wasn't able to quickly find a way to remove the dot in the URL location (like in api.google, www.google, calendar.google and so on). I highly recommend doing that.
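One possible way to drop the dot, sketched under the assumption that the subdomain is a single \w+ label directly in front of google.com: move the dot outside the capturing group, so group(1) holds only the label. The names SUBDOMAIN_RE, BLACKLIST and is_accepted are made up for this illustration and are not part of the script above.

```python
import re

# The dot sits outside the capturing group, so group(1) is "api", not "api."
SUBDOMAIN_RE = re.compile(r"(?:(\w+)\.)?google\.com")
BLACKLIST = {"api"}  # hypothetical blacklist, now without the trailing dot


def is_accepted(text):
    """Return True if text mentions google.com without a blacklisted subdomain."""
    match = SUBDOMAIN_RE.search(text)
    # group(1) is None when there is no subdomain at all
    return match is not None and match.group(1) not in BLACKLIST


if __name__ == "__main__":
    for sample in ["www.google.com", "http://api.google.com", "google.com"]:
        print(sample, is_accepted(sample))
```

With the dot gone, the blacklist entries are plain labels, which makes them easier to compare against other sources such as a configuration file.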
The output displayed on my console was:
None@vacuum:~$ python3.6 ./filter.py
Accepted URLs:
http://google.com
bla bla google.com bla bla
bla bla www.google.com bla bla
http://www.google.com
google.com
www.google.com
bla bla http://www.google.com bla bla
Blacklisted URLs:
api.google.com
bla bla api.google.com bla bla
http://api.google.com
bla bla https://www.api.google.com bla bla
Upvotes: 1
Reputation: 626689
You may use
(?<!\S)(?:https?://)?(?:www\.)?google\.\S*
See the regex demo.
Details
(?<!\S) - a location preceded with a whitespace or start of a string (note that you may also use (?:^|\s) here, to be more explicit)
(?:https?://)? - an optional non-capturing group matching an optional sequence of https:// or http://
(?:www\.)? - an optional non-capturing group matching an optional sequence of www.
google\. - a google. substring
\S* - 0+ non-whitespace chars.

import re
test_str = [
"bla bla google.com bla bla", #0
"bla bla www.google.com bla bla", #1
"bla bla api.google.com bla bla", #2
"google.com", #3
"www.google.com", #4
"api.google.com", #5
"http://google.com", #6
"http://www.google.com", #7
"http://api.google.com", #8
"bla bla http://www.google.com bla bla", #9
"bla bla https://www.api.google.com bla bla", #10
"bla bla https://www.map.google.com bla bla" #11
]
r = re.compile(r"(?<!\S)(?:https?://)?(?:www\.)?google\.\S*")
for i,s in enumerate(test_str):
    m = r.search(s)
    if m:
        print("{}\t#{}".format(m.group(0), i))
Output:
google.com #0
www.google.com #1
google.com #3
www.google.com #4
http://google.com #6
http://www.google.com #7
http://www.google.com #9
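As a rough illustration of the (?:^|\s) alternative mentioned in the details: unlike the lookbehind, that group consumes the leading whitespace, so the URL needs its own capturing group to be extracted cleanly. The names lookbehind_re, alt_re and first_url, as well as the extra capturing group, are additions for this sketch only.

```python
import re

# Pattern from the answer: the lookbehind asserts but does not consume
lookbehind_re = re.compile(r"(?<!\S)(?:https?://)?(?:www\.)?google\.\S*")
# Alternative: (?:^|\s) consumes the preceding whitespace, so the URL
# itself is wrapped in a capturing group (group 1)
alt_re = re.compile(r"(?:^|\s)((?:https?://)?(?:www\.)?google\.\S*)")


def first_url(pattern, text, group):
    """Return the requested group of the first match, or None if no match."""
    m = pattern.search(text)
    return m.group(group) if m else None


if __name__ == "__main__":
    for s in ["bla bla www.google.com bla bla", "http://api.google.com"]:
        print(first_url(lookbehind_re, s, 0), first_url(alt_re, s, 1))
```

Both variants should agree on the test strings above. The lookbehind version is slightly more convenient because group(0) is already the URL.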
Upvotes: 1