Below is my regex to extract URLs:
import re
url_extractor = re.compile(r'((?:www\.|http:|https:)[^\s]+)', re.IGNORECASE)
mystring = """https://myname.abc.comsomename: """
The regex above extracts the URL together with any characters that follow .com, in this case somename: https://myname.abc.comsomename.
I want to extract only up to and including .com or .org, if present. If the URL does not contain .com or .org, I would like to extract up to the next whitespace.
So in the above example, the expected result is https://myname.abc.com.
If the URL is https://myname.abc.xyz somename, the expected result is https://myname.abc.xyz.
How do I modify my regex above?
Reputation: 626932
You may use
re.compile(r'(?:www\.|https?:)\S*?(?:\.(?:com|org)|(?=\s)|$)', re.IGNORECASE)
See the regex demo
Details
(?:www\.|https?:) - www. or http: or https:
\S*? - 0 or more non-whitespace chars, as few as possible
(?:\.(?:com|org)|(?=\s)|$) - either . and then either com or org, or a location immediately followed with a whitespace, or end of string.
import re
text = r'somename https://myname.abc.comsomename: if the URL is https://myname.abc.xyz somename..'
rx = re.compile(r'(?:www\.|https?:)\S*?(?:\.(?:com|org)|(?=\s)|$)', re.IGNORECASE)
print(rx.findall(text))
# => ['https://myname.abc.com', 'https://myname.abc.xyz']
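To check the pattern against each case from the question separately, here is a minimal sketch. The names URL_RX and extract_urls are illustrative only, and the last input is an assumed example that is not part of the question. It shows one caveat: because \S*? is lazy, the match stops at the first .com or .org it meets, even when that sequence sits in the middle of a host name.

```python
import re

# Pattern copied from the answer above; URL_RX and extract_urls are
# hypothetical names introduced only for this sketch.
URL_RX = re.compile(r'(?:www\.|https?:)\S*?(?:\.(?:com|org)|(?=\s)|$)', re.IGNORECASE)

def extract_urls(text):
    """Return all matches, each cut at the first .com/.org or at whitespace."""
    return URL_RX.findall(text)

# The two cases from the question.
print(extract_urls("https://myname.abc.comsomename: "))  # ['https://myname.abc.com']
print(extract_urls("https://myname.abc.xyz somename"))   # ['https://myname.abc.xyz']

# A www. prefix without a scheme is matched as well.
print(extract_urls("see www.example.org/path"))          # ['www.example.org']

# Caveat (assumed input): ".com" inside the host name ends the match early.
print(extract_urls("https://www.comcast.net"))           # ['https://www.com']
```

If inputs like the last one matter, the pattern would need a stricter rule for where a domain ends; that goes beyond what the question asked for.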