MLNLPEnhusiast

Reputation: 153

Extract URL until .com, .org, etc.

Below is my regex to extract URLs:

import re

url_extractor = re.compile(r'((?:www\.|http:|https:)[^\s]+)', re.IGNORECASE)
mystring = """https://myname.abc.comsomename: """

The regex above extracts the URL plus any non-whitespace characters that follow .com. In this case it returns https://myname.abc.comsomename: instead of stopping at .com.

I want to extract only up to and including .com or .org, if present. If the URL does not contain .com or .org, I would like to extract until the next whitespace.

So in the above example, the expected result is https://myname.abc.com.

If the URL is https://myname.abc.xyz somename, the expected result is https://myname.abc.xyz.

How do I modify my regex above?

Upvotes: 0

Views: 247

Answers (1)

Wiktor Stribiżew

Reputation: 626932

You may use

re.compile(r'(?:www\.|https?:)\S*?(?:\.(?:com|org)|(?=\s)|$)', re.IGNORECASE)

Details

  • (?:www\.|https?:) - www. or http: or https:
  • \S*? - 0 or more non-whitespace chars, as few as possible
  • (?:\.(?:com|org)|(?=\s)|$) - either a . followed by com or org, or a position immediately followed by whitespace, or the end of the string.
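
The breakdown above can also be written inline. The following is a minimal sketch of the same pattern in `re.VERBOSE` form, so each part carries its own comment. The name `url_rx` and the `samples` list are illustrative, and the sample strings are the two cases from the question.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part is commented.
url_rx = re.compile(
    r"""
    (?:www\.|https?:)    # start of the URL: www. or http: or https:
    \S*?                 # non-whitespace chars, as few as possible (lazy)
    (?:
        \.(?:com|org)    # stop right after the first .com or .org
      | (?=\s)           # or just before the next whitespace
      | $                # or at the end of the string
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)

# The two inputs from the question.
samples = [
    "https://myname.abc.comsomename: ",
    "https://myname.abc.xyz somename",
]
results = [url_rx.search(s).group(0) for s in samples]
print(results)
# => ['https://myname.abc.com', 'https://myname.abc.xyz']
```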

Python demo:

import re
text = r'somename https://myname.abc.comsomename: if the URL is https://myname.abc.xyz somename..'
rx = re.compile(r'(?:www\.|https?:)\S*?(?:\.(?:com|org)|(?=\s)|$)', re.IGNORECASE)
print(rx.findall(text))
# => ['https://myname.abc.com', 'https://myname.abc.xyz']
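
One behaviour is worth checking before relying on this pattern. Because `\S*?` is lazy, the match ends at the first `.com` or `.org` it meets, even when that is only the start of a longer label such as `.company`. The sketch below wraps the pattern in a hypothetical helper, `extract_urls`, and uses a made-up URL to show the effect.

```python
import re

rx = re.compile(r'(?:www\.|https?:)\S*?(?:\.(?:com|org)|(?=\s)|$)', re.IGNORECASE)

def extract_urls(text):
    """Hypothetical helper: return every match of the pattern in text."""
    return rx.findall(text)

# The two cases from the question behave as requested.
print(extract_urls("https://myname.abc.comsomename: "))
# => ['https://myname.abc.com']
print(extract_urls("https://myname.abc.xyz somename"))
# => ['https://myname.abc.xyz']

# Made-up URL: the lazy match ends at the ".com" inside ".company".
print(extract_urls("https://www.company.com/path"))
# => ['https://www.com']
```

If hosts like this can appear in the input, the pattern would need an extra condition after the `com|org` alternative. Which condition fits depends on what may legitimately follow the TLD in the data.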

Upvotes: 1
