Reputation: 67
I have the following regex:
r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
When I apply this to a text string with, let's say, "this is www.website1.com and this is website2.com", I get:
['www.website1.com']
['website2.com']
How can I modify the regex to exclude the 'www', so that I get 'website1.com' and 'website2.com'? I'm missing something pretty basic ...
Upvotes: 3
Views: 7137
Reputation: 3186
Here is an attempt:
import re
s = "www.website1.com"
# Group 1 optionally consumes a leading "www."; group 2 captures the rest of the string.
k = re.findall(r'(www\.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)
Output:
website1.com
If s = "website1.com", the output is the same:
website1.com
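If the input is always a single host name rather than free text, a simpler option might be to strip the prefix directly. The sketch below assumes exactly that; strip_www is a hypothetical helper name, not something from the code above.

```python
import re

def strip_www(host):
    # Hypothetical helper: remove one leading "www." if present,
    # and return any other host unchanged.
    return re.sub(r'^www\.', '', host)

print(strip_www("www.website1.com"))  # website1.com
print(strip_www("website1.com"))      # website1.com
```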
Upvotes: 0
Reputation: 51185
Try this one (thanks @SunDeep for the update):
\s(?:www\.)?(\w+\.com)
Explanation
\s matches any whitespace character
(?:www\.)? is a non-capturing group that matches www. zero or one time
(\w+\.com) captures one or more word characters followed by .com
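One detail worth checking when writing such a pattern is whether the dots are escaped: a bare . matches any character, while \. matches only a literal dot. A small sketch of the difference, using a made-up sample string:

```python
import re

loose = re.compile(r'\w+.com')    # "." matches any character except a newline
strict = re.compile(r'\w+\.com')  # "\." matches only a literal dot

sample = 'website1xcom'  # made-up string with no dot in it
loose_hit = loose.fullmatch(sample) is not None
strict_hit = strict.fullmatch(sample) is not None
print(loose_hit, strict_hit)  # True False
```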
And in action:
import re
s = 'this is www.website1.com and this is website2.com'
matches = re.findall(r'\s(?:www\.)?(\w+\.com)', s)
print(matches)
Output:
['website1.com', 'website2.com']
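Note that the leading \s means a domain at the very start of the string is not matched, because nothing precedes it. One possible workaround, sketched below, is to use a word boundary (\b) instead. find_domains and DOMAIN_RE are hypothetical names, and the .com-only restriction is kept from the example above.

```python
import re

# Hypothetical variant: \b replaces \s so a domain at the start of the text is also found.
DOMAIN_RE = re.compile(r'\b(?:www\.)?(\w+\.com)')

def find_domains(text):
    # findall returns only the captured group, i.e. the domain without "www."
    return DOMAIN_RE.findall(text)

print(find_domains('www.website1.com and this is website2.com'))
# ['website1.com', 'website2.com']
```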
A couple of notes about this. First, matching all valid domain names is very difficult, so while I chose \w+ for the capture in this example, I could have used something stricter, such as:
[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}
This answer has a lot of helpful info about matching domains: What is a regular expression which will match a valid domain name without a subdomain?
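As a rough sketch of how the stricter pattern could be combined with the optional www. prefix: the names LABEL, STRICT_RE and find_strict_domains are hypothetical, and the \b anchor is an assumption on my part. It only handles a single label followed by a TLD, and the label must be at least three characters long, so treat it as a starting point rather than a complete domain matcher.

```python
import re

# Assumption: the stricter label pattern, prefixed with an optional "www.".
LABEL = r'[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]'
STRICT_RE = re.compile(r'\b(?:www\.)?(' + LABEL + r'\.[a-zA-Z]{2,})')

def find_strict_domains(text):
    # Returns the captured "label.tld" part, without a leading "www."
    return STRICT_RE.findall(text)

print(find_strict_domains('this is www.website1.com and this is website2.com'))
# ['website1.com', 'website2.com']
```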
Next, I only look for .com domains. To match whichever types of domains you are looking for, you could adjust my regular expression to something like:
\s(?:www\.)?(\w+\.(com|org|net))
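One thing to watch when using an alternation like this with re.findall: if the TLD sits in its own capturing group, findall returns a tuple per match instead of a plain string. Making the inner group non-capturing avoids that. A short sketch, with a made-up sample string:

```python
import re

text = 'this is www.website1.com and this is website2.org'

# Two capturing groups: findall yields (domain, tld) tuples.
nested = re.findall(r'\s(?:www\.)?(\w+\.(com|org|net))', text)

# Inner group made non-capturing: findall yields plain strings.
flat = re.findall(r'\s(?:www\.)?(\w+\.(?:com|org|net))', text)

print(nested)  # [('website1.com', 'com'), ('website2.org', 'org')]
print(flat)    # ['website1.com', 'website2.org']
```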
Upvotes: 4