DDS

Reputation: 67

Python Regex to Extract Domain from Text

I have the following regex:

r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

When I apply this to a text string with, let's say, "this is www.website1.com and this is website2.com", I get:

['www.website1.com']

['website2.com']

How can I modify the regex to exclude the 'www', so that I get 'website1.com' and 'website2.com'? I'm missing something pretty basic ...

Upvotes: 3

Views: 7137

Answers (2)

Vikas Periyadath

Reputation: 3186

Here's a try:

import re
s = "www.website1.com"
# Group 1 is the optional "www." prefix; group 2 is the rest of the string.
k = re.findall(r'(www\.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)

Output:

website1.com

With s = "website1.com" the output is the same:

website1.com
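
The snippet above assumes the string holds nothing but the domain. For a full sentence like the one in the question, one possible approach is to keep the question's regex as it is and strip a leading www. from each match afterwards. This is only a sketch; strip_www is a hypothetical helper name, not something from the original post.

```python
import re

# The pattern from the question, unchanged.
DOMAIN_RE = re.compile(
    r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
)

def strip_www(domain):
    # Hypothetical helper: drop a single leading "www." label if present.
    return re.sub(r'^www\.', '', domain)

text = "this is www.website1.com and this is website2.com"
domains = [strip_www(d) for d in DOMAIN_RE.findall(text)]
print(domains)  # ['website1.com', 'website2.com']
```

This keeps the matching and the clean-up as two separate steps, so the original pattern does not need to change.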

Upvotes: 0

user3483203

Reputation: 51185

Try this one (thanks @SunDeep for the update):

\s(?:www\.)?(\w+\.com)

Explanation

\s matches any whitespace character

(?:www\.)? is a non-capturing group that matches a literal www. zero or one time

(\w+\.com) is a capturing group that matches one or more word characters followed by a literal .com

And in action:

import re

s = 'this is www.website1.com and this is website2.com'

matches = re.findall(r'\s(?:www\.)?(\w+\.com)', s)
print(matches)

Output:

['website1.com', 'website2.com']
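
One caveat with the leading \s: a domain at the very start of the string has no whitespace in front of it, so it is skipped. A possible workaround, shown here only as a sketch, is to anchor on a word boundary (\b) instead. The input string and the \b variant are my assumptions, not part of the pattern above.

```python
import re

# Assumed input: the first domain sits at the very start of the string.
s = 'www.website1.com and this is website2.com'

# Whitespace-anchored version: a whitespace character must precede the domain.
with_space = re.findall(r'\s(?:www\.)?(\w+\.com)', s)

# Variant (an assumption for illustration): anchor on a word boundary instead.
with_boundary = re.findall(r'\b(?:www\.)?(\w+\.com)', s)

print(with_space)     # ['website2.com']
print(with_boundary)  # ['website1.com', 'website2.com']
```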

A couple of notes about this. First, matching all valid domain names is very difficult, so while I chose \w+ to capture the name in this example, I could have used something stricter, such as: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}.

This answer has a lot of helpful info about matching domains: What is a regular expression which will match a valid domain name without a subdomain?
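
As a rough sketch of how that stricter pattern could be dropped in, the snippet below wraps it in a capturing group behind an optional www. prefix. The \b anchor and the names LABEL and pattern are assumptions added for illustration, and this still does not cover every valid domain.

```python
import re

# Stricter pattern quoted in the note above; as written it needs a name of
# at least three characters before the dot.
LABEL = r'[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}'

# Assumption: \b and the optional "www." prefix are added for this sketch.
pattern = re.compile(r'\b(?:www\.)?(' + LABEL + r')')

s = 'this is www.website1.com and this is website2.com'
print(pattern.findall(s))  # ['website1.com', 'website2.com']
```

Unlike \w+, this version accepts hyphens inside the name and any alphabetic TLD of two or more letters.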

Next, I only look for .com domains. You could adjust my regular expression to something like:

\s(?:www\.)?(\w+\.(?:com|org|net))

to match whichever types of domains you are looking for.
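
One thing to watch when listing several TLDs: if the alternation is itself a capturing group, re.findall returns a tuple per match rather than a plain string. The sketch below compares the two forms; the input string is an assumed example.

```python
import re

# Assumed example input using reserved example domains.
s = 'see www.example.org and example.net'

# A nested capturing group makes findall return one tuple per match.
nested = re.findall(r'\s(?:www\.)?(\w+\.(com|org|net))', s)

# A non-capturing (?:...) alternation keeps a flat list of strings.
flat = re.findall(r'\s(?:www\.)?(\w+\.(?:com|org|net))', s)

print(nested)  # [('example.org', 'org'), ('example.net', 'net')]
print(flat)    # ['example.org', 'example.net']
```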

Upvotes: 4
