user1050619

Reputation: 20856

Regular expression to get the .com

I have multiple links...

linkslist = [
    "https://test.com",
    "https://test1.example.com/exm/1/2/3/4",
    "https://test2.example.test.com/exm/1/2/3/4",
    "http://test3.com",
]

From this, I just need to extract the following,

https://test.com
https://test1.com
https://test2.com
http://test3.com

I have tried the following,

if re.search("http*.com", string1):
    print("found")

Upvotes: 2

Views: 77

Answers (2)

aliteralmind

Reputation: 20163

UPDATE: Fixed thanks to @Robin. It worked, but it was a little bit off from what I intended.

Assuming only http or https (and no ports), this works:

(https?://(?:\w+\.)+com)(?:/.*)?

The url is in capture group one.
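As a rough illustration, the pattern could be applied to the links from the question like this. It is only a sketch in Python 3; the names `URL_RE`, `links`, and `base_url` are made up for the example and are not part of the original answer. Note that `\w` does not match a hyphen, so a host such as `my-site.com` would not match as written.

```python
import re

# Pattern from the answer: scheme, one or more "label." parts, then "com".
URL_RE = re.compile(r"(https?://(?:\w+\.)+com)(?:/.*)?")

# Sample links taken from the question.
links = [
    "https://test.com",
    "https://test1.example.com/exm/1/2/3/4",
    "https://test2.example.test.com/exm/1/2/3/4",
    "http://test3.com",
]


def base_url(link):
    """Return capture group one, or None when the link does not match."""
    match = URL_RE.match(link)
    return match.group(1) if match else None


bases = [base_url(link) for link in links]
for base in bases:
    print(base)
```

This keeps every subdomain, so the second link yields https://test1.example.com rather than https://test1.com.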

Explanation of (?:\w+\.)+:

  • One-or-more of
    • one-or-more word-character: letter, digit, or underscore
    • followed by a literal dot.

For example, this portion matches usatoday. and entertainment.usatoday. (that is, every part of the host name that comes before the com).

To be safe you could also add start- and end-of-line anchors:

^(https?://(?:\w+\.)+com)(?:/.*)?$

To allow other top-level domains, add them as alternatives like this:

^(https?://(?:\w+\.)+(?:com|net|org|gov))(?:/.*)?$
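A small sketch of how the anchored, multi-domain version might behave in Python 3. The helper name `base_url_any_tld` and the test inputs are assumptions chosen for illustration. With the anchors in place, the whole string has to be a URL, so surrounding text or a port number causes the match to fail.

```python
import re

# Anchored pattern from the answer, allowing com, net, org, or gov.
ANCHORED_RE = re.compile(
    r"^(https?://(?:\w+\.)+(?:com|net|org|gov))(?:/.*)?$"
)


def base_url_any_tld(link):
    """Return scheme plus host for a fully matching link, else None."""
    match = ANCHORED_RE.match(link)
    return match.group(1) if match else None


print(base_url_any_tld("https://test1.example.org/exm/1/2/3/4"))
print(base_url_any_tld("see https://test.com"))      # leading text: no match
print(base_url_any_tld("https://test.com:8080/x"))   # port: no match
```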

Note that this question, and its duplicate, will also be of help: regular expression for url

Upvotes: 3

nordhagen

Reputation: 809

If you don't want to be specific about the .com part, you could use this. It matches URLs starting with http or https, and only up to the first forward slash or the end of the string/line, including any port number that might be present.

/https?:\/\/[^\/$\s]+/i

These are the results:

https://test.com -> https://test.com
https://test1.example.com/exm/1/2/3/4 -> https://test1.example.com
https://test2.example.test.com/exm/1/2/3/4 -> https://test2.example.test.com
http://test3.com -> http://test3.com
https://test.com:8080 -> https://test.com:8080
https://test1.example.com:3000/exm/1/2/3/4 -> https://test1.example.com:3000
https://test2.example.test.com:80/exm/1/2/3/4 -> https://test2.example.test.com:80
http://test3.com:8000 -> http://test3.com:8000

If you want to exclude port numbers, just add a colon to the negated character class:

/https?:\/\/[^\/$\s:]+/i
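The two patterns above are written as JavaScript-style regex literals. A possible Python 3 translation, given only as a sketch, uses `re.IGNORECASE` in place of the `/i` flag and drops the escaped slashes, which Python does not need. The names `HOST_WITH_PORT_RE`, `HOST_ONLY_RE`, and `first_match` are invented for the example. One detail worth knowing: inside a character class, `$` is a literal dollar sign, not an end-of-line anchor.

```python
import re

# Stops at the first "/", "$", or whitespace; a port, if any, is kept.
HOST_WITH_PORT_RE = re.compile(r"https?://[^/$\s]+", re.IGNORECASE)

# Same, but also stops at ":" so the port is left out.
HOST_ONLY_RE = re.compile(r"https?://[^/$\s:]+", re.IGNORECASE)


def first_match(pattern, text):
    """Return the first match of pattern in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


link = "https://test1.example.com:3000/exm/1/2/3/4"
print(first_match(HOST_WITH_PORT_RE, link))
print(first_match(HOST_ONLY_RE, link))
```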

If you do want to be specific about the .com part, just append it to the pattern:

https?:\/\/[^\/\s]+\.com

If you want only .com-domains, but would like to include port numbers, this is the way to go:

https?:\/\/[^\/\s]+\.com(:\d+)?
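For what it's worth, here is how the last pattern might be tried out in Python 3. This is a sketch only, and `COM_WITH_PORT_RE` and `com_base` are hypothetical names. The pattern has no boundary after `com`, so a host such as test.community would match up to test.com; treat the result with some care on untrusted input.

```python
import re

# .com hosts only, with an optional ":port" suffix.
COM_WITH_PORT_RE = re.compile(r"https?://[^/\s]+\.com(:\d+)?")


def com_base(text):
    """Return the first .com URL base found in text, or None."""
    match = COM_WITH_PORT_RE.search(text)
    return match.group(0) if match else None


print(com_base("https://test1.example.com:3000/exm/1/2/3/4"))
print(com_base("https://test2.example.test.com/exm/1/2/3/4"))
print(com_base("https://test.org/exm"))  # not a .com host
```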

Upvotes: 1
