Reputation: 20856
I have multiple links...
linkslist = [
    "https://test.com",
    "https://test1.example.com/exm/1/2/3/4",
    "https://test2.example.test.com/exm/1/2/3/4",
    "http://test3.com",
]
From this, I just need to extract the following,
https://test.com
https://test1.com
https://test2.com
http://test3.com
I have tried the following,
if re.search("http*.com", string1):
    print("found")
Upvotes: 2
Views: 77
Reputation: 20163
UPDATE: Fixed thanks to @Robin. It worked, but it was a little bit off from what I intended.
Assuming only http or https (and no ports), this works:
(https?://(?:\w+\.)+com)(?:/.*)?
The url is in capture group one.
Explanation of (?:\w+\.)+: this non-capturing group matches every dot-terminated label that precedes the domain suffix (.com). For example, it matches usatoday. in usatoday.com and entertainment.usatoday. in entertainment.usatoday.com.
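As a rough sketch of how the pattern above might be applied in Python (the language used in the question), the snippet below runs it over the question's list and reads capture group one. The helper name extract_base is made up for illustration. Note that \w does not match hyphens, so a host such as my-site.com would not match.

```python
import re

# Pattern from the answer above; group 1 holds the scheme plus the host ending in .com
pattern = re.compile(r"(https?://(?:\w+\.)+com)(?:/.*)?")

linkslist = [
    "https://test.com",
    "https://test1.example.com/exm/1/2/3/4",
    "https://test2.example.test.com/exm/1/2/3/4",
    "http://test3.com",
]

def extract_base(url):
    """Return capture group one, or None if the URL does not match (hypothetical helper)."""
    m = pattern.match(url)
    return m.group(1) if m else None

bases = [extract_base(u) for u in linkslist]
for base in bases:
    print(base)
```

This keeps the full host, including subdomains, and drops only the path.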
To be safe you could also add start- and end-of-line anchors:
^(https?://(?:\w+\.)+com)(?:/.*)?$
To add the possibility of different domains, add them like this:
^(https?://(?:\w+\.)+(?:com|net|org|gov))(?:/.*)?$
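A minimal sketch of the anchored, multi-suffix pattern in Python follows. The function name base_url and the sample inputs are assumptions for illustration. Because of the anchors, a line with anything before the URL, or with a port after the host, is rejected rather than partially matched.

```python
import re

# Anchored variant from above: the whole line must be a URL
anchored = re.compile(r"^(https?://(?:\w+\.)+(?:com|net|org|gov))(?:/.*)?$")

def base_url(line):
    """Return scheme + host for a line that is exactly one URL, else None (hypothetical helper)."""
    m = anchored.search(line.strip())
    return m.group(1) if m else None

print(base_url("https://entertainment.usatoday.com/story/1"))
print(base_url("http://example.org"))
print(base_url("see https://test.com"))     # leading text, so no match
print(base_url("https://test.com:8080/x"))  # ports are not handled by this pattern
```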
Note that this question, and its duplicate, will also be of help: regular expression for url
Upvotes: 3
Reputation: 809
If you don't want to be specific about the .com part, you could use this. It matches URLs starting with http or https and stops at the first forward slash, whitespace character, or the end of the string/line, so any port number that is present is included.
/https?:\/\/[^\/$\s]+/i
These are the results:
https://test.com -> https://test.com
https://test1.example.com/exm/1/2/3/4 -> https://test1.example.com
https://test2.example.test.com/exm/1/2/3/4 -> https://test2.example.test.com
http://test3.com -> http://test3.com
https://test.com:8080 -> https://test.com:8080
https://test1.example.com:3000/exm/1/2/3/4 -> https://test1.example.com:3000
https://test2.example.test.com:80/exm/1/2/3/4 -> https://test2.example.test.com:80
http://test3.com:8000 -> http://test3.com:8000
If you want to exclude port numbers, just add a colon to the negated character class:
/https?:\/\/[^\/$\s:]+/i
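The two patterns above are written in /.../i literal syntax, which Python does not have. A possible translation, assuming re.IGNORECASE stands in for the i flag, is sketched below. The function names with_port and without_port are invented for this example.

```python
import re

# Inside a character class, $ is a literal dollar sign, not an end-of-line anchor.
# Slashes need no escaping in a Python raw string.
WITH_PORT = re.compile(r"https?://[^/$\s]+", re.IGNORECASE)
WITHOUT_PORT = re.compile(r"https?://[^/$\s:]+", re.IGNORECASE)

def with_port(url):
    """Scheme + host, keeping any port (hypothetical helper)."""
    m = WITH_PORT.search(url)
    return m.group(0) if m else None

def without_port(url):
    """Scheme + host, stopping before a port (hypothetical helper)."""
    m = WITHOUT_PORT.search(url)
    return m.group(0) if m else None

for url in ["https://test1.example.com:3000/exm/1/2/3/4", "http://test3.com"]:
    print(url, "->", with_port(url), "|", without_port(url))
```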
If you do want to be specific about the .com part, append it to the pattern:
https?:\/\/[^\/\s]+\.com
If you want only .com-domains, but would like to include port numbers, this is the way to go:
https?:\/\/[^\/\s]+\.com(:\d+)?
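For comparison, here is a short Python sketch of the two .com-specific patterns. The helper names are assumptions. One caveat worth knowing: neither pattern is anchored after .com, so a host that merely contains ".com" (for example test.community.org) is cut off at that point.

```python
import re

COM_ONLY = re.compile(r"https?://[^/\s]+\.com")
COM_WITH_PORT = re.compile(r"https?://[^/\s]+\.com(:\d+)?")

def com_only(url):
    """Scheme + host up to and including .com (hypothetical helper)."""
    m = COM_ONLY.search(url)
    return m.group(0) if m else None

def com_with_port(url):
    """Same as com_only, plus an optional :port suffix (hypothetical helper)."""
    m = COM_WITH_PORT.search(url)
    return m.group(0) if m else None

url = "https://test1.example.com:3000/exm/1/2/3/4"
print(com_only(url))
print(com_with_port(url))
```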
Upvotes: 1