Reputation: 103
I have a string containing URLs:
string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"
I want to extract all of them to have a result like this:
['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=','https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D','http%253A%252F%252Fwww.link-three.mu%252F']
I am trying:
import re
urls = [x for x in re.split('(http[s]?)', string) if x]
print(urls)
And the result is:
['https', '://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=', 'https', '://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 'http', '%253A%252F%252Fwww.link-three.mu%252F']
How can I keep each complete URL together, given that it can start with 'http' or 'https'?
Any ideas please?
Upvotes: 0
Views: 319
Reputation: 54
Without using re, you can handle this problem as follows:
['http' + x for x in filter(lambda x: x, string.split('http'))]
The result will be:
['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=',
'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D',
'http%253A%252F%252Fwww.link-three.mu%252F']
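For reference, a self-contained version of the same idea, written as a minimal sketch. It assumes the string starts with 'http'; any leading text that is not a URL would wrongly get 'http' prepended.

```python
string = (
    "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl="
    "https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D"
    "http%253A%252F%252Fwww.link-three.mu%252F"
)

# Split on the literal 'http', drop the empty piece before the first URL,
# then put the 'http' prefix back on every remaining piece.
urls = ['http' + part for part in string.split('http') if part]
print(urls)
```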
Upvotes: 2
Reputation: 140168
You could take your result and join each pair of consecutive elements, which would work:
urls = [urls[i]+urls[i+1] for i in range(0,len(urls),2)]
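Put together with the split from the question, the pairing step might look like the sketch below. The name parts is only illustrative, and the pairing assumes the string begins with a scheme, so that the filtered list alternates scheme, rest, scheme, rest.

```python
import re

string = (
    "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl="
    "https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D"
    "http%253A%252F%252Fwww.link-three.mu%252F"
)

# The capturing group keeps the matched scheme in the result list;
# the filter drops the empty string before the first match.
parts = [x for x in re.split('(https?)', string) if x]

# Glue each scheme back onto the text that follows it.
urls = [parts[i] + parts[i + 1] for i in range(0, len(parts), 2)]
print(urls)
```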
But it is better to use findall with a lookahead on https? or end of string:
import re
string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"
print(re.findall("https?.*?(?=https?|$)",string))
result:
['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=',
'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D',
'http%253A%252F%252Fwww.link-three.mu%252F']
As noted in the comments, since you cannot add : to the delimiter (the colon of the last URL is percent-encoded), you have no way of being sure where one URL ends and the next begins: if a URL contains http inside the address, you're toast.
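To illustrate that caveat, here is a small sketch with a made-up input (the example.com and example.org URLs are hypothetical). It also tries one possible mitigation, which is an assumption rather than a general fix: treat http or https as a delimiter only when it is followed by a literal colon or by an encoded colon written exactly as %3A or %253A.

```python
import re

# Hypothetical input: the first URL's path contains the substring "http".
tricky = "https://example.com/httpdocs/index.htmlhttps://example.org/"

# The plain lookahead pattern cuts the first URL in two at "httpdocs".
naive = re.findall(r"https?.*?(?=https?|$)", tricky)
print(naive)

# Assumption: a real scheme is always followed by ":" or by "%3A" / "%253A".
SCHEME = r"https?(?=:|%3A|%253A)"
stricter = re.findall(SCHEME + r".*?(?=" + SCHEME + r"|$)", tricky)
print(stricter)
```

This only narrows the problem: other encodings of the colon (lowercase %3a, deeper nesting) are not covered, and a URL that legitimately contains something like http: in its query string would still be split.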
Upvotes: 1