Adrian

Reputation: 103

Split a string but keep the delimiter in the same resulting substring in Python

I have a string containing URLs:

string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"

I want to extract all of them to have a result like this:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=','https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D','http%253A%252F%252Fwww.link-three.mu%252F']

I am trying:

import re
urls = [x for x in re.split('(http[s]?)', string) if x]
print(urls)

And the result is:

['https', '://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=', 'https', '://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 'http', '%253A%252F%252Fwww.link-three.mu%252F']

How can I keep each complete URL together, given that it can start with 'http' or 'https'?

Any ideas please?

Upvotes: 0

Views: 319

Answers (2)

Emre Külah

Reputation: 54

Without using re, you can handle this problem as follows:

['http' + x for x in string.split('http') if x]

The result will be:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=',
 'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D',
 'http%253A%252F%252Fwww.link-three.mu%252F']
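One thing to watch with the split-and-prepend idea: if the string does not start with 'http', the text before the first occurrence would also get 'http' glued onto it. A minimal sketch that keeps such leading text unchanged follows; the function name `split_keep_marker` and the sample input are made up for illustration.

```python
def split_keep_marker(text, marker="http"):
    """Split text before each occurrence of marker, keeping the marker."""
    head, *rest = text.split(marker)
    # Every piece after the first was preceded by the marker, so restore it.
    pieces = [marker + part for part in rest]
    # Text before the first marker (if any) is returned unchanged.
    return ([head] if head else []) + pieces


# Hypothetical input with some text in front of the first URL.
print(split_keep_marker("see https://a.example/?next=http%3A%2F%2Fb.example"))
# ['see ', 'https://a.example/?next=', 'http%3A%2F%2Fb.example']
```

Unlike the one-liner above, this version does not drop empty pieces, so two markers in a row come back as two separate 'http' items.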

Upvotes: 2

Jean-François Fabre

Reputation: 140168

You could take your result and join each pair of consecutive items; that would work:

urls = [urls[i]+urls[i+1] for i in range(0,len(urls),2)]
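Put together as a runnable Python 3 sketch, the pairing approach could look like this. It assumes the string starts with 'http' or 'https', so that after dropping empty strings the items alternate scheme, rest, scheme, rest; with leading text the pairs would be misaligned. The name `parts` is only used here for the intermediate list.

```python
import re

string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"

# The capturing group makes re.split keep the delimiters in the result.
parts = [x for x in re.split('(https?)', string) if x]

# Glue each scheme (even index) to the text that follows it (odd index).
urls = [parts[i] + parts[i + 1] for i in range(0, len(parts), 2)]
print(urls)
```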

But it is better to use findall with a lookahead on https? or the end of the string:

import re

string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"

print(re.findall("https?.*?(?=https?|$)",string))

result:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=',
 'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 
 'http%253A%252F%252Fwww.link-three.mu%252F']
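If you would rather keep a split-based solution, a possible variant is to split on the lookahead itself. This is only a sketch and it assumes Python 3.7 or later, where re.split accepts patterns that can match an empty string; on older versions such a split does nothing useful.

```python
import re

string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"

# Split at every position that is followed by http/https. The lookahead
# consumes nothing, so the delimiter stays at the start of each piece.
# The "if x" drops the empty string produced by the match at position 0.
urls = [x for x in re.split(r'(?=https?)', string) if x]
print(urls)
```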

As noted in the comments, since you cannot add : to the delimiter, you have no way of being sure where one URL ends and the next begins (if a URL contains http inside the address, you're toast).
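To make that caveat concrete, here is a small demonstration with a made-up URL whose path happens to contain "http". The same findall pattern cuts it in two:

```python
import re

# Hypothetical single URL with "http" inside its path.
sample = "https://example.com/httpbin/get"

pieces = re.findall("https?.*?(?=https?|$)", sample)
print(pieces)
# ['https://example.com/', 'httpbin/get']
```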

Upvotes: 1
