Parse out a URL with regex operation in python

Question

I have data as follows,

data

url
http://hostname.com/part1/part2/part3/a+b+c+d
http://m.hostname.com/part3.html?nk!e+f+g+h&_junk
http://hostname.com/as/ck$st=f+g+h+k+i/
http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa

I want to find the first + symbol in the URL, move backward until I hit a special character (such as /, ?, = or any other special character), and then read forward from there until I hit a space, the end of the line, & or /. My output should be:

parsed
abcd
efgh
fghki
qwert

My aim is to find the first + in the URL, go back until I find a special character, and go forward until I find the end of the line, a space, or an & symbol.

I am new to regex and still learning it, and since this is a bit complex, I am finding it difficult to write. Can anybody help me write a regex in Python to parse these out?

Thanks

alecxe · Accepted Answer

Here is the expression that works for your sample use cases:

>>> import re
>>>
>>> l = [
...     "http://hostname.com/part1/part2/part3/a+b+c+d",
...     "http://m.hostname.com/part3.html?nk!e+f+g+h&_junk",
...     "http://hostname.com/as/ck$st=f+g+h+k+i/",
...     "http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa"
... ]
>>>
>>> pattern = re.compile(r"[^\w\+]([\w\+]+\+[\w\+]+)(?:[^\w\+]|$)")
>>> for item in l:
...     print("".join(pattern.search(item).group(1).split("+")))
... 
abcd
efgh
fghki
qwert

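The idea is to capture a run of word characters (letters, digits, underscore) and plus signs that contains at least one plus, is preceded by a character that is neither a word character nor a plus, and is followed by such a character or by the end of the string. Then split the captured text on the plus signs and join the pieces.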
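To make the capture-then-join steps reusable, here is a minimal sketch that wraps the same pattern in a helper. The name `extract_plus_terms` is my own, not part of any library. It also guards against URLs with no plus-separated run: `pattern.search` returns `None` in that case, and calling `.group(1)` on it would raise `AttributeError`.

```python
import re

# Same expression as above: a run of word characters and plus signs
# containing at least one "+", bounded by a non-word, non-plus
# character on the left and by such a character or end of string on the right.
PATTERN = re.compile(r"[^\w\+]([\w\+]+\+[\w\+]+)(?:[^\w\+]|$)")


def extract_plus_terms(url):
    """Return the first plus-separated run with the pluses removed, or None."""
    match = PATTERN.search(url)
    if match is None:
        # No plus-separated run in this URL.
        return None
    # Removing the pluses is equivalent to split("+") followed by "".join(...).
    return match.group(1).replace("+", "")
```

Calling `extract_plus_terms(item)` for each item in `l` gives the same four results as the loop above.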

Regex101 link.

I have a feeling that it can be further simplified/improved.
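One possible simplification, offered as a sketch rather than a drop-in replacement: since the wanted text is just word characters joined by single plus signs, the boundary characters do not need to be matched explicitly. The helper name `first_plus_run` is hypothetical. This variant agrees with the expression above on the four sample URLs, but it is stricter on other inputs: for example, it will not match a run with two consecutive pluses such as `a++b`, which the original pattern accepts.

```python
import re

# One or more word characters, followed by one or more "+word" groups.
# search() scans left to right, so the first such run in the URL wins.
SIMPLE_PATTERN = re.compile(r"\w+(?:\+\w+)+")


def first_plus_run(url):
    """Return the first run like 'a+b+c' with the pluses removed, or None."""
    match = SIMPLE_PATTERN.search(url)
    return match.group(0).replace("+", "") if match else None
```

Because the match cannot start or end on a plus sign, a run with a leading or trailing `+` is trimmed to its inner part instead of being captured whole.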
