Reputation: 651
I have data as follows,
data
url
http://hostname.com/part1/part2/part3/a+b+c+d
http://m.hostname.com/part3.html?nk!e+f+g+h&_junk
http://hostname.com/as/ck$st=f+g+h+k+i/
http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa
I want to find the first + symbol in the URL, move backward until I hit a special character such as /, ? or = (or any other special character), and then read forward from there until I hit a space, the end of the line, & or /. My output should be:
parsed
abcd
efgh
fghki
qwert
My aim is to find the first + in the URL, go back until we find a special character, and go forward until we find the end of the line, a space, or an & symbol.
I am new to regex and still learning it, and since this is a bit complex I am finding it difficult to write. Can anybody help me write a regex in Python to parse these out?
Thanks
Upvotes: 3
Views: 51
Reputation: 474061
Here is the expression that works for your sample use cases:
>>> import re
>>>
>>> l = [
... "http://hostname.com/part1/part2/part3/a+b+c+d",
... "http://m.hostname.com/part3.html?nk!e+f+g+h&_junk",
... "http://hostname.com/as/ck$st=f+g+h+k+i/",
... "http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa"
... ]
>>>
>>> pattern = re.compile(r"[^\w\+]([\w\+]+\+[\w\+]+)(?:[^\w\+]|$)")
>>> for item in l:
... print("".join(pattern.search(item).group(1).split("+")))
...
abcd
efgh
fghki
qwert
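The idea is to capture a run of word characters (\w) and plus signs that contains at least one plus, is preceded by a character that is neither a word character nor a plus, and is followed by such a character or the end of the string. Then split the captured group on plus and join the pieces.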
I have a feeling that it can be further simplified/improved.
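One possible simplification, offered as a sketch rather than a definitive answer: search directly for the first run of word characters joined by single + signs, then drop the + signs. The names PLUS_RUN and parse_plus_run are made up for illustration, and the sketch assumes that every token between plus signs consists only of \w characters and that there are no leading, trailing or doubled + signs.

```python
import re

# Assumption: a token is a run of word characters (\w), and tokens are
# joined by single '+' signs. PLUS_RUN and parse_plus_run are
# illustrative names, not part of the original answer.
PLUS_RUN = re.compile(r"\w+(?:\+\w+)+")


def parse_plus_run(url):
    """Return the first '+'-joined run in url with the '+' removed, or None."""
    match = PLUS_RUN.search(url)
    if match is None:
        # No '+'-joined run was found in this string.
        return None
    return match.group(0).replace("+", "")


urls = [
    "http://hostname.com/part1/part2/part3/a+b+c+d",
    "http://m.hostname.com/part3.html?nk!e+f+g+h&_junk",
    "http://hostname.com/as/ck$st=f+g+h+k+i/",
    "http://www.hostname.com/p-l-k?wod=q+w+e+r+t africa",
]

for url in urls:
    print(parse_plus_run(url))
```

Unlike calling .group(1) directly on the search result, this returns None for a URL that contains no such run instead of raising AttributeError.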
Upvotes: 1
Reputation: 1413
A regex that extracts the characters you want is ((.\+)+.)
I am using JavaScript regex here, but you should be able to use it in Python as well.
This regex extracts a+b+c+d from your first URL.
It needs a little more processing to get abcd from a+b+c+d.
I will update this with a Python function in a bit.
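In the meantime, a rough Python port of the pattern above might look like the following. This is a sketch, not the promised update: the names SINGLE_CHAR_RUN and extract_single_char_run are made up for illustration. Because . matches exactly one character, it only behaves as intended when every token between the + signs is a single character, as in the sample URLs.

```python
import re

# Same pattern as above, compiled with Python's re module.
# Assumption: every token between '+' signs is exactly one character.
SINGLE_CHAR_RUN = re.compile(r"((.\+)+.)")


def extract_single_char_run(url):
    """Return the first match with the '+' signs removed, or None."""
    match = SINGLE_CHAR_RUN.search(url)
    if match is None:
        return None
    # group(1) is the whole run, e.g. "a+b+c+d"; strip the '+' signs.
    return match.group(1).replace("+", "")


print(extract_single_char_run("http://hostname.com/part1/part2/part3/a+b+c+d"))
```

For multi-character tokens the pattern keeps only the characters adjacent to each +, so a different pattern would be needed for such input.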
Upvotes: 1