Parse out part of URL using regex in Python

Question

I want to extract part of a URL using a regex. This might be an old question, but I am new to regex and, despite a lot of searching, could not find anything that fits my requirement. I know ParseURL can be used here, but my URLs are not structured well enough for that. Suppose my URL is as follows:

url = 'https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed'

Here I want to find where &q= occurs and extract everything up to the next &. I also want to remove + or any other special characters in between. The output should be:

To Be Parsed out

Also, if there is no match, the original URL should be returned.

I have tried the following:

re.search('q=?([^&]+)&',url).group(0)

This returns:

&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed

Can anybody help me parse this out? Thanks.

heemayl · Accepted Answer

You can use re.search() to get the desired substring and then replace every + with a space using str.replace():

re.search(r'/&q=([^&]*)', url).group(1).replace('+', ' ')

re.search(r'/&q=([^&]*)', url).group(1) returns the captured portion (everything after /&q= up to the next &), and replace('+', ' ') does the replacements.

Example:

In [56]: url
Out[56]: 'https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed'

In [57]: re.search(r'/&q=([^&]*)', url).group(1).replace('+', ' ')
Out[57]: 'To Be Parsed out'
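
To make the role of group(1) concrete, here is a small sketch using the same URL and pattern. It only illustrates the difference between group(0), the whole match, and group(1), the parenthesised capture; the variable names whole and captured are just for illustration.

```python
import re

url = 'https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed'
match = re.search(r'/&q=([^&]*)', url)

whole = match.group(0)     # entire matched text, including the '/&q=' prefix
captured = match.group(1)  # only the text inside the parentheses

print(whole)     # /&q=To+Be+Parsed+out
print(captured)  # To+Be+Parsed+out
```

This is why the attempt in the question, which calls group(0), cannot return just the value: group(0) always includes the literal text the pattern matched around the capture.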

If there is no match, re.search() returns None, and calling .group() on None raises AttributeError. Catch that exception, e.g.:

try:
    out = re.search(r'/&q=([^&]*)', url).group(1).replace('+', ' ')
except AttributeError:
    # No match: fall back to the original URL, as the question asks
    out = url
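
As an alternative to catching the exception, the same logic can be wrapped in a small helper that checks the match explicitly. This is only a sketch: the function name extract_q is hypothetical, and the use of urllib.parse.unquote_plus assumes that "special characters" in the question means URL-encoded characters (percent-escapes such as %2B) in addition to +.

```python
import re
from urllib.parse import unquote_plus


def extract_q(url):
    """Return the decoded value after '/&q=', or the original URL if absent.

    extract_q is a hypothetical helper name, not part of any library.
    """
    match = re.search(r'/&q=([^&]*)', url)
    if match is None:
        # No '/&q=' in the URL: return it unchanged
        return url
    # unquote_plus turns '+' into spaces and decodes %XX escapes
    return unquote_plus(match.group(1))


print(extract_q('https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed'))
print(extract_q('https://www.sitename.com/'))
```

The first call prints To Be Parsed out and the second prints the URL unchanged. If only + needs handling, .replace('+', ' ') as in the answer above is enough.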
