Reputation: 20647
I am working with a huge list of URLs. Just a quick question: I am trying to slice a part of the URL out, see below:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3
How could I slice out:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234
Sometimes there are more than two parameters after the CONTENT_ITEM_ID, and the ID is different each time. I am thinking it can be done by finding the first & and then slicing off the characters before that &, but I'm not quite sure how to do this.
Cheers
Upvotes: 8
Views: 8222
Reputation: 96081
Use the urlparse module. Check this function:
import urlparse

def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    # Split the URL into (scheme, netloc, path, query, fragment)
    parsed = urlparse.urlsplit(url)
    # Keep only the query items whose names we want to preserve
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    # Reassemble the URL with the filtered query string
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])
In your example:
>>> process_url('http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3')
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This function has the added bonus that it's easier to use if you decide that you also want some more query parameters, or if the order of the parameters is not fixed, as in:
>>> url = 'http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'
>>> process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'
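If you are on Python 3, the same approach should carry over with urllib.parse; a minimal sketch, assuming only the import locations change:
from urllib.parse import urlsplit, urlunsplit

def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed = urlsplit(url)
    # same filtering idea as above, just with the Python 3 module names
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])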
Upvotes: 14
Reputation: 1944
An ancient question, but still, I'd like to remark that query string parameters can also be separated by ';', not only '&'.
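For illustration, a rough sketch that tolerates either separator (the URL here is made up, just to show the idea):
import re

url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234;param2&param3"
base, query = url.split("?", 1)
# accept ';' as well as '&' between query items
wanted = [item for item in re.split("[;&]", query) if item.startswith("CONTENT_ITEM_ID=")]
print base + "?" + wanted[0]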
Upvotes: 0
Reputation: 597501
Parsing URLs is never as simple as it seems to be; that's why there are the urlparse and urllib modules.
E.g.:
import urllib
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m2¶m3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print result
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This is still not 100% reliable, but much more so than splitting it yourself, because there are a lot of valid URL formats that you and I don't know about and only discover one day in the error logs.
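For instance (a made-up URL, not from the question), a path may legally contain an unencoded '&'; a plain split on '&' trips over it, while going through the '?' first keeps the path intact:
import urllib

tricky = "http://www.domainname.com/a&b/page?CONTENT_ITEM_ID=1234&param2"
print tricky.split("&")[0]  # -> 'http://www.domainname.com/a', the path is cut short
base, query = urllib.splitquery(tricky)
print "?".join((base, query.split("&")[0]))  # the path survives, only the extra params go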
Upvotes: 1
Reputation: 27434
This method isn't dependent on the position of the parameter within the url string. This could be refined, I'm sure, but it gets the point across.
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = url.split('?')
# skip bare items like 'param2' that have no '=', since they would break dict()
params = dict(i.split('=') for i in parts[1].split('&') if '=' in i)
new_url = parts[0] + '?CONTENT_ITEM_ID=' + params['CONTENT_ITEM_ID']
Upvotes: 0
Reputation: 392060
Look at the urllib2 file name question for some discussion of this topic.
Also see the "Python Find Question" question.
Upvotes: 0
Reputation: 6921
Another option would be to use the split function, with & as the separator. That way, you'd extract both the base URL and the parameters.
url.split("&")
returns a list with
['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']
Upvotes: 3
Reputation: 45171
The quick and dirty solution is this:
>>> "http://something.com/page?CONTENT_ITEM_ID=1234¶m3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'
Upvotes: 4
Reputation: 60664
import re

url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
# non-greedy match: capture everything up to the first '&'
m = re.search('(.*?)&', url)
print m.group(1)
Upvotes: 0
Reputation: 20647
I figured it out; below is what I needed to do:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m2¶m3"
url = url[: url.find("&")]
print url
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
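One caveat worth noting: str.find returns -1 when there is no '&' at all, which would silently drop the last character, so a small guard may be safer:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234"
pos = url.find("&")
if pos != -1:
    url = url[:pos]
print url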
Upvotes: 1