Reputation: 20647
I am working with a huge list of URLs. Just a quick question: I am trying to slice a part of the URL out, see below:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3
How could I slice out:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234
Sometimes there are more than two parameters after the CONTENT_ITEM_ID, and the ID is different each time. I am thinking it can be done by finding the first & and then slicing off the characters before that &, but I'm not quite sure how to do this.
Cheers
Upvotes: 8
Views: 8222
Reputation: 96081
Use the urlparse module. Check this function:
import urlparse

def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    # Split the URL into (scheme, netloc, path, query, fragment)
    parsed = urlparse.urlsplit(url)
    # Keep only the query items whose names we want to preserve
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    # Reassemble the URL with the filtered query string
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])
In your example:
>>> process_url('http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3')
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This function has the added bonus that it's easier to use if you decide that you also want some more query parameters, or if the order of the parameters is not fixed, as in:
>>> url = 'http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'
>>> process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'
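If you are on Python 3, the same approach should carry over with urllib.parse; a minimal sketch, assuming only the import locations change:
from urllib.parse import urlsplit, urlunsplit

def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed = urlsplit(url)
    # same filtering idea as above, just with the Python 3 module names
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])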
Upvotes: 14
Reputation: 1944
An ancient question, but still, I'd like to remark that query string parameters can also be separated by ';', not only '&'.
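For illustration, a rough sketch that tolerates either separator (the URL here is made up, just to show the idea):
import re

url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234;param2&param3"
base, query = url.split("?", 1)
# accept ';' as well as '&' between query items
wanted = [item for item in re.split("[;&]", query) if item.startswith("CONTENT_ITEM_ID=")]
print base + "?" + wanted[0]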
Upvotes: 0
Reputation: 597501
Parsing URLs is never as simple as it seems to be; that's why there are the urlparse and urllib modules.
E.g.:
import urllib
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m2¶m3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print result
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This is still not 100% reliable, but much more so than splitting it yourself, because there are a lot of valid URL formats that you and I don't know about and only discover one day in the error logs.
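For instance (a made-up URL, not from the question), a path may legally contain an unencoded '&'; a plain split on '&' trips over it, while going through the '?' first keeps the path intact:
import urllib

tricky = "http://www.domainname.com/a&b/page?CONTENT_ITEM_ID=1234&param2"
print tricky.split("&")[0]  # -> 'http://www.domainname.com/a', the path is cut short
base, query = urllib.splitquery(tricky)
print "?".join((base, query.split("&")[0]))  # the path survives, only the extra params go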
Upvotes: 1
Reputation: 27434
This method isn't dependent on the position of the parameter within the url string. This could be refined, I'm sure, but it gets the point across.
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = url.split('?')
# skip bare items like 'param2' that have no '=', since they would break dict()
params = dict(i.split('=') for i in parts[1].split('&') if '=' in i)
new_url = parts[0] + '?CONTENT_ITEM_ID=' + params['CONTENT_ITEM_ID']
Upvotes: 0
Reputation: 392060
Look at the urllib2 file name question for some discussion of this topic.
Also see the "Python Find Question" question.
Upvotes: 0
Reputation: 6921
Another option would be to use the split function, with & as the separator. That way, you'd extract both the base URL and the parameters.
url.split("&")
returns a list with
['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']
Upvotes: 3
Reputation: 45171
The quick and dirty solution is this:
>>> "http://something.com/page?CONTENT_ITEM_ID=1234¶m3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'
Upvotes: 4
Reputation: 60664
import re

url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
# non-greedy match: capture everything up to the first '&'
m = re.search('(.*?)&', url)
print m.group(1)
Upvotes: 0
Reputation: 20647
I figured it out; below is what I needed to do:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234¶m2¶m3"
url = url[: url.find("&")]
print url
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
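One caveat worth noting: str.find returns -1 when there is no '&' at all, which would silently drop the last character, so a small guard may be safer:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234"
pos = url.find("&")
if pos != -1:
    url = url[:pos]
print url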
Upvotes: 1