Reputation: 111
I've been trying to remove all utm_* parameters from a list of URLs. The closest thing I have found is this: https://gist.github.com/626834.
Any ideas?
Upvotes: 8
Views: 4334
Reputation: 1845
The popular answer that uses the url modules modifies and rearranges the parameters, which can break poorly designed sites (see the comment), so I settled on a regex. The regexes I tried had problems too. My final result is this:
import re

def removeURLTracking(url):
    # Drop each utm_* parameter (and the '&' after it) that directly follows '?' or '&';
    # the value stops at '#' so the fragment survives.
    url = re.sub(r'(?<=[?&])utm_[^&#]+&?', '', url)
    # Tidy up a dangling '&' before the fragment and a trailing '?' or '&'.
    url = url.replace('&#', '#').rstrip('?&')
    return url

tests = ["http://localhost/index.php?a=1&utm_source=1&b=2",
         "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
         "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
         "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
         "http://localhost/index.php?utm_a=a",
         "http://localhost/index.php?a=utm_a",
         "http://localhost/index.php?a=1&b=2",
         "http://localhost/index.php",
         "http://localhost/index.php#hash2"
         ]

for t in tests:
    print(removeURLTracking(t))
"""
http://localhost/index.php?a=1&b=2
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php
http://localhost/index.php?a=utm_a
http://localhost/index.php?a=1&b=2
http://localhost/index.php
http://localhost/index.php#hash2
"""
Upvotes: 0
Reputation: 143
Using regex
import re

def clean_url(url):
    return re.sub(r'(?<=[?&])utm_[^&]+&?', '', url)
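What's going on? The regular expression finds every substring that looks like utm_somekey=somevalue and is preceded by either "?" or "&", together with the "&" that follows it (if any), and removes it. As the output below shows, it can leave a trailing "?" or "&" behind, and a utm_ parameter at the very end of the query takes the #fragment with it.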
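To make the lookbehind concrete, here is a small sketch using the same pattern as clean_url. The URL is a made-up example: it contains "utm_a" as a value (preceded by "=") and "utm_source" as a key (preceded by "&"), and only the latter should match.

```python
import re

# Same pattern as clean_url above; the URL is a made-up example.
pattern = re.compile(r'(?<=[?&])utm_[^&]+&?')
url = 'http://localhost/index.php?a=utm_a&utm_source=1&b=2'

# Only the parameter that directly follows '?' or '&' is matched.
matches = pattern.findall(url)
print(matches)   # ['utm_source=1&']

cleaned = pattern.sub('', url)
print(cleaned)   # http://localhost/index.php?a=utm_a&b=2
```

The lookbehind only checks the preceding character without consuming it, which is why the "&" before utm_source stays in place and the one after it is removed.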
Testing it:
tests = [ "http://localhost/index.php?a=1&utm_source=1&b=2",
          "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
          "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
          "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
          "http://localhost/index.php?utm_a=a",
          "http://localhost/index.php?a=utm_a",
          "http://localhost/index.php?a=1&b=2",
          "http://localhost/index.php",
          "http://localhost/index.php#hash2"
          ]

for t in tests:
    print(clean_url(t))
http://localhost/index.php?a=1&b=2
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?a=1&b=2&
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?
http://localhost/index.php?a=utm_a
http://localhost/index.php?a=1&b=2
http://localhost/index.php
http://localhost/index.php#hash2
Upvotes: 0
Reputation: 665
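How about this? Nice and simple:
url = 'https://searchengineland.com/amazon-q3-ad-revenues-surpass-1-billion-roughly-2x-early-2016-285763?utm_source=feedburner&utm_medium=feed&utm_campaign=feed-main'
# Drops everything from '?utm' onward, so it only suits URLs whose query string starts with utm_ parameters.
# partition() leaves the URL untouched when '?utm' is absent; slicing with find() would cut off the last character.
print(url.partition('?utm')[0])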
https://searchengineland.com/amazon-q3-ad-revenues-surpass-1-billion-roughly-2x-early-2016-285763
Upvotes: 1
Reputation: 7068
Simple, works, and is based on the link you posted, BUT it's re... so I'm not sure it won't break for some reason that I can't think of :)
import re

def trim_utm(url):
    if "utm_" not in url:
        return url
    # Split into: everything up to '?', the query string, and the '#fragment' (if any).
    matches = re.findall(r'(.+\?)([^#]*)(.*)', url)
    if len(matches) == 0:
        return url
    match = matches[0]
    query = match[1]
    # Keep only the parameters whose name does not start with 'utm_'.
    sanitized_query = '&'.join([p for p in query.split('&') if not p.startswith('utm_')])
    return match[0]+sanitized_query+match[2]

if __name__ == "__main__":
    tests = [ "http://localhost/index.php?a=1&utm_source=1&b=2",
              "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
              "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
              "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
              "http://localhost/index.php?utm_a=a",
              "http://localhost/index.php?a=utm_a",
              "http://localhost/index.php?a=1&b=2",
              "http://localhost/index.php",
              "http://localhost/index.php#hash2"
              ]
    for t in tests:
        trimmed = trim_utm(t)
        print(t)
        print(trimmed)
        print()
Upvotes: 1
Reputation: 142106
It's a bit long, but it uses the url* modules and avoids regexes.
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

url = 'http://whatever.com/somepage?utm_one=3&something=4&utm_two=5&utm_blank&something_else'
parsed = urlparse(url)
# keep_blank_values=True keeps parameters that have no value, such as something_else.
qd = parse_qs(parsed.query, keep_blank_values=True)
filtered = dict((k, v) for k, v in qd.items() if not k.startswith('utm_'))
newurl = urlunparse([
    parsed.scheme,
    parsed.netloc,
    parsed.path,
    parsed.params,
    urlencode(filtered, doseq=True), # query string
    parsed.fragment
])
print(newurl)
# http://whatever.com/somepage?something=4&something_else=
# (blank values are re-encoded with a trailing '=')
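Since the question is about a list of URLs, the same idea can be wrapped in a function. The following is only a sketch, assuming Python 3; strip_utm and the sample URLs are hypothetical names and data, not part of the answer above. It swaps parse_qs for parse_qsl, which returns (key, value) pairs and so keeps parameter order and repeated keys. Values are still decoded and re-encoded, so the third URL shows how a blank value and a %20 come back changed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def strip_utm(url):
    """Return url without utm_* parameters (hypothetical helper name)."""
    parts = urlsplit(url)
    # parse_qsl yields (key, value) pairs, keeping order and repeated keys.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if not k.startswith('utm_')]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Made-up sample URLs.
urls = [
    'http://localhost/index.php?a=1&utm_source=1&b=2#hash',
    'http://localhost/index.php?tag=x&utm_medium=feed&tag=y',
    'http://localhost/index.php?flag&x=a%20b&utm_a=1',
]
cleaned = [strip_utm(u) for u in urls]
for u in cleaned:
    print(u)
# http://localhost/index.php?a=1&b=2#hash
# http://localhost/index.php?tag=x&tag=y
# http://localhost/index.php?flag=&x=a+b
```

If the remaining parameters must stay exactly as written, filtering the raw query string is the safer choice.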
Upvotes: 10
Reputation: 50177
import re
from urllib.parse import urlparse, urlunparse

url = 'http://www.someurl.com/page.html?foo=bar&utm_medium=qux&baz=qoo'
parsed_url = list(urlparse(url))
# Index 4 of the parsed result is the query string.
parsed_url[4] = '&'.join(
    [x for x in parsed_url[4].split('&') if not re.match(r'utm_', x)])
utmless_url = urlunparse(parsed_url)
print(utmless_url) # 'http://www.someurl.com/page.html?foo=bar&baz=qoo'
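A function version of this approach might look like the sketch below, assuming Python 3. drop_utm_params is a hypothetical name, and str.startswith stands in for the re.match call. Because the query string is only split and re-joined, the surviving parameters are never decoded or re-encoded. The check reuses the nine test URLs from the other answers.

```python
from urllib.parse import urlsplit, urlunsplit

def drop_utm_params(url):
    """Remove utm_* parameters, leaving the others exactly as written."""
    parts = urlsplit(url)
    # Filter the raw 'key=value' chunks; nothing is decoded or re-encoded.
    kept = [p for p in parts.query.split('&') if not p.startswith('utm_')]
    return urlunsplit(parts._replace(query='&'.join(kept)))

tests = [
    "http://localhost/index.php?a=1&utm_source=1&b=2",
    "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
    "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
    "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
    "http://localhost/index.php?utm_a=a",
    "http://localhost/index.php?a=utm_a",
    "http://localhost/index.php?a=1&b=2",
    "http://localhost/index.php",
    "http://localhost/index.php#hash2",
]
results = [drop_utm_params(t) for t in tests]
for r in results:
    print(r)
```

urlunsplit omits the "?" when the query ends up empty, so no separate cleanup of trailing "?" or "&" is needed for these inputs.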
Upvotes: 2