Reputation: 16565
I need to store URLs in a database. I don't want to store the same page twice, so I need to strip all the fluff off each URL.
# if I have
url_1 = "http://scientificamerican.com/royal-baby/?utm_campaign=promo"
# and
url_2 = "http://scientificamerican.com/royal-baby/?utm_source=email"
# then they should map to:
url_canonical = "http://scientificamerican.com/royal-baby/"
To get a single canonical URL regardless of which tracking parameters were appended, I tried stripping the query string entirely. The problem is that some CMSs still use the query string to identify the page.
e.g.
url_1 = "https://www.scientificamerican.com/article.cfm?id=obama-budget"
# strip the query string and it becomes
url_1 = "https://www.scientificamerican.com/article.cfm"
# which is obviously the same for all articles :(
This is obviously a problem that a number of people have had to solve, not least the search engines. How do you reduce the URL down such that all that remains is the data for the page?
Upvotes: 2
Views: 750
Reputation: 19879
You can't, at least not from the URL alone. There is no way to know which query parameters are needed to identify the page. There are many parameters you can knowingly remove (e.g. utm_campaign and the other utm_ tracking parameters), but you cannot do that for all of them.
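For the parameters you can knowingly remove, a minimal sketch with the standard library might look like the following. The denylist (the `utm_` prefix plus `gclid` and `fbclid`) and the function name `strip_tracking_params` are assumptions for illustration, not a complete or authoritative list; anything not on the list is kept, so `id=obama-budget` survives.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: an illustrative, incomplete denylist of tracking parameters.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}


def strip_tracking_params(url):
    """Return url with known tracking parameters removed from the query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_PARAMS
    ]
    # urlencode re-serialises the remaining pairs; an empty list yields an
    # empty query, and urlunsplit then omits the "?" altogether.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

With the URLs from the question, both `royal-baby` variants reduce to the same string, while the `article.cfm?id=...` URL is left unchanged. Note that re-encoding the query can normalise escaping in the parameters that are kept.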
Your best bet would be to load the HTML for the page and look for the canonical link element. If it exists, then you've got your canonical URL.
http://en.wikipedia.org/wiki/Canonical_link_element
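As a rough sketch of the canonical link lookup, assuming you have already fetched the page's HTML as a string, something like this would work with only the standard library. The names `CanonicalLinkParser` and `find_canonical` are hypothetical helpers made up for this example; fetching, redirects, and encoding detection are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or be missing.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html, page_url):
    """Return the absolute canonical URL declared in html, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical is None:
        return None
    # The href may be relative, so resolve it against the page's own URL.
    return urljoin(page_url, parser.canonical)
```

If `find_canonical` returns `None`, the page does not declare a canonical URL and you are back to guessing which parameters matter.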
Upvotes: 1