Reputation: 451
I have URLs of the following structure:
https://pinball.globalzone.com/en_US/home?tic=1-dj33jl-dj33jl&goToRegisterNow=true
What I want to do now is shorten the URLs so that I can group and count similar URL patterns. For instance, I want to cut out https://, the locale en_US/ and the token ?tic=1-dj33jl-dj33jl while keeping the rest. The result should look as follows:
pinball.globalzone.com/home&goToRegisterNow=true
I tried to achieve that by using regexp_extract, but this method only lets me extract specific pieces that are always at the same position.
The bigger problem is that the parts I want to cut out are either rule-based (i.e. the locale always consists of two lower-case and two upper-case letters separated by an underscore) or unique with no guaranteed length (i.e. the token).
Moreover, my result set will also contain URLs with a different pattern, in which I only want to cut the parts that are actually present (e.g. https://pinball.globalzone.com/en_US/forgottenPassword, in which only en_US/ has to be cut out).
If I had to solve the problem quickly, I would just fetch the URLs and write a piece of Java or R code that splits each URL into pieces and iterates through the array, cutting out all the parts I don't need. However, I was wondering if there is a more elegant way to get this result straight out of Hive.
Upvotes: 0
Views: 109
Reputation: 8332
What about
(?:https?:\/\/|\/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?]*)
It matches the parts you've described as unwanted. Replacing each match with an empty string should leave you with what you want.
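To check the pattern against the two URLs from the question outside of Hive, here is a minimal sketch in Python. The helper name shorten is made up for illustration, and Python's re module is only standing in for Hive's Java-based regex engine; the constructs used here behave the same in both as far as I know.

```python
import re

# Pattern copied from the answer. "/" needs no escaping in Python
# (or in Java), but the escaped form is harmless.
UNWANTED = re.compile(r"(?:https?:\/\/|\/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?]*)")


def shorten(url: str) -> str:
    """Hypothetical helper: drop the scheme, the locale segment and the tic parameter."""
    return UNWANTED.sub("", url)


# The two example URLs from the question.
urls = [
    "https://pinball.globalzone.com/en_US/home?tic=1-dj33jl-dj33jl&goToRegisterNow=true",
    "https://pinball.globalzone.com/en_US/forgottenPassword",
]

shortened = [shorten(u) for u in urls]
for s in shortened:
    print(s)
```

In Hive itself the same idea would presumably go through regexp_replace(url, pattern, ''). Since backslashes inside Hive string literals may need doubling, it is probably simplest to drop the escapes altogether and use '(?:https?://|/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?]*)' as the pattern.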
Edit
Updated to check for tic=, which should make it more stable.
And I don't know if it's what you want, but this one allows tic= to be any parameter, not only the first, and it also stops matching the token at a line break:
(?:https?:\/\/|\/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?\n]*)
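A small sketch of what the extra \n in the character class changes, again in Python as a stand-in for Hive's regex engine. The input is hypothetical: the first URL from the question with tic moved to the end, followed by a second URL on the next line.

```python
import re

# The two variants from the answer; they differ only in the "\n" exclusion.
FIRST = re.compile(r"(?:https?:\/\/|\/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?]*)")
SECOND = re.compile(r"(?:https?:\/\/|\/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?\n]*)")

# Hypothetical multi-line input: tic is the last parameter of the first URL.
text = (
    "https://pinball.globalzone.com/en_US/home?goToRegisterNow=true&tic=1-dj33jl-dj33jl\n"
    "https://pinball.globalzone.com/en_US/forgottenPassword"
)

# Without the "\n" exclusion the token match runs past the line break
# and swallows the whole second line.
first_result = FIRST.sub("", text)

# With the exclusion the token match stops at the end of the first line.
second_result = SECOND.sub("", text)

print(repr(first_result))
print(repr(second_result))
```

If every URL sits in its own row, as it would in a Hive column, the two variants should behave the same; the difference only shows up when several URLs share one string.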
Upvotes: 1