JSN

Reputation: 451

Individually shorten URLs in Hive

I have URLs of the following structure:

https://pinball.globalzone.com/en_US/home?tic=1-dj33jl-dj33jl&goToRegisterNow=true

I want to shorten these URLs so that I can group and count similar URL patterns. For instance, I want to cut out the scheme https://, the locale en_US/ and the token ?tic=1-dj33jl-dj33jl while keeping the rest. The result should look as follows:

pinball.globalzone.com/home&goToRegisterNow=true

I tried to achieve that with regexp_extract, but it only lets me extract specific pieces that are always at the same position.

The bigger problem is that the parts I want to cut out are either rule-based (e.g. the locale always consists of two lowercase and two uppercase letters separated by an underscore) or unique with no guaranteed length (e.g. the token).

Moreover, my result set will also contain URLs with a different pattern, from which I only want to cut out whichever of these parts are present (e.g. https://pinball.globalzone.com/en_US/forgottenPassword, in which only https:// and en_US/ have to be cut out).

If I had to solve the problem quickly, I would just export the URLs and write a piece of Java or R code that splits each URL into pieces and iterates through the array, cutting out all the parts I don't need. However, I was wondering if there is a more elegant way to get this result straight out of Hive.

Upvotes: 0

Views: 109

Answers (1)

SamWhan

Reputation: 8332

What about

(?:https?:\/\/|\/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?]*)

It matches the parts you've described as unwanted. Replacing every match with an empty string should leave you with what you want.

See it here at regex101.

Edit

Updated to check for tic=. Should make it more stable.

And I don't know if it's what you want, but this one allows tic= to be any parameter, not only the first (the added \n stops the match at a line break):

(?:https?:\/\/|\/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?\n]*)

Here at regex101
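
As far as I know, Hive's regexp_replace is built on Java regular expressions, so the pattern can be sanity-checked in plain Java before it goes into a query. The following is only a minimal sketch under that assumption: the class name UrlShortener and the method shorten are made up for illustration, and the escaped slashes (\/) are written as plain / because Java does not need them escaped. Inside a Hive string literal the backslash in \n may need doubling, so test the query on a few rows first.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive: it applies the same pattern as above
// with Java's regex engine.
class UrlShortener {
    // Alternatives: scheme | locale segment such as /en_US | tic parameter.
    private static final Pattern UNWANTED =
        Pattern.compile("(?:https?://|/[a-z]{2}_[A-Z]{2}|[?&]tic=[^&?\\n]*)");

    // Removes every match of the pattern from the URL.
    static String shorten(String url) {
        return UNWANTED.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // prints pinball.globalzone.com/home&goToRegisterNow=true
        System.out.println(shorten(
            "https://pinball.globalzone.com/en_US/home?tic=1-dj33jl-dj33jl&goToRegisterNow=true"));
        // prints pinball.globalzone.com/forgottenPassword
        System.out.println(shorten(
            "https://pinball.globalzone.com/en_US/forgottenPassword"));
        // tic= as the second parameter:
        // prints pinball.globalzone.com/home?goToRegisterNow=true
        System.out.println(shorten(
            "https://pinball.globalzone.com/en_US/home?goToRegisterNow=true&tic=1-dj33jl-dj33jl"));
    }
}
```

In Hive the equivalent would presumably be a regexp_replace call with the same pattern and an empty replacement string, followed by a GROUP BY on the shortened value to count the patterns.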

Upvotes: 1
