Josh
Josh

Reputation: 43

HIVE regexp_extract URL strings

Hi I'm trying to parse large URL's from a log using Hive.

There is a particular value that I want to extract from the URL (strategy=??) the values can be hyphenated, but not always.

I built this sample query, but it just returns nothing.

What am I doing wrong?

select regexp_extract('234=23234&werw=asdf&strategy=retargeting&asdf=fds23', '(strategy=)([-\w*]*)',2) from vt;

so the value I'm expecting to get is retargeting from this partial URL string. 234=23234&werw=asdf&strategy=retargeting&asdf=fds23

Any help is greatly appreciated!!!

Upvotes: 0

Views: 4326

Answers (1)

c.maclean
c.maclean

Reputation: 411

I believe this regex will work for you:

strategy=((\w-?)+)

Here's a RegExr link: http://regexr.com?35sbl. After matching, group 1 contains the value of strategy. Note that this regex will match any number of hyphens in the value. It also fails if the hyphen is the first character (though, in my opinion, a leading hyphen does not make a value 'hyphenated').

From what I can tell, your method didn't return anything because of the way group 2 is set up: you have [-\w*], which says "match a hyphen and then any number of alphanumeric characters (including 0)". You could also rewrite this as [-?\w*]*, which says "match or don't match a hyphen, and then any number of alphanumeric characters (including 0)". However, this would then match just a hyphen, as in this case

strategy=-

Which is not something you want, I would think. A slightly safer way might be setting group 2 to [-?\w+]+, which would require at least one \w character after the equals sign. Happy coding! :)

Upvotes: 0

Related Questions