Danny David Leybzon
Danny David Leybzon

Reputation: 680

Extract all characters before a period with HiveQL regex?

I have a table that looks like:

bl.ah
foo.bar
bar.fight

And I'd like to use HiveQL's regexp_extract to return

bl
foo
bar

Upvotes: 4

Views: 8604

Answers (1)

Cam
Cam

Reputation: 931

Given the docs data about regexp_extract:

regexp_extract(string subject, string pattern, int index)

Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.

So, if you have a table with a single column (let's call it description for our example) you should be able to use regexp_extract as follows to get the data before a period, if one exists, or the entire string in the absence of a period:

regexp_extract(description,'^([^\.]+)\.?',1)

The components of the regex are as follows:

  • ^ start of string
  • ([^\.]+) any non-period character one or more times, in a capture group
  • \.? a period either once or no times

Because the part of the string we're interested in will be in the first (and only) capture group, we refer to it by passing the index parameter a value of 1.

Upvotes: 5

Related Questions