thecardcaptor
thecardcaptor

Reputation: 131

how to regexp_extract before nth occurance hive

I have string like this:

LLC1R.8888.GR0054656*DR.798012...2..............GR0054656*EUR*

I already make extraction with this: REGEXP_EXTRACT(b.COLUMN_YY,'^([^\.]+)\.?',2)

but it occur with error 2 : mapredtask

    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer)  on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract@5875de6a of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments :java.lang.String, ^([^.]+).?:java.lang.String, 2:java.lang.Integer} of size 3
    at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1024)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:194)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
Caused by: java.lang.reflect.InvocationTargetException
Caused by: java.lang.IndexOutOfBoundsException: No group 2

what went wrong and how it should be fixed?

Any help/ideas will be appreciated. Thank you.

Upvotes: 1

Views: 737

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626927

It is clear what went wrong from the error stack trace: "Caused by: java.lang.IndexOutOfBoundsException: No group 2"

See REGEXP_EXTRACT reference:

Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.

So, to get the first substring before ., you should use

REGEXP_EXTRACT(b.COLUMN_YY, '^([^.]*)', 1)

To get the second string between ., use

REGEXP_EXTRACT(b.COLUMN_YY, '^[^.]*\\.([^.]*)', 1)

To get the nth string between ., use (?:[^.]*\.){n-1}. So, to get the third one, use

REGEXP_EXTRACT(b.COLUMN_YY, '^(?:[^.]*\\.){2}([^.]*)', 1)

Note that 1 index argument makes REGEXP_EXTRACT return the value captured in Group 1, see the pattern details below:

  • ^ - start of string
  • (?:[^.]*\.){2} - here, two occurrences of
    • [^.]* - 0 or more chars other than .
    • \. - a . char
  • ([^.]*) - Group 1: any 0 or more chars other than ..

Upvotes: 2

Related Questions