Reputation: 131
I have string like this:
LLC1R.8888.GR0054656*DR.798012...2..............GR0054656*EUR*
I already make extraction with this:
REGEXP_EXTRACT(b.COLUMN_YY,'^([^\.]+)\.?',2)
but it occur with error 2 : mapredtask
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer) on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract@5875de6a of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments :java.lang.String, ^([^.]+).?:java.lang.String, 2:java.lang.Integer} of size 3
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1024)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:194)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
Caused by: java.lang.reflect.InvocationTargetException
Caused by: java.lang.IndexOutOfBoundsException: No group 2
what went wrong and how it should be fixed?
Any help/ideas will be appreciated. Thank you.
Upvotes: 1
Views: 737
Reputation: 626927
It is clear what went wrong from the error stack trace: "Caused by: java.lang.IndexOutOfBoundsException: No group 2"
Returns the string extracted using the pattern. For example,
regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
returns'bar.'
Note that some care is necessary in using predefined character classes: using'\s'
as the second argument will match the letters
;'\\s'
is necessary to match whitespace, etc. The'index'
parameter is the Java regex Matchergroup()
method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regexgroup()
method.
So, to get the first substring before .
, you should use
REGEXP_EXTRACT(b.COLUMN_YY, '^([^.]*)', 1)
To get the second string between .
, use
REGEXP_EXTRACT(b.COLUMN_YY, '^[^.]*\\.([^.]*)', 1)
To get the nth string between .
, use (?:[^.]*\.){n-1}
. So, to get the third one, use
REGEXP_EXTRACT(b.COLUMN_YY, '^(?:[^.]*\\.){2}([^.]*)', 1)
Note that 1
index argument makes REGEXP_EXTRACT
return the value captured in Group 1, see the pattern details below:
^
- start of string(?:[^.]*\.){2}
- here, two occurrences of
[^.]*
- 0 or more chars other than .
\.
- a .
char([^.]*)
- Group 1: any 0 or more chars other than .
.Upvotes: 2