Reputation: 62
I need to load a text file into a Hive table. The file contains the tab-separated columns of a table. The problem is that some columns are separated by more than one tab:
135.124.143.193 20150601013300 http://newsru.com/4712386 253 486 ok
There are three tabs between the first and second columns and exactly one tab between each of the remaining columns.
I have figured out that I need to use RegexSerDe to process this correctly. I am trying:
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "(\\S+)\\t+")
but when I try to select from the resulting table I get:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns
Upvotes: 0
Views: 273
Reputation: 163207
Your pattern contains only one capturing group, and because it requires at least one tab after the value, it cannot match the last value when there is no tab at the end of the line.
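Both problems are easy to reproduce outside Hive. The sketch below uses Python's `re` module as a stand-in for the Java regex engine that RegexSerDe relies on (the two behave the same for `\S`, `\t` and `+`). The sample row is rebuilt from the question with three tabs after the first column, which is an assumption about the real file, and the variable names are only illustrative. The backslashes are single here because a Python raw string needs no extra escaping.

```python
import re

# Sample row from the question, rebuilt with three tabs after the first
# column (assumed layout of the real file).
row = "135.124.143.193\t\t\t20150601013300\thttp://newsru.com/4712386\t253\t486\tok"

# The pattern from the question: one capturing group followed by 1+ tabs.
original = re.compile(r"(\S+)\t+")

group_count = original.groups                        # capturing groups in the pattern
whole_row_matches = original.fullmatch(row) is not None
pieces = original.findall(row)                       # captures when applied repeatedly

print(group_count)        # 1, while the table has 6 columns
print(whole_row_matches)  # False
print(pieces)             # the final "ok" is missing: no tab follows it
```

The group count of 1 against six table columns is what the "Number of matching groups doesn't match the number of columns" message is complaining about.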
Instead, you can give every column its own capturing group, with the groups separated by one or more tabs.
As per the comments, this pattern works for you:
(\\S+)\\t+(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+).*
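To check that this pattern really yields six values, here is a small sketch that applies it to the sample row. It again uses Python's `re` as a stand-in for Java's regex engine, and the row layout (three tabs, then single tabs) is assumed from the question. In the Hive DDL the backslashes stay doubled, as shown above.

```python
import re

# Sample row from the question (assumed: three tabs, then single tabs).
row = "135.124.143.193\t\t\t20150601013300\thttp://newsru.com/4712386\t253\t486\tok"

# Same pattern as above, written as a Python raw string (single backslashes).
pattern = re.compile(r"(\S+)\t+(\S+)\t(\S+)\t(\S+)\t(\S+)\t(\S+).*")

match = pattern.fullmatch(row)
fields = list(match.groups()) if match else None

print(pattern.groups)  # 6 capturing groups, one per column
print(fields)
```

Each element of `fields` corresponds to one column of the table, in order.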
If you want to match exactly three tabs after the first group, you could use the quantifier \\t{3} instead of \\t+.
If you want to match exactly all the columns and nothing else, you might also use anchors to assert the start ^ and the end $ of the string.
^(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)$
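The difference between the stricter variants shows up on rows that deviate from the expected layout. This is only a sketch: it uses Python's `re` in place of Java's regex engine, the test rows are made up from the sample in the question, and the `\t{3}` pattern is the earlier six-group pattern with `\t+` swapped out.

```python
import re

# Sample row from the question (assumed: three tabs, then single tabs).
row = "135.124.143.193\t\t\t20150601013300\thttp://newsru.com/4712386\t253\t486\tok"
one_tab_row = row.replace("\t\t\t", "\t")  # hypothetical row with a single tab first
extra_column_row = row + "\textra"         # hypothetical row with a seventh column

# Variant 1: exactly three tabs after the first group.
exact_tabs = re.compile(r"(\S+)\t{3}(\S+)\t(\S+)\t(\S+)\t(\S+)\t(\S+).*")
# Variant 2: anchored, one or more tabs between all six groups.
anchored = re.compile(r"^(\S+)\t+(\S+)\t+(\S+)\t+(\S+)\t+(\S+)\t+(\S+)$")

results = {
    "exact_tabs / sample": exact_tabs.fullmatch(row) is not None,
    "exact_tabs / one tab": exact_tabs.fullmatch(one_tab_row) is not None,
    "anchored / sample": anchored.match(row) is not None,
    "anchored / one tab": anchored.match(one_tab_row) is not None,
    "anchored / extra column": anchored.match(extra_column_row) is not None,
}

for name, matched in results.items():
    print(name, matched)
```

So `\t{3}` rejects rows whose first separator is not exactly three tabs, while the anchored pattern tolerates any number of tabs but rejects rows with extra trailing content.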
Upvotes: 1