Egor Maksimov

Reputation: 62

Hive: using RegexSerDe with multiple tabs

I need to load a text table into Hive. The file contains the table's columns separated by tabs. The problem is that some columns are separated by more than one tab:

135.124.143.193   20150601013300 http://newsru.com/4712386 253 486 ok

There are three tabs between the first and second columns and exactly one tab between each of the remaining columns.

I have figured out that I need to use RegexSerDe to process this correctly. I am trying:

row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
    with serdeproperties ("input.regex" = "(\\S+)\\t+")

but when I try to select from the resulting table I get:

Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns

Upvotes: 0

Views: 273

Answers (1)

The fourth bird

Reputation: 163207

If there is no tab at the end of the string, you will not match the last value, and the pattern only has a single capturing group, while the sample row has six columns.
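A quick way to see the mismatch outside Hive is to count the capturing groups with plain java.util.regex. This is only a sketch: it assumes RegexSerDe compares the pattern's group count with the number of table columns, as the error message suggests, and the class and helper names are made up for illustration.

```java
import java.util.regex.Pattern;

public class GroupCountCheck {
    // Hypothetical helper: number of capturing groups a pattern declares.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Pattern from the question: one capturing group only.
        System.out.println(groupCount("(\\S+)\\t+")); // prints 1

        // One capturing group per column of the six-column sample row.
        System.out.println(
            groupCount("(\\S+)\\t+(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+).*")); // prints 6
    }
}
```

With one group against a six-column table, the counts differ, which is consistent with the SerDeException in the question.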

What you might do is write one capturing group per column, with the groups separated by 1+ tabs.

As per the comments, this pattern works for you:

(\\S+)\\t+(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+).*


If you want to match exactly 3 tabs after the first group, you could use the quantifier \\t{3} instead of \\t+.
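The effect of the stricter quantifier can be checked with a small Java sketch. The class name, the helper, and the shortened test rows are assumptions for illustration, and the check uses a full match, which is how RegexSerDe is generally understood to apply the pattern.

```java
import java.util.regex.Pattern;

public class ExactTabsCheck {
    // Six-group pattern that requires exactly three tabs after the first column.
    static final Pattern EXACT =
        Pattern.compile("(\\S+)\\t{3}(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+).*");

    // Hypothetical helper: true when the whole row matches the pattern.
    static boolean matches(String row) {
        return EXACT.matcher(row).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("a\t\t\tb\tc\td\te\tf"));   // prints true (three tabs)
        System.out.println(matches("a\tb\tc\td\te\tf"));       // prints false (one tab)
        System.out.println(matches("a\t\t\t\tb\tc\td\te\tf")); // prints false (four tabs)
    }
}
```

Whether the strict form is preferable depends on how consistent the input file is; \\t+ is more forgiving if the number of tabs varies between rows.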

If you want to match exactly all the columns, you might also use anchors to assert the start ^ and the end $ of the string:

^(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)$
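To try the anchored pattern before putting it into the table definition, it can be run against the sample row with java.util.regex. This is a minimal sketch: the class and method names are hypothetical, and the sample string assumes the row contains three tabs after the first column and single tabs elsewhere, as described in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchoredRowCheck {
    // Anchored pattern from the answer: six columns separated by one or more tabs.
    static final Pattern ROW = Pattern.compile(
        "^(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)$");

    // Returns the six column values, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1); // capturing groups are 1-based
        }
        return cols;
    }

    public static void main(String[] args) {
        String sample = "135.124.143.193\t\t\t20150601013300\t"
            + "http://newsru.com/4712386\t253\t486\tok";
        System.out.println(String.join(" | ", parse(sample)));
    }
}
```

A row with a seventh column or a trailing tab would return null here, because the anchors leave no room for extra text after the sixth group.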

Upvotes: 1
