Regex :: How to create regex for hive regex serde

Question

I am using Serde2 (an Apache Hive module) which can use regex to split data.
I am try to write a regex to split the following data:

123~|`sample~|`text
12~|`ss|gs~|`max`s

The delimiter or field separator is ~|`.
So far I have come up with this:

[^(?!^\~\|`$)]*\~\|`[^(?!\~\|`)]**\~\|`[^(?!\~\|`)]*

but this is not working. The error message is:

java.io.IOException:
    org.apache.hadoop.hive.serde2.SerDeException:
        Number of matching groups doesn't match the number of columns

How can I fix my Regex?

rchang · Accepted Answer

I think this is the regex you are looking for:

(.*?)~\|`(.*?)~\|`(.*)

In case you are worried about screening out lines in your data which might have a number of fields other than 3, you can add ^ and $ to the beginning and end of the regular expression respectively. That shouldn't be needed if you are pretty confident about your data however.

Note that the escaping backslashes themselves have to be escaped since this is a Java string. So, testing with your data in a local file:

# cat test.data
123~|`sample~|`text
12~|`ss|gs~|`max`s

And this is how your data gets de-serialized/serialized:

hive> CREATE TABLE table_name (
    >   first STRING,
    >   second STRING,
    >   third STRING
    > ) 
    >   ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    >   WITH SERDEPROPERTIES (
    >     "input.regex" = "(.*?)~\|`(.*?)~\|`(.*)",
    >     "output.format.string" = "%1$s %2$s %3$s"
    > );
OK
Time taken: 0.4 seconds
hive> LOAD DATA LOCAL INPATH 'test.data' INTO TABLE table_name;
Copying data from file:test.data
Copying file: file:test.data
Loading data to table default.table_name
Table default.table_name stats: [numFiles=1, numRows=0, totalSize=39, rawDataSize=0]
OK
Time taken: 0.601 seconds
hive> SELECT * FROM table_name;
OK
123 sample  text
12  ss|gs   max`s 
Time taken: 0.382 seconds, Fetched: 2 row(s)

I hope this helps.

Regex :: How to create regex for hive regex serde

Answers (1)

Related Questions