Reputation: 614
I am using Serde2 (an Apache Hive module) which can use regex to split data.
I am try to write a regex to split the following data:
123~|`sample~|`text
12~|`ss|gs~|`max`s
The delimiter or field separator is ~|`
.
So far I have come up with this:
[^(?!^\~\|`$)]*\~\|`[^(?!\~\|`)]**\~\|`[^(?!\~\|`)]*
but this is not working. The error message is:
java.io.IOException:
org.apache.hadoop.hive.serde2.SerDeException:
Number of matching groups doesn't match the number of columns
How can I fix my Regex?
Upvotes: 0
Views: 3058
Reputation: 5236
I think this is the regex you are looking for:
(.*?)~\\|`(.*?)~\\|`(.*)
In case you are worried about screening out lines in your data which might have a number of fields other than 3, you can add ^
and $
to the beginning and end of the regular expression respectively. That shouldn't be needed if you are pretty confident about your data however.
Note that the escaping backslashes themselves have to be escaped since this is a Java string. So, testing with your data in a local file:
# cat test.data
123~|`sample~|`text
12~|`ss|gs~|`max`s
And this is how your data gets de-serialized/serialized:
hive> CREATE TABLE table_name (
> first STRING,
> second STRING,
> third STRING
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
> "input.regex" = "(.*?)~\\|`(.*?)~\\|`(.*)",
> "output.format.string" = "%1$s %2$s %3$s"
> );
OK
Time taken: 0.4 seconds
hive> LOAD DATA LOCAL INPATH 'test.data' INTO TABLE table_name;
Copying data from file:test.data
Copying file: file:test.data
Loading data to table default.table_name
Table default.table_name stats: [numFiles=1, numRows=0, totalSize=39, rawDataSize=0]
OK
Time taken: 0.601 seconds
hive> SELECT * FROM table_name;
OK
123 sample text
12 ss|gs max`s
Time taken: 0.382 seconds, Fetched: 2 row(s)
I hope this helps.
Upvotes: 1