Reputation: 2075
I am trying to insert data into a hive table from an input file which is a csv formatted as follows:-
stringA,"stringB","stringC",stringD,"stringE","stringF"
where stringF has the delimiter ',' within itself.
I tried with the following input regex but i get all NULL values in my table:-
create external table tablName(col1 string, col2 string, col3 string, col4 string, col5 string, col6 string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
("input.regex" = (([^ ]*),(\"[^\"]\"),(\"[^\"]\"),([^ ]*),(\"[^\"]\"),(\"[^\"]\"))
Location '/path/to/data'
Upvotes: 0
Views: 206
Reputation: 1933
The backslashes may not be necessary depending upon where this regex is used. Hive may need them, python does not. The outermost braces are acting as another capture group and are causing to match the whole string again.
>>> input = 'stringA,"stringB","stringC",stringD,"stringE","stringF"'
>>> print (input)
stringA,"stringB","stringC",stringD,"stringE","stringF"
>>> p = re.compile( '(([^"]+),(\"[^"]+"),(\"[^"]+"),([^"]+),(\"[^\"]+"),(\"[^"]+"))' )
>>> p.findall(input)
[('stringA,"stringB","stringC",stringD,"stringE","stringF"', 'stringA', '"stringB"', '"stringC"', 'stringD', '"stringE"', '"stringF"')]
>>> p = re.compile( '([^"]+),("[^"]+"),("[^"]+"),([^"]+),("[^"]+"),("[^"]+")' )
>>> p.findall(input)
[('stringA', '"stringB"', '"stringC"', 'stringD', '"stringE"', '"stringF"')]
>>>
Upvotes: 0
Reputation: 240
Using the following regex should match your pattern given the criteria.
(([^\"]+),(\"[^\"]+\"),(\"[^\"]+\"),([^\"]+),(\"[^\"]+\"),(\"[^\"]+\"))
or as mentioned, depending on language requirements (I'm not familiar with hive) you may not need to escape the quotes inside the brackets:
(([^"]+),(\"[^"]+\"),(\"[^"]+\"),([^"]+),(\"[^"]+\"),(\"[^"]+\"))
Using + matches at least the presence, * doesn't require a match.
Upvotes: 0
Reputation: 2738
A [ ] expression only defines a class of characters. If you don't specify a quantifier, it means: Precisely one occurrences of a character of the defined class. Therefore, your expression presumably has to look like this:
(([^ ]*),(\"[^\"]*\"),(\"[^\"]*\"),([^ ]*),(\"[^\"]*\"),(\"[^\"]*\"))
Upvotes: 1