user720694
user720694

Reputation: 2075

Hive: What is wrong with this regex?

I am trying to insert data into a hive table from an input file which is a csv formatted as follows:-

stringA,"stringB","stringC",stringD,"stringE","stringF"

where stringF has the delimiter ',' within itself.

I tried with the following input regex but i get all NULL values in my table:-

create external table tablName(col1 string, col2 string, col3 string, col4 string, col5 string, col6 string)

   row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
("input.regex" =  (([^ ]*),(\"[^\"]\"),(\"[^\"]\"),([^ ]*),(\"[^\"]\"),(\"[^\"]\"))
Location '/path/to/data'

Upvotes: 0

Views: 206

Answers (3)

ramana_k
ramana_k

Reputation: 1933

The backslashes may not be necessary depending upon where this regex is used. Hive may need them, python does not. The outermost braces are acting as another capture group and are causing to match the whole string again.

>>> input = 'stringA,"stringB","stringC",stringD,"stringE","stringF"'
>>> print (input)
stringA,"stringB","stringC",stringD,"stringE","stringF"


>>> p = re.compile( '(([^"]+),(\"[^"]+"),(\"[^"]+"),([^"]+),(\"[^\"]+"),(\"[^"]+"))' )

>>> p.findall(input)
[('stringA,"stringB","stringC",stringD,"stringE","stringF"', 'stringA', '"stringB"', '"stringC"', 'stringD', '"stringE"', '"stringF"')]



>>> p = re.compile( '([^"]+),("[^"]+"),("[^"]+"),([^"]+),("[^"]+"),("[^"]+")' )
>>> p.findall(input)
[('stringA', '"stringB"', '"stringC"', 'stringD', '"stringE"', '"stringF"')]
>>> 

Upvotes: 0

ydobonebi
ydobonebi

Reputation: 240

Using the following regex should match your pattern given the criteria.

(([^\"]+),(\"[^\"]+\"),(\"[^\"]+\"),([^\"]+),(\"[^\"]+\"),(\"[^\"]+\"))

or as mentioned, depending on language requirements (I'm not familiar with hive) you may not need to escape the quotes inside the brackets:

(([^"]+),(\"[^"]+\"),(\"[^"]+\"),([^"]+),(\"[^"]+\"),(\"[^"]+\"))

Using + matches at least the presence, * doesn't require a match.

http://regexr.com/3be3i

Upvotes: 0

rplantiko
rplantiko

Reputation: 2738

A [ ] expression only defines a class of characters. If you don't specify a quantifier, it means: Precisely one occurrences of a character of the defined class. Therefore, your expression presumably has to look like this:

(([^ ]*),(\"[^\"]*\"),(\"[^\"]*\"),([^ ]*),(\"[^\"]*\"),(\"[^\"]*\"))

Upvotes: 1

Related Questions