Reputation: 48
I am new to Hive and Hadoop framework. I am trying to write a hive query to split the column delimited by a pipe '|' character. Then I want to group up the 2 adjacent values and separate them into separate rows.
Example, I have a table
id mapper
1 a|0.1|b|0.2
2 c|0.2|d|0.3|e|0.6
3 f|0.6
I am able to split the column by using split(mapper, "\\|")
which gives me the array
id mapper
1 [a,0.1,b,0.2]
2 [c,0.2,d,0.3,e,0.6]
3 [f,0.6]
Now I tried to to use the lateral view to split the mapper array into separate rows, but it will separate all the values, where as I want to separate by group.
id mapper
1 [a,0.1]
1 [b,0.2]
2 [c,0.2]
2 [d,0.3]
2 [e,0.6]
3 [f,0.6]
id mapper
1 a
1 0.1
1 b
1 0.2
etc .......
How can I achieve this?
Upvotes: 0
Views: 1651
Reputation: 1782
I would suggest you to split your pairs split(mapper, '(?<=\\d)\\|(?=\\w)')
, e.g.
split('c|0.2|d|0.3|e|0.6', '(?<=\\d)\\|(?=\\w)')
results in
["c|0.2","d|0.3","e|0.6"]
then explode the resulting array and split by |
.
Update:
If you have digits as well and your float numbers have only one digit after decimal marker then the regex should be extended to split(mapper, '(?<=\\.\\d)\\|(?=\\w|\\d)')
.
Update 2:
OK, the best way is to split on the second |
as follows
split(mapper, '(?<!\\G[^\\|]+)\\|')
e.g.
split('6193439|0.0444035224643987|6186654|0.0444035224643987', '(?<!\\G[^\\|]+)\\|')
results in
["6193439|0.0444035224643987","6186654|0.0444035224643987"]
Upvotes: 1