Reputation: 23
I have a hive table in the format :
col1. col2. col3.
a1 b1 c1
a1 b1 c2
a1 b2 c2
a1 b2 c3
a2 b3 c1
a2 b4 c1
a2 b4 c2
a2 b4 c3
.
.
Every value in col1 can have multiple values in col2 and every such pair of (col1, col2) can have multiple values of col3.
I am running the query [Q]:
select col1, col2, collect_list(col3) from {table} group by col1, col2;
to get:
a1 b1 [c1, c2]
a1 b2 [c2, c3]
a2 b3 [c1]
a2 b4 [c1, c2, c3]
I want to do some transformations using a python UDF. So Im passing all these columns to UDF using TRANSFORM clause as:
select TRANSFORM ( * ) using 'python udf.py' FROM
(
select col1, col2, concat_ws('\t', collect_list(col3)) from {table} group by col1, col2;
)
I'm using concat_ws to convert the array output to strig from collect_list concatenated by separator. I get col1, col2 in result, but do not get the col3 output.
+---------+---------+
| key| value|
+---------+---------+
|a1 | b1 |
| | null|
|a1 | b2 |
| | null|
|a2 | b3 |
| | null|
|a2 | b4 |
| | null|
+---------+---------+
In my UDF, I just have a print statement that prints the line received from stdin.
import sys
for line in sys.stdin:
try:
print line
except Exception as e:
continue
can someone help figure out why i'm not getting the col3 in my UDF ?
Upvotes: 1
Views: 313
Reputation: 1772
First, you need to parse the line in Python UDF, e.g.,
import sys
for line in sys.stdin:
try:
line = line.strip('\n')
col1, col2, col3 = line.split('\t')
print '\t'.join([col1, col2, col3])
except Exception as e:
continue
Then it's better to use something else instead of \t
in concat_ws
select TRANSFORM ( * ) using 'python udf.py' as (col1, col2, col3)
FROM
(
select col1, col2, concat_ws(',', collect_list(col3)) from {table} group by col1, col2;
Upvotes: 1