How to query multiple columns when using a python udf in hive?

Question

I am trying to execute this query:

add FILE /home/user1/test/test_udf.py;

SELECT a.hash_code, col2
FROM (SELECT transform (col2, col3) using 'python test_udf.py' as hash_code, col2
      FROM sample_table) a ;

I am able to successfully generate the hash_code using the udf but the other column (col2) is getting populated as NULL.

Sample Output:

sjhfshhalksjlkfj128798172jasjhas   NULL
ajsdlkja982988290819189089089889   NULL
jhsad817982mnsandkjsahj982398290   NULL

luoluo · Accepted Answer

I know what's wrong with your HiveSql.

In transform (col2, col3) using 'python test_udf.py' as hash_code, col2 FROM sample_table, the hash_code, col2's value is parsed from transform (col2, col3)'s return value.

The clo2 is parsed from transform (col2, col3), which is NULL.

I read the Transform doc, picked up the related info as below.

Transform/Map-Reduce Syntax

SELECT TRANSFORM '(' expression (',' expression)* ')'
  (inRowFormat)?
  USING 'my_reduce_script'
  ( AS colName (',' colName)* )?
  (outRowFormat)? (outRecordReader)?

You'd better not mix transform with other select, as the syntax won't support.

Update：

There is a hack to do what you want: let the test_udf.py return hash_code col2. So you can parse hash_code, col2 from it. This would solve your problem.

How to query multiple columns when using a python udf in hive?

Answers (1)

Related Questions