python-starter
python-starter

Reputation: 300

How to query multiple columns when using a python udf in hive?

I am trying to execute this query:

add FILE /home/user1/test/test_udf.py;

SELECT a.hash_code, col2
FROM (SELECT transform (col2, col3) using 'python test_udf.py' as hash_code, col2
      FROM sample_table) a ;

I am able to successfully generate the hash_code using the udf but the other column (col2) is getting populated as NULL.

Sample Output:

sjhfshhalksjlkfj128798172jasjhas   NULL
ajsdlkja982988290819189089089889   NULL
jhsad817982mnsandkjsahj982398290   NULL

Upvotes: 1

Views: 2074

Answers (1)

luoluo
luoluo

Reputation: 5533

I know what's wrong with your HiveSql.

In transform (col2, col3) using 'python test_udf.py' as hash_code, col2 FROM sample_table, the hash_code, col2's value is parsed from transform (col2, col3)'s return value.

The clo2 is parsed from transform (col2, col3), which is NULL.

I read the Transform doc, picked up the related info as below.

Transform/Map-Reduce Syntax

SELECT TRANSFORM '(' expression (',' expression)* ')'
  (inRowFormat)?
  USING 'my_reduce_script'
  ( AS colName (',' colName)* )?
  (outRowFormat)? (outRecordReader)?

You'd better not mix transform with other select, as the syntax won't support.

Update:

There is a hack to do what you want: let the test_udf.py return hash_code\t col2. So you can parse hash_code, col2 from it. This would solve your problem.

Upvotes: 1

Related Questions