formicaman

Reputation: 1357

PySpark - undefined function collect_list

I am using Python 2.6.6 and Spark 1.6.0. I have a DataFrame like this:

id | name      |  number |
-------------------------- 
1  | joe       | 148590  |
2  | bob       | 148590  |
2  | steve     | 279109  |
3  | sue       | 382901  |
3  | linda     | 148590  |

Whenever I try to run something like df2 = df.groupBy('id','length','type').pivot('id').agg(collect_list('name')), I get the following error:

pyspark.sql.utils.AnalysisException: u'undefined function collect_list;'

Why is this?

I have also tried:

hive_context = HiveContext(sc)
df2 = df.groupBy('id','length','type').pivot('id').agg(hive_context.collect_list('name'))

and get the error:

AttributeError: 'HiveContext' object has no attribute 'collect_list'

Upvotes: 2

Views: 1596

Answers (1)

sam

Reputation: 2311

Here, collect_list looks like a user-defined function. The PySpark API only supports a handful of predefined functions such as sum, count, etc. out of the box.

If you are referring to any other code, please ensure you have the collect_list function defined somewhere. To import the collect_list function, add the following line at the top:

from pyspark.sql import functions as F

Then change your code to:

df.groupBy('id','length','type').pivot('id').agg(F.collect_list('name'))

If you already have it defined, try the snippet below.

df.groupBy('id','length','type').pivot('id').agg({'name':'collect_list'})
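
For reference, here is a minimal, self-contained sketch, assuming Spark 1.6 and a simplified version of the sample data from the question (the extra grouping and pivot columns are omitted). Since collect_list is backed by a Hive UDAF in Spark 1.6, the DataFrame is created from a HiveContext, as in the second attempt in the question:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext(appName="collect_list_example")
hive_context = HiveContext(sc)  # collect_list is a Hive UDAF in Spark 1.6

# simplified sample data mirroring the question's columns
df = hive_context.createDataFrame(
    [(1, 'joe', 148590), (2, 'bob', 148590), (2, 'steve', 279109),
     (3, 'sue', 382901), (3, 'linda', 148590)],
    ['id', 'name', 'number'])

# gather the names for each id into an array column
df.groupBy('id').agg(F.collect_list('name').alias('names')).show()

This should print one row per id, with the corresponding names collected into a list.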

Upvotes: 2
