Reputation: 1357
I am using Python 2.6.6 and Spark 1.6.0. I have a DataFrame df like this:
id | name  | number |
----------------------
1  | joe   | 148590 |
2  | bob   | 148590 |
2  | steve | 279109 |
3  | sue   | 382901 |
3  | linda | 148590 |
Whenever I try to run something like
df2 = df.groupBy('id','length','type').pivot('id').agg(collect_list('name'))
I get the following error:
pyspark.sql.utils.AnalysisException: u'undefined function collect_list;'
Why is this?
I have also tried:
hive_context = HiveContext(sc)
df2 = df.groupBy('id','length','type').pivot('id').agg(hive_context.collect_list('name'))
and get this error:
AttributeError: 'HiveContext' object has no attribute 'collect_list'
Upvotes: 2
Views: 1596
Reputation: 2311
Here collect_list looks like a user-defined function. The basic PySpark aggregation API only supports a handful of predefined functions such as sum, count, etc.
If you are referring to some other code, please ensure you have the collect_list function defined somewhere. To import the collect_list function, add the line below at the top:
from pyspark.sql import functions as F
And then change your code to:
df.groupBy('id','length','type').pivot('id').agg(F.collect_list('name'))
If you already have it defined, try the snippet below.
df.groupBy('id','length','type').pivot('id').agg({'name':'collect_list'})
Upvotes: 2