Reputation: 629
I am trying to perform an aggregation on my DataFrame in Apache Spark (PySpark).
+----+---+---+
|name|age| id|
+----+---+---+
|Mark| 4| 1|
|Mark| 4| 2|
|Mark| 5| 3|
|Mark| 5| 4|
|Mark| 5| 5|
|Mark| 6| 6|
|Mark| 8| 7|
+----+---+---+
I have the following code that gives me the distinct count of `age` values per name:
old_table.groupby('name').agg(countDistinct('age'))
I try to add a normal count as another output of the aggregation, but it throws an error:
old_table.groupby('name').agg(countDistinct('age'), count('age'))
Error:
NameError: name 'count' is not defined
Is there any way to add count to the distinct count to my output, such that I will have an output table like below?
+----+-------------+-----+
|name|countDistinct|count|
+----+-------------+-----+
|Mark| 4| 7|
+----+-------------+-----+
Upvotes: 5
Views: 13616
Reputation: 401
The `NameError` tells you that `count` is not defined in your namespace: you imported `countDistinct` from `pyspark.sql.functions`, but never imported `count`.
Import it explicitly. Aliasing it (e.g. as `_count`) is optional but keeps it from clashing with any other `count` name in your code:
from pyspark.sql.functions import count as _count
old_table.groupby('name').agg(countDistinct('age'), _count('age'))
Upvotes: 5