Brian Waters

Reputation: 629

agg(count) in Apache Spark not working

I'm trying to perform an aggregation on my dataframe in Apache Spark (PySpark).

+----+---+---+
|name|age| id|
+----+---+---+
|Mark|  4|  1|
|Mark|  4|  2|
|Mark|  5|  3|
|Mark|  5|  4|
|Mark|  5|  5|
|Mark|  6|  6|
|Mark|  8|  7|
+----+---+---+

I have the following code that gives me the distinct count of ages per name:

old_table.groupby('name').agg(countDistinct('age'))

I try to add a normal count as another output of the aggregation, but it throws an error:

old_table.groupby('name').agg(countDistinct('age'), count('age'))

Error:

NameError: name 'count' is not defined

Is there any way to add a plain count alongside the distinct count, so that I get an output table like the one below?

+----+-------------+-----+
|name|countDistinct|count|
+----+-------------+-----+
|Mark|            4|    7|
+----+-------------+-----+

Upvotes: 5

Views: 13616

Answers (1)

Enrique VC

Reputation: 401

The `NameError` means that `count` is not defined in your namespace: unlike `countDistinct`, it was never imported, and `count` is not a Python built-in that Spark could fall back on.

Import the `count` function from `pyspark.sql.functions`. Aliasing it (e.g. as `_count`) is optional, but keeps it from clashing with any other name `count` you may have in scope:

from pyspark.sql.functions import count as _count

old_table.groupby('name').agg(countDistinct('age'), _count('age'))

Upvotes: 5
