makansij

Reputation: 9865

Why does pyspark agg tell me that datatypes are incorrect here?

I have a dataframe with the following types:

>>> mydf.printSchema()
root
 |-- protocol: string (nullable = true)
 |-- source_port: long (nullable = true)
 |-- bytes: long (nullable = true)

And when I try to aggregate it like this:

df_agg = mydf.groupBy('protocol').agg(sum('bytes'))

I am being told:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Now, this does not make sense to me, since printSchema() above shows the types are fine for aggregation.

So, I tried converting it to integer just in case:

from pyspark.sql.types import IntegerType
mydf_converted = mydf.withColumn("converted", mydf["bytes"].cast(IntegerType()))

but still failure:

my_df_agg_converted = mydf_converted.groupBy('protocol').agg(sum('converted'))

TypeError: unsupported operand type(s) for +: 'int' and 'str'

How can I fix this? I looked at this question, but the fix there did not help me at all; I get the same issue: Sum operation on PySpark DataFrame giving TypeError when type is fine

Upvotes: 4

Views: 5052

Answers (2)

MaFF

Reputation: 10086

Python is picking up its built-in sum function instead of the pyspark aggregation function you want. So you are passing the string 'converted' to the built-in sum, which expects an iterable of numbers: it starts from 0, tries to compute 0 + 'c', and raises exactly the TypeError you are seeing.
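
You can reproduce the same error in plain Python, with no Spark involved, since the built-in sum just iterates over the characters of the string:

>>> sum('converted')
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for +: 'int' and 'str'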

Try loading pyspark functions with an alias instead:

import pyspark.sql.functions as psf
my_df_agg_converted = mydf_converted.groupBy('protocol').agg(psf.sum('converted'))

This will tell it to use the pyspark function rather than the built-in one.
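
Applied to the original dataframe, the cast is not needed at all, since bytes is already a long:

import pyspark.sql.functions as psf

# psf.sum is the Spark aggregation; the Python built-in sum stays untouched
df_agg = mydf.groupBy('protocol').agg(psf.sum('bytes'))

An equivalent option is from pyspark.sql.functions import sum as spark_sum, which also avoids shadowing the built-in.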

Upvotes: 15

Pokestar Fan

Reputation: 174

I think you should try converting it to a string.

In the error message, the first type is the one you are using, and the second type is the one it wants.

Upvotes: 0
