Reputation: 9865
I have a dataframe with the following types:
>>> mydf.printSchema()
root
|-- protocol: string (nullable = true)
|-- source_port: long (nullable = true)
|-- bytes: long (nullable = true)
And when I try to aggregate it like this:
df_agg = mydf.groupBy('protocol').agg(sum('bytes'))
I am being told:
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Now, this does not make sense to me, since the printSchema() output above shows the types are fine for aggregation.
So, I tried converting it to an integer just in case:
from pyspark.sql.types import IntegerType
mydf_converted = mydf.withColumn("converted", mydf["bytes"].cast(IntegerType()))
but it still fails:
my_df_agg_converted = mydf_converted.groupBy('protocol').agg(sum('converted'))
TypeError: unsupported operand type(s) for +: 'int' and 'str'
How can I fix this? I looked at this question, but its fix did not help me at all; I hit the same issue: Sum operation on PySpark DataFrame giving TypeError when type is fine
Upvotes: 4
Views: 5052
Reputation: 10086
Python is resolving the name sum to its built-in sum function, not to the PySpark aggregation function you want to use. So you are effectively passing the string 'converted' to the built-in sum, which expects an iterable of numbers: it iterates over the characters of the string and fails as soon as it tries to add the first character to its integer start value of 0, which is exactly where the 'int' and 'str' in the error come from.
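You can reproduce the exact same error in a plain Python shell, with no Spark involved at all:
>>> sum('converted')
TypeError: unsupported operand type(s) for +: 'int' and 'str'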
Try importing the pyspark functions module under an alias instead:
import pyspark.sql.functions as psf
my_df_agg_converted = mydf_converted.groupBy('protocol').agg(psf.sum('converted'))
This tells Python to use the pyspark function rather than the built-in one.
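Alternatively, you can import only the functions you need under names that do not collide with the built-ins. A minimal sketch; the spark_sum alias here is arbitrary, any name that does not shadow a built-in works:
from pyspark.sql.functions import sum as spark_sum
my_df_agg_converted = mydf_converted.groupBy('protocol').agg(spark_sum('converted'))
Be aware that from pyspark.sql.functions import * goes the other way: it shadows the built-in sum with the PySpark one, which can be just as surprising.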
Upvotes: 15
Reputation: 174
I think you should try converting it to a string. In the error message, the first type is the one you are passing and the second type is the one it wants.
Upvotes: 0