Bryce Ramgovind

Reputation: 3257

PySpark - Get the size of each list in group by

I have a very large PySpark DataFrame. I need to group by Person and collect each person's Budget items into a list so that I can perform a further calculation. For example:

a = [('Bob', 562,"Food", "12 May 2018"), ('Bob',880,"Food","01 June 2018"), ('Bob',380,'Household'," 16 June 2018"),  ('Sue',85,'Household'," 16 July 2018"), ('Sue',963,'Household'," 16 Sept 2018")]
df = spark.createDataFrame(a, ["Person", "Amount","Budget", "Date"])

Group By:

import pyspark.sql.functions as F
df_grouped = df.groupby('person').agg(F.collect_list("Budget").alias("data"))

Schema:

root
 |-- person: string (nullable = true)
 |-- data: array (nullable = true)
 |    |-- element: string (containsNull = true)

However, I am getting a memory error when I try to apply a UDF to each person's list. How can I get the size (in megabytes or gigabytes) of each list (data) for each person?

I tried the following, but the size column comes back as null:

import sys
from pyspark.sql.types import DoubleType
size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000, DoubleType())
df_grouped = df_grouped.withColumn("size", size_list_udf("data"))
df_grouped.show()

Output:

+------+--------------------+----+
|person|                data|size|
+------+--------------------+----+
|   Sue|[Household, House...|null|
|   Bob|[Food, Food, Hous...|null|
+------+--------------------+----+

Upvotes: 1

Views: 2864

Answers (1)

pault

Reputation: 43504

You just have one minor issue with your code. sys.getsizeof() returns the size of an object in bytes as an integer. You're dividing this by the integer 1000 to get kilobytes, and in Python 2 dividing one integer by another returns an integer. However, you declared your udf's return type as DoubleType(), so the integer result does not match the declared type and the column comes back as null. The simple fix is to divide by 1000.0.

import sys
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType
# Dividing by 1000.0 makes the lambda return a float, matching DoubleType()
size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000.0, DoubleType())
df_grouped = df_grouped.withColumn("size", size_list_udf("data"))
df_grouped.show(truncate=False)
#+------+-----------------------+-----+
#|person|data                   |size |
#+------+-----------------------+-----+
#|Sue   |[Household, Household] |0.112|
#|Bob   |[Food, Food, Household]|0.12 |
#+------+-----------------------+-----+

I have found that in cases where a udf is returning null, the culprit is very frequently a type mismatch.
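
The mismatch is easy to check outside Spark. Below is a minimal sketch in plain Python, not part of the original code: the variable names are only for illustration, and `//` stands in for Python 2's integer `/` so the snippet behaves the same on Python 3.

```python
import sys

# Illustrative stand-in for one person's collected list
data = ["Food", "Food", "Household"]
size_bytes = sys.getsizeof(data)  # always an int

# Python 2 style integer division (spelled // so it also truncates on Python 3);
# for a short list this truncates to 0
size_kb_int = size_bytes // 1000

# Dividing by a float literal yields a float on both Python 2 and Python 3
size_kb_float = size_bytes / 1000.0

print(type(size_kb_int).__name__, size_kb_int)
print(type(size_kb_float).__name__, size_kb_float)
```

The first value is an int, which is what a udf declared with DoubleType() cannot accept; the second is a float and matches the declared type.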

Upvotes: 1
