D_Tiwari

Reputation: 53

Not able to put aggregated data into memory

I want to put aggregated data into memory, but I am getting an error. Any suggestions?

orders = spark.read.json("/user/order_items_json")

df_2 = orders.where("order_item_order_id == 2").groupby("order_item_order_id")

df_2.persist(StorageLevel.MEMORY_ONLY)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'persist'

Upvotes: 1

Views: 66

Answers (1)

notNull

Reputation: 31530

Spark requires an aggregation expression on grouped data: .groupby() returns a GroupedData object, not a DataFrame, which is why it has no .persist() method.

If you don't need any real aggregation on the grouped data, you can apply a dummy aggregation such as first or count and then drop the resulting column with .select, like below:

import pyspark
from pyspark.sql.functions import first, lit

# dummy aggregation with first(), then drop the aggregated column
df_2 = orders.where("order_item_order_id == 2").groupby("order_item_order_id").agg(first(lit("1"))).select("order_item_order_id")
# or use count() as the dummy aggregation
df_2 = orders.where("order_item_order_id == 2").groupby("order_item_order_id").count().select("order_item_order_id")

df_2.persist(pyspark.StorageLevel.MEMORY_ONLY)
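
Note that persist() is lazy: nothing is actually cached until an action runs against the DataFrame. A minimal follow-up sketch (storageLevel and unpersist() are standard DataFrame members):

df_2.count()              # an action forces the data to be materialized in memory
print(df_2.storageLevel)  # confirms the storage level that was set
df_2.unpersist()          # release the cached data when you're done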

Upvotes: 1
