Reputation: 53
I want to put aggregated data into memory but I'm getting an error. Any suggestions?
from pyspark import StorageLevel

orders = spark.read.json("/user/order_items_json")
df_2 = orders.where("order_item_order_id == 2").groupby("order_item_order_id")
df_2.persist(StorageLevel.MEMORY_ONLY)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'persist'
Upvotes: 1
Views: 66
Reputation: 31530
Spark requires an aggregation expression on grouped data: groupby returns a GroupedData object, not a DataFrame, so it has no persist method.
If you don't need any aggregation on the grouped data, you can apply a dummy aggregation such as first or count and then drop the resulting column with .select,
like below:
import pyspark
from pyspark.sql.functions import first, lit

# dummy aggregation with first(), then drop the aggregated column
df_2 = orders.where("order_item_order_id == 2").groupby("order_item_order_id").agg(first(lit("1"))).select("order_item_order_id")
#or
# dummy aggregation with count(), then drop the count column
df_2 = orders.where("order_item_order_id == 2").groupby("order_item_order_id").count().select("order_item_order_id")
df_2.persist(pyspark.StorageLevel.MEMORY_ONLY)
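Note that persist is lazy: nothing is actually cached until an action runs against df_2. A minimal sketch to trigger the caching and confirm it took effect (count, is_cached, and storageLevel are standard PySpark DataFrame members):

df_2.count()              # persist is lazy; run an action to actually cache the data
print(df_2.is_cached)     # True once the DataFrame is marked for caching
print(df_2.storageLevel)  # shows the configured storage level, here MEMORY_ONLY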
Upvotes: 1