Reputation: 56
I have the data below:
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
|restaurant_id| restaurant_name| city|state|postal_code| stars|review_count|cuisine_name|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
| 62112| Neptune Oyster| Boston| MA| 02113|4.500000000000000000| 5115| American|
| 62112| Neptune Oyster| Boston| MA| 02113|4.500000000000000000| 5115| Thai|
| 60154|Giacomo's Ristora...| Boston| MA| 02113|4.000000000000000000| 3520| Italian|
| 61455|Atlantic Fish Com...| Boston| MA| 02116|4.000000000000000000| 2575| American|
| 57757| Top of the Hub| Boston| MA| 02199|3.500000000000000000| 2273| American|
| 58631| Carmelina's| Boston| MA| 02113|4.500000000000000000| 2250| Italian|
| 58895| The Beehive| Boston| MA| 02116|3.500000000000000000| 2184| American|
| 56517|Lolita Cocina & T...| Boston| MA| 02116|4.000000000000000000| 2179| American|
| 56517|Lolita Cocina & T...| Boston| MA| 02116|4.000000000000000000| 2179| Mexican|
| 58440| Toro| Boston| MA| 02118|4.000000000000000000| 2175| Spanish|
| 58615| Regina Pizzeria| Boston| MA| 02113|4.000000000000000000| 2071| Italian|
| 58723| Gaslight| Boston| MA| 02118|4.000000000000000000| 2056| American|
| 58723| Gaslight| Boston| MA| 02118|4.000000000000000000| 2056| French|
| 60920| Modern Pastry Shop| Boston| MA| 02113|4.000000000000000000| 2042| Italian|
| 59453|Gourmet Dumpling ...| Boston| MA| 02111|3.500000000000000000| 1990| Taiwanese|
| 59453|Gourmet Dumpling ...| Boston| MA| 02111|3.500000000000000000| 1990| Chinese|
| 59204|Russell House Tavern|Cambridge| MA| 02138|4.000000000000000000| 1965| American|
| 60732|Eastern Standard ...| Boston| MA| 02215|4.000000000000000000| 1890| American|
| 60732|Eastern Standard ...| Boston| MA| 02215|4.000000000000000000| 1890| French|
| 56970| Border Café|Cambridge| MA| 02138|4.000000000000000000| 1880| Mexican|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
I want to partition the data by city, state, and cuisine, order each partition by stars and review count, and finally limit the number of records per partition.
Can this be done with PySpark?
Upvotes: 0
Views: 1129
Reputation: 3232
You can add a row_number column to the partitions after windowing and filter on it to limit the records per window. You can control the maximum number of rows per window using the max_number_of_rows_per_partition variable in the code below.
Since your question did not say how you want stars and review_count ordered, I have assumed descending order for both.
import pyspark.sql.functions as F
from pyspark.sql import Window

# Partition by city, state and cuisine; within each partition,
# order by stars, then review_count, both descending.
window_spec = Window.partitionBy("city", "state", "cuisine_name")\
    .orderBy(F.col("stars").desc(), F.col("review_count").desc())

max_number_of_rows_per_partition = 3

# Number the rows in each window, keep the first N, then drop the helper column.
df.withColumn("row_number", F.row_number().over(window_spec))\
    .filter(F.col("row_number") <= max_number_of_rows_per_partition)\
    .drop("row_number")\
    .show(200, False)
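To make the semantics concrete, here is a plain-Python sketch (not PySpark, and not how Spark executes it) of what numbering rows per window and filtering on that number amounts to. The helper name top_n_per_partition is hypothetical, the rows are a handful taken from the question with stars simplified to floats, and the limit of 2 is chosen only to keep the example small.

```python
from collections import defaultdict


def top_n_per_partition(rows, partition_keys, order_keys, n):
    """Keep at most n rows per partition, ordered descending by order_keys.

    Hypothetical helper for illustration; it mimics row_number() over a
    window followed by a filter on row_number <= n.
    """
    # Group rows by their partition key, e.g. (city, state, cuisine_name).
    partitions = defaultdict(list)
    for row in rows:
        partitions[tuple(row[k] for k in partition_keys)].append(row)

    result = []
    for key in partitions:
        # Sort each partition descending by (stars, review_count).
        ordered = sorted(
            partitions[key],
            key=lambda r: tuple(r[k] for k in order_keys),
            reverse=True,
        )
        # Counting from 1 plays the role of row_number().
        for row_number, row in enumerate(ordered, start=1):
            if row_number <= n:
                result.append(row)
    return result


# A few rows from the question (stars simplified to floats for the sketch).
rows = [
    {"restaurant_id": 62112, "city": "Boston", "state": "MA", "stars": 4.5, "review_count": 5115, "cuisine_name": "American"},
    {"restaurant_id": 62112, "city": "Boston", "state": "MA", "stars": 4.5, "review_count": 5115, "cuisine_name": "Thai"},
    {"restaurant_id": 61455, "city": "Boston", "state": "MA", "stars": 4.0, "review_count": 2575, "cuisine_name": "American"},
    {"restaurant_id": 57757, "city": "Boston", "state": "MA", "stars": 3.5, "review_count": 2273, "cuisine_name": "American"},
    {"restaurant_id": 56517, "city": "Boston", "state": "MA", "stars": 4.0, "review_count": 2179, "cuisine_name": "American"},
    {"restaurant_id": 56517, "city": "Boston", "state": "MA", "stars": 4.0, "review_count": 2179, "cuisine_name": "Mexican"},
    {"restaurant_id": 58723, "city": "Boston", "state": "MA", "stars": 4.0, "review_count": 2056, "cuisine_name": "American"},
]

kept = top_n_per_partition(
    rows,
    partition_keys=("city", "state", "cuisine_name"),
    order_keys=("stars", "review_count"),
    n=2,
)
for row in kept:
    print(row["cuisine_name"], row["restaurant_id"], row["stars"], row["review_count"])
```

Note that row_number gives every row in a window a distinct number, so if two restaurants tie on both stars and review_count, which one survives the cut is not guaranteed. If you would rather keep all tied rows, F.rank() or F.dense_rank() over the same window may be a better fit.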
Upvotes: 1