Deependra

Reputation: 56

How to Limit and Partition Data in a PySpark DataFrame

I have the data below:

+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
|restaurant_id|     restaurant_name|     city|state|postal_code|               stars|review_count|cuisine_name|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
|        62112|      Neptune Oyster|   Boston|   MA|      02113|4.500000000000000000|        5115|    American|
|        62112|      Neptune Oyster|   Boston|   MA|      02113|4.500000000000000000|        5115|        Thai|
|        60154|Giacomo's Ristora...|   Boston|   MA|      02113|4.000000000000000000|        3520|     Italian|
|        61455|Atlantic Fish Com...|   Boston|   MA|      02116|4.000000000000000000|        2575|    American|
|        57757|      Top of the Hub|   Boston|   MA|      02199|3.500000000000000000|        2273|    American|
|        58631|         Carmelina's|   Boston|   MA|      02113|4.500000000000000000|        2250|     Italian|
|        58895|         The Beehive|   Boston|   MA|      02116|3.500000000000000000|        2184|    American|
|        56517|Lolita Cocina & T...|   Boston|   MA|      02116|4.000000000000000000|        2179|    American|
|        56517|Lolita Cocina & T...|   Boston|   MA|      02116|4.000000000000000000|        2179|     Mexican|
|        58440|                Toro|   Boston|   MA|      02118|4.000000000000000000|        2175|     Spanish|
|        58615|     Regina Pizzeria|   Boston|   MA|      02113|4.000000000000000000|        2071|     Italian|
|        58723|            Gaslight|   Boston|   MA|      02118|4.000000000000000000|        2056|    American|
|        58723|            Gaslight|   Boston|   MA|      02118|4.000000000000000000|        2056|      French|
|        60920|  Modern Pastry Shop|   Boston|   MA|      02113|4.000000000000000000|        2042|     Italian|
|        59453|Gourmet Dumpling ...|   Boston|   MA|      02111|3.500000000000000000|        1990|   Taiwanese|
|        59453|Gourmet Dumpling ...|   Boston|   MA|      02111|3.500000000000000000|        1990|     Chinese|
|        59204|Russell House Tavern|Cambridge|   MA|      02138|4.000000000000000000|        1965|    American|
|        60732|Eastern Standard ...|   Boston|   MA|      02215|4.000000000000000000|        1890|    American|
|        60732|Eastern Standard ...|   Boston|   MA|      02215|4.000000000000000000|        1890|      French|
|        56970|         Border Café|Cambridge|   MA|      02138|4.000000000000000000|        1880|     Mexican|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+

I want to partition the data by city, state, and cuisine, order by stars and review count, and finally limit the number of records per partition.

Can this be done with PySpark?

Upvotes: 0

Views: 1129

Answers (1)

Nithish

Reputation: 3232

You can add a row_number over the window and filter on it to limit the records per window. You can control the maximum number of rows per window with the max_number_of_rows_per_partition variable in the code below.

Since your question does not specify how stars and review_count should be ordered, I have assumed descending order for both.

import pyspark.sql.functions as F
from pyspark.sql import Window

# Window over each (city, state, cuisine_name) group, ordered so the
# best-rated and most-reviewed restaurants come first.
window_spec = Window.partitionBy("city", "state", "cuisine_name")\
                    .orderBy(F.col("stars").desc(), F.col("review_count").desc())

max_number_of_rows_per_partition = 3

# Number the rows within each window, keep only the top N, then drop the helper column.
df.withColumn("row_number", F.row_number().over(window_spec))\
  .filter(F.col("row_number") <= max_number_of_rows_per_partition)\
  .drop("row_number")\
  .show(200, False)
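
As a quick sanity check, here is a minimal, self-contained sketch of the same approach (assuming a local SparkSession and a handful of rows copied from your sample, with stars simplified to plain floats):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.master("local[*]").appName("limit-per-partition").getOrCreate()

# A few rows modeled on the question's sample data.
sample_rows = [
    (62112, "Neptune Oyster", "Boston", "MA", "02113", 4.5, 5115, "American"),
    (57757, "Top of the Hub", "Boston", "MA", "02199", 3.5, 2273, "American"),
    (58895, "The Beehive", "Boston", "MA", "02116", 3.5, 2184, "American"),
    (58723, "Gaslight", "Boston", "MA", "02118", 4.0, 2056, "American"),
    (58631, "Carmelina's", "Boston", "MA", "02113", 4.5, 2250, "Italian"),
    (58615, "Regina Pizzeria", "Boston", "MA", "02113", 4.0, 2071, "Italian"),
]
columns = ["restaurant_id", "restaurant_name", "city", "state",
           "postal_code", "stars", "review_count", "cuisine_name"]
df = spark.createDataFrame(sample_rows, columns)

window_spec = Window.partitionBy("city", "state", "cuisine_name")\
                    .orderBy(F.col("stars").desc(), F.col("review_count").desc())

max_number_of_rows_per_partition = 3

# There are 4 American rows for Boston, MA, so only the 3 highest-ranked
# (Neptune Oyster, Gaslight, Top of the Hub) survive the filter.
df.withColumn("row_number", F.row_number().over(window_spec))\
  .filter(F.col("row_number") <= max_number_of_rows_per_partition)\
  .drop("row_number")\
  .show(truncate=False)

If you want ties on stars and review_count to all be kept, you can use F.rank() or F.dense_rank() in place of F.row_number().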

Upvotes: 1
