Josh

Reputation: 768

How to split a PySpark dataframe while limiting the number of rows?

I am sending data from a dataframe to an API that has a limit of 50,000 rows. Say my dataframe has 70,000 rows; how can I split it into separate dataframes, each with a maximum of 50,000 rows? The splits do not have to be even, and the data order does not matter.

Upvotes: 6

Views: 16952

Answers (3)

xpnerd

Reputation: 11

An addition to @frosty's answer:

limited_df = original_df.limit(50000).cache()
rest_df = original_df.subtract(limited_df)

.cache() is advised for consistency: without it, limited_df and rest_df can end up with overlapping rows. That behaviour occurs because PySpark evaluates .limit() lazily and re-runs it in a distributed fashion for each use (once for limited_df and once when computing rest_df), and it is not guaranteed to return the same rows each time.
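A minimal end-to-end sketch of this pattern (the 70k-row example dataframe and the send_to_api placeholder are illustrative, not part of the original question; note that .cache() itself is lazy, so an action is needed before the subtract):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# illustrative stand-in for the 70k-row original_df from the question
original_df = spark.range(70000).withColumnRenamed("id", "ID")

limited_df = original_df.limit(50000).cache()
limited_df.count()  # action to materialize the cache so both uses see the same rows

rest_df = original_df.subtract(limited_df)

# send_to_api(limited_df)  # placeholder for the actual API call
# send_to_api(rest_df)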

p.s. reason for separate answer: I couldn't comment yet.

Upvotes: 1

Shubham Jain

Reputation: 5526

You can achieve this by using row_number and then filtering one chunk of 50,000 rows at a time:

from pyspark.sql import Window
import pyspark.sql.functions as f

# order by any column to populate the row number
window = Window.orderBy('ID')
length = df1.count()
df2 = df1.withColumn('row', f.row_number().over(window))

step = 50000
for i in range(1, length + 1, step):
    df3 = df2.filter((f.col('row') >= i) & (f.col('row') <= i + step - 1))
    # perform your API call here; df3 contains at most 50,000 rows at a time
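As a usage sketch (the 70k-row example dataframe below is illustrative), creating df1 like this and running the snippet above yields two chunks, rows 1-50000 and rows 50001-70000, each within the API limit:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# illustrative stand-in for df1; it only needs an 'ID' column to order by
df1 = spark.range(70000).withColumnRenamed("id", "ID")

Note that Window.orderBy without a partitionBy moves all rows into a single partition to compute row_number, which is fine at this scale but worth keeping in mind for much larger dataframes.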

Upvotes: 3

Frosty

Reputation: 698

A workaround for this is to use the .limit() function. Let's say your main dataframe with 70k rows is original_df. You can first do

limited_df = original_df.limit(50000)

to get the first 50k rows, and then for the remaining rows you can do

original_df.subtract(limited_df)

and you will get the remaining rows. You can apply .limit() to the subtracted dataframe again if needed.

UPDATED: This works for any number of rows present in the dataframe. Say at some point your dataframe has 30,000 rows; calling df.limit(50000) will not throw an error and will simply return the 30,000 rows present in the dataframe.
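Putting this together, a rough sketch of a loop that generalizes the limit/subtract idea to any row count (the example dataframe, batch_size name, and send_to_api placeholder are illustrative; the .cache()/.count() calls follow the note in the other answer about keeping limit and subtract consistent):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# illustrative 70k-row stand-in for original_df
original_df = spark.range(70000).withColumnRenamed("id", "ID")

batch_size = 50000
remaining_df = original_df

while remaining_df.count() > 0:
    batch_df = remaining_df.limit(batch_size).cache()
    batch_df.count()          # materialize the cache so the subtract below reuses the same rows
    # send_to_api(batch_df)   # placeholder for the actual API call
    # note: .subtract() also removes duplicate rows, so this assumes rows are unique
    remaining_df = remaining_df.subtract(batch_df)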

Upvotes: 3
