Reputation: 768
I am sending data from a dataframe to an API that has a limit of 50,000 rows. Say my dataframe has 70,000 rows, how can I split it into separate dataframes, each with a max row count of 50,000? These do not have to be even and the data order does not matter.
Upvotes: 6
Views: 16952
Reputation: 11
Addition to @frosty's answer:
limited_df = original_df.limit(50000).cache()
rest_df = original_df.subtract(limited_df)
.cache() is advised for consistency, because without it limited_df and rest_df can end up with overlapping rows. That behaviour is due to PySpark evaluating .limit() in a distributed way and more than once (once for limited_df and once for rest_df).
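As a quick sanity check (a minimal sketch, assuming original_df has no duplicate rows, since subtract is a set difference), you can confirm that the two splits cover the data without overlapping:

# With limited_df cached, subtract runs against the same 50k rows that were
# selected above, so the two splits together should cover original_df exactly.
assert limited_df.count() + rest_df.count() == original_df.count()

# intersect returns rows present in both DataFrames; it should be empty here.
assert limited_df.intersect(rest_df).count() == 0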
P.S. The reason for a separate answer: I couldn't comment yet.
Upvotes: 1
Reputation: 5526
You can achieve this by using row_number and then splitting the data every 50,000 rows:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Order by any column to populate the row number
window = Window.orderBy('ID')
length = df1.count()
df2 = df1.withColumn('row', f.row_number().over(window))

step = 50000
for i in range(1, length, step):
    df3 = df2.filter((f.col('row') >= i) & (f.col('row') <= i + step - 1))
    # Here perform your API call, as df3 will contain at most 50000 rows at a time
Upvotes: 3
Reputation: 698
A workaround for this is to use the .limit() function. Say your main df with 70k rows is original_df. You can do
limited_df = original_df.limit(50000)
the very first time to get the first 50k rows, and for the remaining rows you can do
original_df.subtract(limited_df)
which gives you everything that is left. You can even apply .limit() to the subtracted df again if needed.
UPDATED: This works for any number of rows in the dataframe. Say at some point your dataframe only has 30,000 rows; calling .limit(50000) on it is not going to throw any error and will simply return the 30k rows that are present.
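Putting this together, here is a minimal sketch of a full send loop based on the same idea (send_to_api is a hypothetical placeholder for your API call, the main DataFrame is assumed to be original_df with no duplicate rows since subtract is a set difference, and .cache() is added to keep each chunk consistent, as noted in the earlier answer):

# Hypothetical helper that posts one chunk of at most 50,000 rows to the API.
def send_to_api(chunk_df):
    ...

remaining_df = original_df.cache()
while remaining_df.count() > 0:
    chunk_df = remaining_df.limit(50000).cache()  # at most 50k rows per chunk
    send_to_api(chunk_df)
    # Remove the rows that were just sent; caching chunk_df keeps the
    # subtraction consistent with what was actually sent.
    remaining_df = remaining_df.subtract(chunk_df).cache()

In a long-running loop you may also want to call .unpersist() on DataFrames you no longer need, so cached chunks do not pile up in executor memory.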
Upvotes: 3