kanishk kashyap

Reputation: 63

Split a PySpark DataFrame into smaller DataFrames by number of records

I'm working with a PySpark DataFrame of around 100,000 records, and I want to create new DataFrames of around 20,000 records each. How can I achieve this?

Upvotes: 1

Views: 323

Answers (1)

Luiz Viola

Reputation: 2436

It could be made dynamic (a sketch of that is shown after the code below), but here is a lazy, hard-coded way to do it:

# Creates a random DF with 100000 rows
from pyspark.sql import functions as F
df = spark.range(0, 100000).withColumn('rand_col', F.rand()).drop('id')


from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

# Ordering by a literal puts every row in one window partition; fine here,
# but it can be slow for very large DataFrames
w = Window().orderBy(lit('A'))
df = df.withColumn("index", row_number().over(w))  # creates an index column to split the DF

df1 = df.filter(F.col('index') < 20001)
df2 = df.filter((F.col('index') >= 20001) & (F.col('index') < 40001))
df3 = df.filter((F.col('index') >= 40001) & (F.col('index') < 60001))
df4 = df.filter((F.col('index') >= 60001) & (F.col('index') < 80001))
df5 = df.filter((F.col('index') >= 80001) & (F.col('index') < 100001))
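As a rough sketch of the "dynamic" version mentioned above, the chunk boundaries can be computed from the row count instead of being hard-coded. This assumes the same df and index column from the snippet above; chunk_size, num_chunks, and chunks are illustrative names, not part of the original answer.

from pyspark.sql import functions as F  # same alias as above

chunk_size = 20000
total = df.count()
num_chunks = (total + chunk_size - 1) // chunk_size  # ceiling division

# One DataFrame per chunk of chunk_size rows, using the index column created above
chunks = [
    df.filter(
        (F.col('index') > i * chunk_size) & (F.col('index') <= (i + 1) * chunk_size)
    )
    for i in range(num_chunks)
]

Note that df.count() triggers an extra pass over the data, so this trades one additional job for not having to hard-code the number of splits.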

Upvotes: 2
