Reputation: 63
I'm working with a PySpark DataFrame of around 100000 records, and I want to split it into new DataFrames of around 20000 records each. How can I achieve this?
Upvotes: 1
Views: 323
Reputation: 2436
This can be made dynamic, but here is a quick, hard-coded way to do it:
# Create a sample DataFrame with 100000 rows of random values
from pyspark.sql import functions as F
df = spark.range(0, 100000).withColumn('rand_col', F.rand()).drop('id')  # range end is exclusive, so this yields exactly 100000 rows
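from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window
# Ordering by a constant gives an arbitrary but gap-free numbering; note that a window without partitionBy moves all rows to a single partition
w = Window.orderBy(lit('A'))
df = df.withColumn("index", row_number().over(w))  # creates a 1-based index column to split the DF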
df1 = df.filter(F.col('index') < 20001)
df2 = df.filter((F.col('index') >= 20001) & (F.col('index') < 40001))
df3 = df.filter((F.col('index') >= 40001) & (F.col('index') < 60001))
df4 = df.filter((F.col('index') >= 60001) & (F.col('index') < 80001))
df5 = df.filter((F.col('index') >= 80001) & (F.col('index') < 100001))
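For the "dynamic" variant mentioned above, one possible sketch is to compute the index ranges from the row count instead of hard-coding them. The helper names `chunk_bounds` and `split_by_index` are hypothetical (they are not part of PySpark), and the code assumes the DataFrame already has a 1-based, gap-free `index` column like the one created with `row_number()`.

```python
import math


def chunk_bounds(total_rows, chunk_size):
    """Return inclusive, 1-based (start, end) index ranges covering total_rows."""
    n_chunks = math.ceil(total_rows / chunk_size)
    return [
        (i * chunk_size + 1, min((i + 1) * chunk_size, total_rows))
        for i in range(n_chunks)
    ]


def split_by_index(df, chunk_size, index_col="index"):
    """Split a DataFrame with a 1-based index column into a list of DataFrames.

    Hypothetical helper: assumes index_col holds consecutive values 1..N.
    """
    # Imported here so chunk_bounds can be used without a Spark installation
    from pyspark.sql import functions as F

    total_rows = df.count()
    return [
        # between() is inclusive on both ends
        df.filter(F.col(index_col).between(start, end))
        for start, end in chunk_bounds(total_rows, chunk_size)
    ]


# Usage, assuming df is the indexed DataFrame from the answer:
# parts = split_by_index(df, 20000)
```

If the chunks only need to be approximately equal in size, `df.randomSplit([0.2] * 5)` is a built-in alternative that avoids the index column entirely.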
Upvotes: 2