Reputation: 383
I have a dataframe:
data = [{"ID": 'asriыjewfsldflsar2','val':5},
{"ID": 'dsgвarwetreg','val':89},
{"ID": 'ewrсt43gdfb','val':36},
{"ID": 'q23м4534tgsdg','val':58},
{"ID": '34tя5erbdsfgv','val':52},
{"ID": '43t2ghnaef','val':123},
{"ID": '436tываhgfbds','val':457},
{"ID": '435t5вч3htrhnbszfdhb','val':54},
{"ID": '35yteвrhfdhbxfdbn','val':1},
{"ID": '345ghаывserh','val':1},
{"ID": 'asrijываewfsldflsar2','val':5},
{"ID": 'dsgarываwetreg','val':89},
{"ID": 'ewrt433gdfb','val':36},
{"ID": 'q2345выа34tgsdg','val':58},
{"ID": '34t5eоrbdsfgv','val':52},
{"ID": '43tghолnaef','val':123},
{"ID": '436thапgfbds','val':457},
{"ID": '435t5укн3htrhnbszfdhb','val':54},
{"ID": '35ytк3erhfdhbxfdbn','val':1},
{"ID": '345g244hserh','val':1}
]
df = spark.createDataFrame(data)
I want to split the rows into 4 groups. Previously I did this with row_number():
.withColumn('part', F.row_number().over(Window.orderBy(F.lit(1))) % n)  # n = number of groups, 4 here
Unfortunately this approach no longer works for me: the window has no partitioning, so all rows are pulled onto a single executor and my large dataframe does not fit into memory. I tried to use the hash function instead, but I think I'm doing it wrong:
df2 = df.withColumn('hashed_name', (F.hash('ID') % N))\
.withColumn('divide',F.floor(F.col('hashed_name')/13))\
.sort('divide')
Is there another way to split the rows, besides row_number()?
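To make the intent concrete, something along these lines is what I am after (a minimal sketch; pmod is my guess for keeping the bucket id non-negative, since hash() can return negative values, and the column name part is just illustrative):
from pyspark.sql import functions as F

n = 4  # number of groups
# pmod(hash(ID), n) maps every row to a bucket id in [0, n)
df_bucketed = df.withColumn('part', F.expr(f"pmod(hash(ID), {n})"))
df_bucketed.groupBy('part').count().show()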
Upvotes: 0
Views: 459
Reputation: 1224
Hi, you can use coalesce() to force an exact number of partitions, and later you can use the partition number in future queries.
df1 = df.coalesce(4)  # force exactly 4 partitions
df1.createOrReplaceTempView('df')
espsql = "select x.*, spark_partition_id() as part from df x"
df_new = spark.sql(espsql)
df_new.createOrReplaceTempView('df_new')  # register so the next SQL query can see it
newsql = "select distinct part from df_new"
spark.sql(newsql).take(5)
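For example (a minimal sketch; the part value 2 is just illustrative), a later query can then restrict itself to a single group:
# work with only one of the four groups
onepart = spark.sql("select * from df_new where part = 2")
onepart.count()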
Upvotes: 0
Reputation: 110
You can use partitionBy() when saving the dataframe in Delta format.
df.write.partitionBy("ColumnName").format("delta").mode("overwrite").save("path_to_save_the_dataframe")
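Since this dataframe has no natural grouping column, one option (a minimal sketch, reusing the pmod-on-hash idea from the question; the column name part and the output path are just illustrative) is to derive a group column first and partition the output by it:
from pyspark.sql import functions as F

# derive a 4-way group column, then write one folder per group
out = df.withColumn("part", F.expr("pmod(hash(ID), 4)"))
out.write.partitionBy("part").format("delta").mode("overwrite").save("path_to_save_the_dataframe")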
hope this helps!
Upvotes: 1