Rory

Reputation: 383

Partition a PySpark DataFrame into groups

I have a dataframe:

data = [{"ID": 'asriыjewfsldflsar2','val':5},
        {"ID": 'dsgвarwetreg','val':89},
        {"ID": 'ewrсt43gdfb','val':36},
        {"ID": 'q23м4534tgsdg','val':58},
        {"ID": '34tя5erbdsfgv','val':52},
        {"ID": '43t2ghnaef','val':123},
        {"ID": '436tываhgfbds','val':457},
        {"ID": '435t5вч3htrhnbszfdhb','val':54},
        {"ID": '35yteвrhfdhbxfdbn','val':1},
        {"ID": '345ghаывserh','val':1},
        {"ID": 'asrijываewfsldflsar2','val':5},
        {"ID": 'dsgarываwetreg','val':89},
        {"ID": 'ewrt433gdfb','val':36},
        {"ID": 'q2345выа34tgsdg','val':58},
        {"ID": '34t5eоrbdsfgv','val':52},
        {"ID": '43tghолnaef','val':123},
        {"ID": '436thапgfbds','val':457},
        {"ID": '435t5укн3htrhnbszfdhb','val':54},
        {"ID": '35ytк3erhfdhbxfdbn','val':1},
        {"ID": '345g244hserh','val':1}
        ]
df = spark.createDataFrame(data)

I want to split the rows into 4 groups. I used to do this with row_number():

.withColumn('part', F.row_number().over(Window.orderBy(F.lit(1))) % n)

But unfortunately this method does not suit me: without a partitionBy, the window moves all rows into a single partition, and my DataFrame is too large to fit in memory that way. I tried using the hash function instead, but I think I'm doing it wrong:

df2 = (df.withColumn('hashed_name', F.hash('ID') % N)
         .withColumn('divide', F.floor(F.col('hashed_name') / 13))
         .sort('divide'))
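For reference, one pitfall in this attempt: F.hash can return negative integers, so hash('ID') % N may produce negative buckets, and the floor(... / 13) step has no relation to getting 4 groups. A minimal sketch of the hash-bucket idea using Spark SQL's pmod, which keeps the bucket non-negative (N = 4 and the column name 'part' are assumptions):

from pyspark.sql import functions as F

N = 4  # desired number of groups (assumption)

# pmod(a, b) is always non-negative, unlike % applied to a negative hash
df2 = df.withColumn('part', F.expr(f"pmod(hash(ID), {N})"))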

Maybe there is another way to split the rows besides row_number()?

Upvotes: 0

Views: 459

Answers (2)

zlidime

Reputation: 1224

You can use coalesce() to bring the DataFrame down to the desired number of partitions (note that coalesce() only reduces the partition count, it will not increase it), and later use the partition number in subsequent queries.

df1 = df.coalesce(4)                      # at most 4 partitions
df1.createOrReplaceTempView('df')
espsql = "select x.*, spark_partition_id() as part from df x"
df_new = spark.sql(espsql)
newsql = "select distinct part from df_new"
spark.sql(newsql).take(5)                 # list the distinct partition ids
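The same thing can be done without a temp view, using the DataFrame API directly; a short sketch, assuming df is defined as in the question:

from pyspark.sql import functions as F

df1 = df.coalesce(4)                               # at most 4 partitions
df_new = df1.withColumn('part', F.spark_partition_id())
df_new.select('part').distinct().show()            # inspect the partition ids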

Upvotes: 0

Robinhood

Reputation: 110

You can use partitionBy() when saving the DataFrame in Delta format.

df.write.partitionBy("ColumnName").format("delta").mode("overwrite").save("path_to_save_the_dataframe")
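Since the question's data has no natural 4-way column, one option is to derive a bucket column first and partition the output by it. A sketch under assumptions: the bucket column name 'part' and the output path are placeholders, and writing Delta requires the delta-spark package to be configured on the cluster.

from pyspark.sql import functions as F

(df.withColumn('part', F.expr("pmod(hash(ID), 4)"))   # derive a 4-way bucket (assumption)
   .write.partitionBy('part')
   .format('delta')
   .mode('overwrite')
   .save('path_to_save_the_dataframe'))              # placeholder path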

Hope this helps!

Upvotes: 1
