Rory

Reputation: 383

Partition a PySpark DataFrame into groups

I have a dataframe:

data = [{"ID": 'asriыjewfsldflsar2','val':5},
        {"ID": 'dsgвarwetreg','val':89},
        {"ID": 'ewrсt43gdfb','val':36},
        {"ID": 'q23м4534tgsdg','val':58},
        {"ID": '34tя5erbdsfgv','val':52},
        {"ID": '43t2ghnaef','val':123},
        {"ID": '436tываhgfbds','val':457},
        {"ID": '435t5вч3htrhnbszfdhb','val':54},
        {"ID": '35yteвrhfdhbxfdbn','val':1},
        {"ID": '345ghаывserh','val':1},
        {"ID": 'asrijываewfsldflsar2','val':5},
        {"ID": 'dsgarываwetreg','val':89},
        {"ID": 'ewrt433gdfb','val':36},
        {"ID": 'q2345выа34tgsdg','val':58},
        {"ID": '34t5eоrbdsfgv','val':52},
        {"ID": '43tghолnaef','val':123},
        {"ID": '436thапgfbds','val':457},
        {"ID": '435t5укн3htrhnbszfdhb','val':54},
        {"ID": '35ytк3erhfdhbxfdbn','val':1},
        {"ID": '345g244hserh','val':1}
        ]
df = spark.createDataFrame(data)

I want to split the rows into 4 groups. I used to do this with row_number():

.withColumn('part', F.row_number().over(Window.orderBy(F.lit(1))) % n)

But unfortunately this method does not suit me: without a partitionBy, the window moves all rows into a single partition, and my DataFrame is too large to fit in memory that way. I tried using the hash function instead, but I think I'm doing it wrong:

df2 = (df.withColumn('hashed_name', F.hash('ID') % N)
         .withColumn('divide', F.floor(F.col('hashed_name') / 13))
         .sort('divide'))
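For reference, one pitfall in this attempt: F.hash can return negative integers, so hash('ID') % N may produce negative buckets, and the floor(... / 13) step has no relation to getting 4 groups. A minimal sketch of the hash-bucket idea using Spark SQL's pmod, which keeps the bucket non-negative (N = 4 and the column name 'part' are assumptions):

from pyspark.sql import functions as F

N = 4  # desired number of groups (assumption)

# pmod(a, b) is always non-negative, unlike % applied to a negative hash
df2 = df.withColumn('part', F.expr(f"pmod(hash(ID), {N})"))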

Maybe there is another way to split the rows besides row_number()?

Upvotes: 0

Views: 459

Answers (2)

zlidime

Reputation: 1224

You can use coalesce() to bring the DataFrame down to the desired number of partitions (note that coalesce() only reduces the partition count, it will not increase it), and later use the partition number in subsequent queries.

df1 = df.coalesce(4)                      # at most 4 partitions
df1.createOrReplaceTempView('df')
espsql = "select x.*, spark_partition_id() as part from df x"
df_new = spark.sql(espsql)
newsql = "select distinct part from df_new"
spark.sql(newsql).take(5)                 # list the distinct partition ids
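The same thing can be done without a temp view, using the DataFrame API directly; a short sketch, assuming df is defined as in the question:

from pyspark.sql import functions as F

df1 = df.coalesce(4)                               # at most 4 partitions
df_new = df1.withColumn('part', F.spark_partition_id())
df_new.select('part').distinct().show()            # inspect the partition ids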

Upvotes: 0

Robinhood

Reputation: 110

You can use partitionBy() when saving the DataFrame in Delta format.

df.write.partitionBy("ColumnName").format("delta").mode("overwrite").save("path_to_save_the_dataframe")
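Since the question's data has no natural 4-way column, one option is to derive a bucket column first and partition the output by it. A sketch under assumptions: the bucket column name 'part' and the output path are placeholders, and writing Delta requires the delta-spark package to be configured on the cluster.

from pyspark.sql import functions as F

(df.withColumn('part', F.expr("pmod(hash(ID), 4)"))   # derive a 4-way bucket (assumption)
   .write.partitionBy('part')
   .format('delta')
   .mode('overwrite')
   .save('path_to_save_the_dataframe'))              # placeholder path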

Hope this helps!

Upvotes: 1
