Rohit Kulkarni

Reputation: 65

Pivot issue in Databricks

I have a dataframe with the following values:

id      Country               Interest
00    Russian         Digestion;Destillation

I want to pivot the Interest column and name the new column in Azure Databricks using Python, like this:

id    Country     Int                  Interest
00Q7  Russ        Digestion            Digestion;Destillation
00Q7  Russ        Destillation         Digestion;Destillation 

Please advise how this can be done.

Regards RK

Upvotes: 0

Views: 234

Answers (1)

Saideep Arikontham

Reputation: 6104

I have created a sample dataframe similar to yours using the following code:

data = [['00Q7', 'Russian Federation', 'Digestion;Destillation'], ['00Q6', 'United States', 'Oils;Automobiles']]

df = spark.createDataFrame(data=data, schema=['id', 'country', 'interests'])
display(df)

(screenshot of display(df): the sample dataframe with id, country and interests columns)

  • To get the desired output, I first split the interests column using pyspark.sql.functions.split.
from pyspark.sql.functions import split, col

df1 = df.withColumn("interest", split(col("interests"), ";"))
display(df1)

(screenshot of display(df1): the new interest column holds an array, e.g. ['Digestion', 'Destillation'])

  • Now I have exploded the new interest column using pyspark.sql.functions.explode to get the required output.
from pyspark.sql.functions import explode

op = df1.withColumn('interest', explode(col('interest')))
display(op)

(screenshot of display(op): one row per id/country/interest combination, with the original interests column kept)
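For reference, the split and explode steps can also be combined into a single withColumn call; a minimal sketch using the same df as above:

from pyspark.sql.functions import split, explode, col

# split on ';' and explode the resulting array in a single step
display(df.withColumn('interest', explode(split(col('interests'), ';'))))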

UPDATE:

If the interests carry a numeric prefix such as 01_/02_, the same split and explode steps apply, and the prefix can be used to build a set column:

data = [['00Q7', 'Russian Federation', '01_Digestion;02_Destillation']]

df = spark.createDataFrame(data=data, schema=['id', 'country', 'interests'])
#display(df)

from pyspark.sql.functions import split, col
df1 = df.withColumn("interest", split(col("interests"), ";"))

from pyspark.sql.functions import explode
op = df1.withColumn('interest', explode(col('interest')))

#UPDATE
from pyspark.sql.functions import concat, lit

op.withColumn("set", concat(lit('Set'), split(col('interest'), '_').getItem(0))).show(truncate=False)

(output: the set column contains Set01 and Set02 for the 01_ and 02_ prefixes)

UPDATE-2:

To drop the leading zero from the set number (e.g. Set1 instead of Set01), the same column can be built on a pandas version of the dataframe:

import numpy as np

# assuming pdf is a pandas version of the dataframe, e.g. pdf = op.toPandas()
pdf['set'] = pdf['interest'].str.split('_').str[0]

# drop the leading zero for single-digit set numbers (Set1 instead of Set01)
pdf["set"] = np.where(pdf["set"].astype(int) < 10, 'Set' + pdf['set'].str[1], 'Set' + pdf['set'])

Upvotes: 1
