Reputation: 67
I have data like this:
columnname | value | table |
---|---|---|
a | 1 | X |
b | 2 | X |
a | 3 | X |
b | 4 | X |
a | 5 | X |
b | 6 | X |
and need to transform it into:
table | a | b |
---|---|---|
X | 1 | 2 |
X | 3 | 4 |
X | 5 | 6 |
Upvotes: 0
Views: 118
Reputation: 2739
I tried to reproduce this in my environment and got the results below. Please follow the code below.
First, I created a sample DataFrame matching the requirement above:
rdd = sc.parallelize(
    [
        ("a", 1, "X"), ("b", 2, "X"), ("a", 3, "X"), ("b", 4, "X"), ("a", 5, "X"), ("b", 6, "X")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["columnname", "value", "table"])
df_data.show()
Then I converted the Spark DataFrame to a pandas-on-Spark DataFrame with to_pandas_on_spark(), added a row counter, and used groupby and pivot to rotate the values from rows into columns.
Code:
from pyspark.sql.functions import lit

# Convert the Spark DataFrame to a pandas-on-Spark DataFrame
pdf = df_data.to_pandas_on_spark()

# Build a counter that pairs every two rows: [1, 1, 2, 2, 3, 3]
req = []
for i in range(1, len(pdf) // 2 + 1):
    req.extend([i, i])
print(req)

pdf['counter'] = req
print(pdf)

# Convert back to a Spark DataFrame
df1 = pdf.to_spark()
df1.show()

# Group by the counter, pivot on columnname, take the max value,
# and add the table column back
df1.groupby('counter').pivot('columnname').max('value').withColumn('table', lit('X')).select('table', 'a', 'b').show()
Running the above code, I got the expected output.
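For larger data, a possible alternative (a sketch, not part of the original answer) is to derive the counter with a window function instead of building a Python list on the driver. It assumes the rows within each columnname pair up in ascending order of value, since Spark does not guarantee input row order:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows within each (table, columnname) group by value,
# so the nth 'a' lines up with the nth 'b'
w = Window.partitionBy('table', 'columnname').orderBy('value')
df2 = df_data.withColumn('counter', F.row_number().over(w))

# Pivot on columnname, keep the table column, and drop the counter
df2.groupBy('table', 'counter').pivot('columnname').max('value').select('table', 'a', 'b').show()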
Upvotes: 1