Arun
Arun

Reputation: 67

Raw Data Transformation

I'm having data like

columnname value table
a 1 X
b 2 X
a 3 X
b 4 X
a 5 X
b 6 X

and need to transform into

table a b
X 1 2
X 3 4
X 5 6

Upvotes: 0

Views: 118

Answers (2)

Vamsi Bitra
Vamsi Bitra

Reputation: 2739

I tried to reproduce the same in my environment and I got below results:

Please follow below code:

I created sample data frame as per above requirement:

rdd = sc.parallelize(
    [
        ("a",1,"X"), ("b",2,"X"), ("a",3,"X"),("b",4,"X"),("a",5,"X"),("b",6,"X")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["columnname","value","table"])    
df_data.show() 

enter image description here

Then, I Converted the spark Dataframe to pandas Dataframe using toPandas() and I tried with pivot and groupby functions to rotate data from one column to multiple columns and also it will transform rows to columns.

Code:

from pyspark.sql.functions import lit

#converted dataframe to pandas
pdf = df_data.to_pandas_on_spark()
req=[]
for i in range(1,(len(pdf)//2+1)):
    req.extend([i,i])
print(req)

pdf['counter']=req
print(pdf)

df1 = pdf.to_spark()
df1.show()
    
# From above dataframe to get the maximum values I am using group by counter,pivot by columnname. max by value with columns

df1.groupby('counter').pivot('columnname').max('value').withColumn('table',lit('x')).select('table','a','b').show()

Running above code. I got this output:

enter image description here

enter image description here

Upvotes: 1

DannySlor
DannySlor

Reputation: 4620

select tbl
      ,a
      ,b
from
(
select *, (row_number() over(order by value)-1)/2 as rn
from   t
) t 
pivot(max(value) for columnname in(a, b)) p 
tbl a b
X 1 2
X 3 4
X 5 6

Fiddle

Upvotes: 1

Related Questions