I'm having data like columnname value table a 1 X b 2 X a 3 X b 4 X a 5 X b 6 X and need to transform into table a b X 1 2 X 3 4 X 5 6

python-3.xapache-sparkpysparkapache-spark-sqlazure-databricks

Reputation: 67

Raw Data Transformation

I'm having data like

columnname	value	table
a	1	X
b	2	X
a	3	X
b	4	X
a	5	X
b	6	X

and need to transform into

table	a	b
X	1	2
X	3	4
X	5	6

Upvotes: 0

Answers (2)

Vamsi Bitra

Reputation: 2739

I tried to reproduce the same in my environment and I got below results:

Please follow below code:

I created sample data frame as per above requirement:

rdd = sc.parallelize(
    [
        ("a",1,"X"), ("b",2,"X"), ("a",3,"X"),("b",4,"X"),("a",5,"X"),("b",6,"X")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["columnname","value","table"])    
df_data.show()

enter image description here

Then, I Converted the spark Dataframe to pandas Dataframe using toPandas() and I tried with pivot and groupby functions to rotate data from one column to multiple columns and also it will transform rows to columns.

Code:

from pyspark.sql.functions import lit

#converted dataframe to pandas
pdf = df_data.to_pandas_on_spark()
req=[]
for i in range(1,(len(pdf)//2+1)):
    req.extend([i,i])
print(req)

pdf['counter']=req
print(pdf)

df1 = pdf.to_spark()
df1.show()
    
# From above dataframe to get the maximum values I am using group by counter,pivot by columnname. max by value with columns

df1.groupby('counter').pivot('columnname').max('value').withColumn('table',lit('x')).select('table','a','b').show()

Running above code. I got this output:

Upvotes: 1

DannySlor

Reputation: 4620

select tbl
      ,a
      ,b
from
(
select *, (row_number() over(order by value)-1)/2 as rn
from   t
) t 
pivot(max(value) for columnname in(a, b)) p

tbl	a	b
X	1	2
X	3	4
X	5	6

Fiddle

Upvotes: 1

Raw Data Transformation

Answers (2)

Related Questions