Reputation: 97
I have data in the below format with 38 measure columns for various months as shown below.
+---------+-----------------+-----------------+------+------------------+------------------+------------------+---------+------------------+
| Cust_No | Measure1_month1 | Measure1_month2 | .... | Measure1_month72 | Measure2_month_1 | Measure2_month_2 | ….so on | Measure2_month72 |....Measure38_month1...
+---------+-----------------+-----------------+------+------------------+------------------+------------------+---------+------------------+
| 1 | 10 | 20 | …. | 500 | 40 | 50 | … | |
| 2 | 20 | 40 | …. | 800 | 70 | 150 | … | |
+---------+-----------------+-----------------+------+------------------+------------------+------------------+---------+------------------+
I want to achieve the below format using PYSPARK.
+---------+-------+----------+----------+
| CustNum | Month | Measure1 | Measure2.......measure38 |
+---------+-------+----------+----------+
| 1 | 1 | 10 | 30 |
| 1 | 2 | 20 | 40 |
| 1 | 3 | 30 | 80 |
| 1 | 4 | 70 | 90 |
| 1 | 5 | 40 | 100 |
| . | . | . | . |
| . | . | . | . |
| 1 | 72 | 700 | 50 |
+---------+-------+----------+----------+
and so on for every customer number
Can you please help me with this?
Thanks
Upvotes: 1
Views: 174
Reputation: 5526
IIUC, you need wide to long
kind of transformation which can be achieved by stack
in pyspark
I created a sample dataframe with 5 months data
df = spark.createDataFrame([(1,10,20,30,40,50,10,20,30,40,50),(2,10,20,30,40,50,10,20,30,40,50)],['cust','Measrue1_month1','Measrue1_month2','Measrue1_month3','Measrue1_month4','Measrue1_month5','Measrue2_month1','Measrue2_month2','Measrue2_month3','Measrue2_month4','Measrue2_month5'])
Now generating the clause for stack operation. Can be done in better ways but here is the most simplest example
Measure1 = [i for i in df.columns if i.startswith('Measrue1')]
Measure2 = [i for i in df.columns if i.startswith('Measrue2')]
final = []
for i in Measure1:
for j in Measure2:
if(i.split('_')[1]==j.split('_')[1]):
final.append((i,j))
rows = len(final)
values = ','.join([f"'{i.split('_')[1]}',{i},{j}" for i,j in final])
Now actually applying the stack operation
df.select('cust',expr(f'''stack({rows},{values})''').alias('Month','Measure1','Measure2')).show()
+----+------+--------+--------+
|cust| Month|Measure1|Measure2|
+----+------+--------+--------+
| 1|month1| 10| 10|
| 1|month2| 20| 20|
| 1|month3| 30| 30|
| 1|month4| 40| 40|
| 1|month5| 50| 50|
| 2|month1| 10| 10|
| 2|month2| 20| 20|
| 2|month3| 30| 30|
| 2|month4| 40| 40|
| 2|month5| 50| 50|
+----+------+--------+--------+
Upvotes: 1