Reputation: 205
I have this data on which I am trying to do a pivot using Spark SQL. The end result needs to look like this:
+------------------+------------------+------------------+------------------+------------------+
|Product |AVG_CENTRAL |AVG_EAST |AVG_WEST |avg(Autonumber) |
+------------------+------------------+------------------+------------------+------------------+
|Air Purifiers |300.3333333333333 |449.875 |348.5 |399.7692307692308 |
|Art Supplies |352.42857142857144|349.48837209302326|324.94444444444446|344.14666666666665|
|Baby Food |372.0 |null |null |372.0 |
|Beverages |320.0 |409.0 |null |379.3333333333333 |
|Binder Clips |70.5 |551.8571428571429 |494.0 |453.8181818181818 |
|Binding Machines |null |null |361.0 |361.0 |
|Binding Supplies |610.0 |462.2 |616.0 |529.8888888888889 |
|Biscuits |505.0 |469.5 |null |481.3333333333333 |
|Bread |null |266.0 |328.0 |297.0 |
|Business Envelopes|421.0 |356.25 |null |369.2 |
|Cakes |null |314.0 |null |314.0 |
|Cardboard Storage |409.0 |null |null |409.0 |
|Cereals |583.0 |356.38461538461536|278.6666666666667 |356.0 |
|Clocks |null |220.2 |445.5 |284.57142857142856|
|Colored Envelopes |615.0 |524.0 |222.0 |471.25 |
|Computer Paper |358.84615384615387|370.02564102564105|377.6666666666667 |369.9142857142857 |
|Copy Paper |300.2857142857143 |454.8888888888889 |223.55555555555554|361.8235294117647 |
|End Tables |null |36.0 |234.0 |135.0 |
|File Labels |351.8333333333333 |325.52941176470586|379.0 |344.4193548387097 |
|Frames |471.0 |258.5 |709.0 |325.84615384615387|
+------------------+------------------+------------------+------------------+------------------+
I tried the following query for the pivot:
spark.sql("""SELECT * FROM (SELECT Product,Region,Autonumber FROM Test) TEST PIVOT (avg(Autonumber) FOR Region in ('Central' AVG_CENTRAL, 'East' AVG_EAST, 'West' AVG_WEST )) ORDER BY PRODUCT""")
The resulting DataFrame was this:
+------------------+------------------+------------------+------------------+
| Product| AVG_CENTRAL| AVG_EAST| AVG_WEST|
+------------------+------------------+------------------+------------------+
| Air Purifiers| 300.3333333333333| 449.875| 348.5|
| Art Supplies|352.42857142857144|349.48837209302326|324.94444444444446|
| Baby Food| 372.0| null| null|
| Beverages| 320.0| 409.0| null|
| Binder Clips| 70.5| 551.8571428571429| 494.0|
| Binding Machines| null| null| 361.0|
| Binding Supplies| 610.0| 462.2| 616.0|
| Biscuits| 505.0| 469.5| null|
| Bread| null| 266.0| 328.0|
|Business Envelopes| 421.0| 356.25| null|
| Cakes| null| 314.0| null|
| Cardboard Storage| 409.0| null| null|
| Cereals| 583.0|356.38461538461536| 278.6666666666667|
| Clocks| null| 220.2| 445.5|
| Colored Envelopes| 615.0| 524.0| 222.0|
| Computer Paper|358.84615384615387|370.02564102564105| 377.6666666666667|
| Copy Paper| 300.2857142857143| 454.8888888888889|223.55555555555554|
| End Tables| null| 36.0| 234.0|
| File Labels| 351.8333333333333|325.52941176470586| 379.0|
| Frames| 471.0| 258.5| 709.0|
+------------------+------------------+------------------+------------------+
I also wanted the average of the Autonumber column for each product, which I computed with a separate GROUP BY:
spark.sql("""SELECT PRODUCT , AVG(Autonumber) from test group by product""").show
+--------------------+------------------+
| PRODUCT| avg(Autonumber)|
+--------------------+------------------+
|Fruits and Vegeta...|401.09541984732823|
| Clocks|284.57142857142856|
| File Labels| 344.4193548387097|
| Art Supplies|344.14666666666665|
| Specialty Envelopes| 247.0|
| Copy Paper| 361.8235294117647|
| Computer Paper| 369.9142857142857|
| Highlighters| 17.5|
| Standard Labels|400.42857142857144|
| Portable Storage| 295.4|
| Pencils|419.14285714285717|
| Round Ring Binders|292.93333333333334|
| Meat| 365.1111111111111|
| Writings Pads| 336.5833333333333|
| Cereals| 356.0|
| End Tables| 135.0|
| Storage Cabinets| 267.5|
| Frames|325.84615384615387|
| Office Chairs| 376.0|
| Air Purifiers| 399.7692307692308|
+--------------------+------------------+
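Here df1 and df2 below are presumably the pivoted result and the grouped result above; a minimal sketch of the assignments, under that assumption:
scala> val df1 = spark.sql("""SELECT * FROM (SELECT Product, Region, Autonumber FROM Test) TEST PIVOT (avg(Autonumber) FOR Region IN ('Central' AVG_CENTRAL, 'East' AVG_EAST, 'West' AVG_WEST)) ORDER BY Product""")
scala> val df2 = spark.sql("""SELECT Product, AVG(Autonumber) FROM Test GROUP BY Product""")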
scala> df1.join(df2, "product").show(false)
Only then could I achieve my final result. Is there any way I can achieve this in the pivot query itself? Any help would be great. Also, it's really difficult to find blog posts on the SQL syntax of PIVOT in Spark.
Upvotes: 1
Views: 1219
Reputation: 205
I found a way of doing this. I went through this article and found that ROLLUP and CUBE can be used to solve it: CUBE over (region, product) also emits subtotal rows where region is NULL, i.e. the per-product average across all regions, and a null Total bucket in the pivot's IN list captures exactly those rows.
Code:
spark.sql("SELECT * FROM (select region , product , avg(autonumber) as autonumber from test group by cube (region , product) order by product NULLS LAST , region nulls last) pivot (avg(autonumber) for region in ('Central' Central , 'East' East , 'West' West , null Total)) order by product nulls last").show()
Output:
+--------------------+------------------+------------------+------------------+------------------+
| product| Central| East| West| Total|
+--------------------+------------------+------------------+------------------+------------------+
| Air Purifiers| 300.3333333333333| 449.875| 348.5| 399.7692307692308|
| Art Supplies|352.42857142857144|349.48837209302326|324.94444444444446|344.14666666666665|
| Baby Food| 372.0| null| null| 372.0|
| Beverages| 320.0| 409.0| null| 379.3333333333333|
| Binder Clips| 70.5| 551.8571428571429| 494.0| 453.8181818181818|
| Binding Machines| null| null| 361.0| 361.0|
| Binding Supplies| 610.0| 462.2| 616.0| 529.8888888888889|
| Biscuits| 505.0| 469.5| null| 481.3333333333333|
| Bread| null| 266.0| 328.0| 297.0|
| Business Envelopes| 421.0| 356.25| null| 369.2|
| Cakes| null| 314.0| null| 314.0|
| Cardboard Storage| 409.0| null| null| 409.0|
| Cereals| 583.0|356.38461538461536| 278.6666666666667| 356.0|
| Clocks| null| 220.2| 445.5|284.57142857142856|
| Colored Envelopes| 615.0| 524.0| 222.0| 471.25|
| Computer Paper|358.84615384615387|370.02564102564105| 377.6666666666667| 369.9142857142857|
| Copy Paper| 300.2857142857143| 454.8888888888889|223.55555555555554| 361.8235294117647|
| End Tables| null| 36.0| 234.0| 135.0|
| File Labels| 351.8333333333333|325.52941176470586| 379.0| 344.4193548387097|
| Frames| 471.0| 258.5| 709.0|325.84615384615387|
|Fruits and Vegeta...| 493.4| 366.408| 385.4805194805195|401.09541984732823|
| Heavy Duty Binders| 489.6666666666667| 423.6666666666667| 556.0| 481.5|
| Highlighters| null| null| 17.5| 17.5|
| Lighting| null| null| 141.0| 141.0|
| Meat| 469.5| 343.4375| 324.4166666666667| 365.1111111111111|
| Meeting Room Tables| 449.0| 447.0| 401.5|434.57142857142856|
| Metal Bookcases| null| 571.6666666666666| null| 571.6666666666666|
| Office Chairs| 534.4| 193.5| 360.5| 376.0|
| Paper Clips| 459.0|392.77777777777777| 466.0| 423.9375|
| Pencils| 508.0| 441.5| 19.0|419.14285714285717|
| Pins and Tacks| null| 426.5|232.33333333333334| 310.0|
| Portable Storage| 535.0| 251.0| 291.0| 295.4|
| Round Ring Binders| 271.2857142857143| 404.3333333333333| 34.5|292.93333333333334|
| Rubber Bands| null| 412.0| null| 412.0|
| Scissors| null| 374.0| 346.6|354.42857142857144|
| Specialty Envelopes| 448.0| 8.0| 84.0| 247.0|
| Spices| 508.3333333333333| 326.0| null| 462.75|
| Standard Labels| 429.0| 360.7857142857143|451.14285714285717|400.42857142857144|
| Storage Cabinets| 45.0| null| 341.6666666666667| 267.5|
| Surge Protectors| 465.0| 366.0| 348.5| 382.0|
| Writings Pads| 289.5| 384.6666666666667| 286.5| 336.5833333333333|
| null| 427.3090909090909|368.02349869451695|357.15458937198065| 378.0|
+--------------------+------------------+------------------+------------------+------------------+
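Since only the (region, product) and (product) groups are actually used by the pivot above, GROUPING SETS should give the same per-product result without computing the extra (region) and grand-total groups that CUBE adds (it also drops the trailing null-product row). A minimal sketch, assuming the same test table:
spark.sql("""
  SELECT * FROM (
    SELECT region, product, avg(autonumber) AS autonumber
    FROM test
    GROUP BY region, product GROUPING SETS ((region, product), (product))
  )
  PIVOT (avg(autonumber) FOR region IN ('Central' Central, 'East' East, 'West' West, null Total))
  ORDER BY product NULLS LAST
""").show()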
Upvotes: 3
Reputation: 391
I suggest you use a LEFT JOIN to achieve the result. Your final column could be fetched using:
df = spark.sql("""
SELECT
A.*, B.avg_autoNumber
FROM
(SELECT Product, AVG_CENTRAL, AVG_WEST, AVG_EAST FROM Test) A
LEFT JOIN (Select Product, avg(ALL) as avg(Autonumber) avg_autoNumber) B
ON A.Product = B.Product
""")
df.show()
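The same join can also be written with the DataFrame API; a sketch, assuming the pivoted result is in a DataFrame df1 (the names here are illustrative):
val avgDf = spark.sql("SELECT Product, AVG(Autonumber) AS avg_autoNumber FROM Test GROUP BY Product")
df1.join(avgDf, Seq("Product"), "left").show(false)  // left join keeps every pivoted row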
Upvotes: 2