I have a requirement to convert a table using pivot. I have tried it in PySpark SQL but am facing issues.
Store   Week   xCount
-----   ----   ------
102     1      96
101     1      138
105     1      37
109     1      59
101     2      282
102     2      212
105     2      78
109     2      97
105     3      60
102     3      123
101     3      220
109     3      87
I would like it to come out as a pivot table, like this:
Store   1     2     3
-----   ---   ---   ---
101     138   282   220
102     96    212   123
105     37    78    60
109     59    97    87
The code I am using is below:
from pyspark import SparkContext, SparkConf
from builtins import int
#from org.spark.com.PySparkDemo import data
from pyspark.sql import Row
from pyspark.sql.context import SQLContext

conf = SparkConf().setAppName("FileSystem").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

a = sc.textFile("C:/Users/ashok.darsi/Desktop/Data1.txt")
b = a.map(lambda x: x.split(",")).map(lambda x: Row(store=int(x[0]), week=int(x[1]), xcount=int(x[2])))
c = sqlContext.createDataFrame(b)
#c.show()
c.registerTempTable("d")
e = sqlContext.sql("select * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt")
e.show()
The error I am facing:
select * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt
---------------------^^^
pyspark.sql.utils.ParseException: "\nmismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 21)\n\n== SQL ==\nselect * from d pivot(max(xcount)) for week in ([1],[2],[3])) as pivt\n---------------------^^^\n"
SUCCESS: The process with PID 8668 (child process of PID 2192) has been terminated.
SUCCESS: The process with PID 2192 (child process of PID 4940) has been terminated.
SUCCESS: The process with PID 4940 (child process of PID 11192) has been terminated.
Upvotes: 3
Views: 12908
Reputation: 10082
I don't think PIVOT is natively supported by HiveQL/Spark SQL. You can use the DataFrame-specific pivot function here instead:
df.groupBy("store").pivot("week").max("xcount").show()
+-----+---+---+---+
|store| 1| 2| 3|
+-----+---+---+---+
| 101|138|282|220|
| 102| 96|212|123|
| 105| 37| 78| 60|
| 109| 59| 97| 87|
+-----+---+---+---+
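For reference, here is a minimal end-to-end sketch of that DataFrame approach using the SparkSession API. The file path and column names are taken from the question; the app name, local master, and the createDataFrame wiring are just one reasonable setup, so adjust them to your environment:

from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession; app name and master are placeholders.
spark = SparkSession.builder.appName("PivotExample").master("local[*]").getOrCreate()

# Build a typed DataFrame from the same comma-separated Data1.txt as in the question.
rdd = spark.sparkContext.textFile("C:/Users/ashok.darsi/Desktop/Data1.txt")
df = spark.createDataFrame(
    rdd.map(lambda line: line.split(","))
       .map(lambda p: (int(p[0]), int(p[1]), int(p[2]))),
    ["store", "week", "xcount"],
)

# groupBy + pivot turns the distinct week values into columns; max() fills each cell.
df.groupBy("store").pivot("week").max("xcount").show()

If you already know the week values, passing them explicitly as pivot("week", [1, 2, 3]) saves Spark an extra pass over the data to discover the distinct values.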
With Spark SQL, you'll have to specify each week value manually and then take the max for each:
select
    store,
    max(case when week = 1 then xcount else null end) as `1`,
    max(case when week = 2 then xcount else null end) as `2`,
    max(case when week = 3 then xcount else null end) as `3`
from
    d
group by
    store
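If you want to run that query from PySpark rather than a SQL shell, a quick sketch (assuming the df and spark objects from the snippet above are in scope) is to register a temporary view and pass the statement to spark.sql:

# Assumes `df` and `spark` from the previous sketch.
df.createOrReplaceTempView("d")

spark.sql("""
    select
        store,
        max(case when week = 1 then xcount else null end) as `1`,
        max(case when week = 2 then xcount else null end) as `2`,
        max(case when week = 3 then xcount else null end) as `3`
    from d
    group by store
""").show()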
Upvotes: 2