Konrad Kostrzewa

Reputation: 835

pyspark column name after pivot

On input I have a DF similar to:

+-----+-----+
|data1|data2|
+-----+-----+
|  1.0| 0.33|
|  1.0|    0|
|  2.0| 0.33|
|  1.0|    0|
|  1.0| 0.33|
|  2.0| 0.33|
+-----+-----+

after performing pivot

pivot = df.groupBy('data1').pivot('data2').count()

the structure looks like this:

+-----+----+----+
|data1|   0|0.33|
+-----+----+----+
|  1.0|   2|   2|
|  2.0|null|   2|
+-----+----+----+

Attempting to do anything with the column 0.33 results in:

AnalysisException: Can't extract value from 0#1535L;

How to handle this case?

Upvotes: 0

Views: 3364

Answers (1)

titiro89

Reputation: 2108

The problem is that your column name contains a dot. As you can see here:

Spark SQL doesn't support field names that contain dots

Solution 1

Rename the columns with new names (the new names must not contain dots):

There are many ways to do this; see this SO question. Here is an example adapted from that question:

>>> from functools import reduce  # on Python 3, reduce lives in functools
>>> oldColumns = pivot.schema.names
>>> newColumns = ["data1", "col1", "col2"]
>>> newPivot = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), pivot)
>>> newPivot.show()
+-----+----+----+
|data1|col1|col2|
+-----+----+----+
|  1.0|   2|   2|
|  2.0|null|   2|
+-----+----+----+
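Since the old and new names line up positionally, the same rename can also be done in one call with PySpark's standard `DataFrame.toDF(*names)` method. A minimal sketch (the `sanitize_name` helper is hypothetical, not part of any library; the `toDF` usage is shown as a comment because it assumes a live Spark session):

```python
# Hypothetical helper: map dotted column names to Spark-safe ones.
def sanitize_name(name):
    """Replace dots, which Spark SQL cannot parse in field names."""
    return name.replace(".", "_")

# Assumed usage against the pivoted DataFrame from above:
# newPivot = pivot.toDF(*[sanitize_name(c) for c in pivot.columns])
# newPivot.show()  # columns would now be data1, 0, 0_33
```

This avoids hand-maintaining a `newColumns` list when the pivot values (and hence the column names) are not known in advance.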

Solution 2

Use backticks ( ` ) to select columns that have dots in their names (here is an example):

>>> newPivot = pivot.groupBy().sum("`0.33`")
>>> newPivot.show()
+---------+
|sum(0.33)|
+---------+
|        4|
+---------+
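The backtick quoting can be generated programmatically when the dotted names come from data. A small sketch (the `quote_col` helper is hypothetical; doubling an embedded backtick is the Spark SQL escape convention for quoted identifiers):

```python
# Hypothetical helper: wrap a column name in backticks so Spark SQL
# treats the dot as a literal character rather than a struct-field
# accessor (which is what triggers "Can't extract value from ...").
def quote_col(name):
    return "`{}`".format(name.replace("`", "``"))

# Assumed usage with the pivoted frame from the question:
# pivot.select(quote_col("0.33")).show()
```
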

Upvotes: 3

Related Questions