Karel Marik

Reputation: 1023

pyspark dataframe column name

What are the limitations on PySpark DataFrame column names? I have an issue with the following code:

%livy.pyspark
df_context_spark.agg({'spatialElementLabel.value': 'count'})

It gives ...

u'Cannot resolve column name "spatialElementLabel.value" among (lightFixtureID.value, spatialElementLabel.value);'

The column name is evidently typed correctly. I got the DataFrame by converting a pandas DataFrame. Is there any issue with the dot in the column name string?

Upvotes: 0

Views: 2177

Answers (1)

Ryan Widmaier

Reputation: 8523

Dots are used to access nested fields inside a struct type. So if you had a column called "address" of type StructType, with fields street1, street2, etc. inside it, you would access the individual fields like this:

df.select("address.street1", "address.street2", ..)

Because of that, if you want to use a dot in your field name you need to quote the field with backticks whenever you refer to it. For example:

from pyspark.sql.types import *

schema = StructType([StructField("my.field", StringType())])

rdd = sc.parallelize([('hello',), ('world',)])
df = sqlContext.createDataFrame(rdd, schema)

# Using backticks to quote the field name
df.select("`my.field`").show()

Upvotes: 2
