Jared

Reputation: 26149

How can I reference a column with a hyphen in its name in a pyspark column expression?

I have a JSON document shaped like this (note that this schema isn't under my control, so I can't get rid of the hyphen in the key):

{
   "col1": "value1",
   "dictionary-a": {
      "col2": "value2"
   }
}

I use session.read.json(...) to read this JSON into a dataframe (named 'df') like this:

df = session.read.json('/path/to/json.json')

I want to do this:

df2 = df.withColumn("col2", df.dictionary-a.col2)

I get the error:

AttributeError: 'DataFrame' object has no attribute 'dictionary'

How can I reference columns with hyphens in their names in pyspark column expressions?

Upvotes: 2

Views: 2433

Answers (1)

pault

Reputation: 43504

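As you have it, Python parses the hyphen in df.dictionary-a.col2 as subtraction: df.dictionary - a.col2. The left operand, df.dictionary, is evaluated first, and since the dataframe has no column named dictionary, that lookup raises the AttributeError you see.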
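If you want to confirm how Python reads that expression, a small sketch with the standard-library ast module shows the parse tree without needing Spark at all. The variable names here (tree, expr, left, right) are only for illustration.

```python
import ast

# Parse the expression from the question without evaluating it.
tree = ast.parse("df.dictionary-a.col2", mode="eval")
expr = tree.body

# The top-level node is a binary operation whose operator is subtraction.
is_subtraction = isinstance(expr, ast.BinOp) and isinstance(expr.op, ast.Sub)

# Left operand: attribute "dictionary" looked up on the name "df".
left = (expr.left.value.id, expr.left.attr)
# Right operand: attribute "col2" looked up on the name "a".
right = (expr.right.value.id, expr.right.attr)

print(is_subtraction, left, right)
```

So the hyphen never reaches Spark as part of a column name. Python splits the expression into two attribute lookups before any pyspark code runs.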

Instead, you can use pyspark.sql.functions.col to refer to the column by its name as a string, and pyspark.sql.Column.getItem to access the nested col2 value by key.

Try:

from pyspark.sql.functions import col
df2 = df.withColumn("col2", col("dictionary-a").getItem("col2"))

Upvotes: 2
