Paul Reiners

Reputation: 7894

Scala Apache Spark: Nonstandard characters in column names

I'm calling the following:

  propertiesDF.select(
        col("timestamp"), col("coordinates")(0) as "lon", 
        col("coordinates")(1) as "lat", 
        col("properties.tide (above mllw)") as "tideAboveMllw",
        col("properties.wind speed") as "windSpeed")

This gives me the following error:

org.apache.spark.sql.AnalysisException: No such struct field tide (above mllw) in air temperature, atmospheric pressure, dew point, dominant wave period, mean wave direction, name, program name, significant wave height, tide (above mllw):, visibility, water temperature, wind direction, wind speed;

Now there definitely is such a struct field. (The error message itself says so.)

Here is the schema:

 root
 |-- timestamp: long (nullable = true)
 |-- coordinates: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- properties: struct (nullable = true)
 |    |-- air temperature: double (nullable = true)
 |    |-- atmospheric pressure: double (nullable = true)
 |    |-- dew point: double (nullable = true)
          .
          .
          .
 |    |-- tide (above mllw):: string (nullable = true)
          .
          .
          .

The input is read as JSON like this:

val df = sqlContext.read.json(dirName)

How do I handle parentheses in a column name?

Upvotes: 0

Views: 1620

Answers (2)

zero323

Reputation: 330373

You should avoid names like this in the first place, but you can either split the access path:

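import org.apache.spark.sql.functions.{col, lit, struct}

// One-row DataFrame with a struct column whose field names contain spaces and parentheses
val df = spark.range(1).select(struct(
  lit(123).as("tide (above mllw)"),
  lit(1).as("wind speed")
).as("properties"))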

df.select(col("properties").getItem("tide (above mllw)"))

// or

df.select(col("properties")("tide (above mllw)"))

or enclose the problematic field name in backticks:

df.select(col("properties.`tide (above mllw)`"))

Both solutions assume your data has the following structure (based on the access path you use in your queries):

df.printSchema
// root
//  |-- properties: struct (nullable = false)
//  |    |-- tide (above mllw): integer (nullable = false)
//  |    |-- wind speed: integer (nullable = false)
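Applied to the query in the question: both the printed schema and the error message list the field as `tide (above mllw):`, with a trailing colon, so the name inside the backticks presumably has to include that colon. A minimal sketch, assuming a local SparkSession; the object name, helper names, and sample values are hypothetical stand-ins for the JSON input:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, struct}

object BacktickSelect {
  // Hypothetical one-row stand-in for the JSON data; values are made up.
  def sample(spark: SparkSession): DataFrame =
    spark.range(1).select(
      lit(1500000000L).as("timestamp"),
      array(lit(-122.4), lit(37.8)).as("coordinates"),
      struct(
        lit("1.2 ft").as("tide (above mllw):"), // note the trailing colon
        lit(5.0).as("wind speed")
      ).as("properties"))

  // The select from the question, with the awkward field quoted in backticks
  // and the second field reached through the split access path.
  def flatten(propertiesDF: DataFrame): DataFrame =
    propertiesDF.select(
      col("timestamp"),
      col("coordinates")(0) as "lon",
      col("coordinates")(1) as "lat",
      col("properties.`tide (above mllw):`") as "tideAboveMllw",
      col("properties")("wind speed") as "windSpeed")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("backtick-select")
      .getOrCreate()
    flatten(sample(spark)).printSchema()
    spark.stop()
  }
}
```

If the colon turns out not to be part of the real field name, drop it from the backticked name; `propertiesDF.select("properties.*").columns` will list the exact names Spark sees.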

Upvotes: 2

Aleksandar Stojadinovic

Reputation: 5049

Based on the documentation, you might try single quotes, like this:

 propertiesDF.select(
        col("timestamp"), col("coordinates")(0) as "lon", 
        col("coordinates")(1) as "lat", 
        col("'properties.tide (above mllw)'") as "tideAboveMllw",
        col("properties.wind speed") as "windSpeed")
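As far as I can tell, `col` treats only backticks as quoting characters and takes single quotes as part of the name, so the snippet above may still fail to resolve. A small check that tries both styles, assuming a local SparkSession and a hypothetical one-row DataFrame (the object and helper names are made up for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct}
import scala.util.Try

object QuoteCheck {
  // Hypothetical test data: a struct with one awkwardly named field.
  def sample(spark: SparkSession): DataFrame =
    spark.range(1).select(
      struct(lit(123).as("tide (above mllw)")).as("properties"))

  // True if the column name string resolves against df.
  // select analyzes the plan eagerly, so an unresolvable name throws here.
  def resolves(df: DataFrame, name: String): Boolean =
    Try(df.select(col(name))).isSuccess

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("quote-check")
      .getOrCreate()
    val df = sample(spark)
    println("backticks:     " + resolves(df, "properties.`tide (above mllw)`"))
    println("single quotes: " + resolves(df, "'properties.tide (above mllw)'"))
    spark.stop()
  }
}
```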

Upvotes: 0
