lythic

Reputation: 516

Accessing column names with periods - Spark SQL 1.3

I have a DataFrame whose field names contain a period. When I try to select() them, Spark cannot resolve them, presumably because '.' is used to access nested fields.

Here's the error:

enrichData.select("google.com")
org.apache.spark.sql.AnalysisException: cannot resolve 'google.com' given input columns google.com, yahoo.com, ...

Is there a way to access these columns? Or an easy way to rename them? (If I can't select them, how can I change their names?)

Upvotes: 2

Views: 2981

Answers (2)

Sarath Chandra Vema

Reputation: 812

A period in a column name makes Spark treat it as a nested field (a field within a field). To work around that, wrap the name in backticks (`). This should work:

scala> val df = Seq(("yr", 2000), ("pr", 12341234)).toDF("x.y", "e")
df: org.apache.spark.sql.DataFrame = [x.y: string, e: int]

scala> df.select("`x.y`").show
+---+
|x.y|
+---+
| yr|
| pr|
+---+

You just need to wrap the column name in a pair of backticks (`).
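If you build column references programmatically, it can help to quote only the names that need it. Here is a small plain-Scala helper (hypothetical, not part of Spark's API) that backtick-quotes a name whenever it contains a dot, so select() sees it as a single top-level column rather than a nested-field path:

```scala
// Hypothetical helper: quote a column name with backticks when it
// contains a '.', so Spark does not parse it as a nested-field path.
def quoteIfDotted(name: String): String =
  if (name.contains(".")) s"`$name`" else name

// Example: build a select over all columns of a DataFrame, quoting as needed:
//   df.select(df.columns.map(quoteIfDotted).map(col): _*)
```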

Upvotes: 4

David Griffin

Reputation: 13927

You can rebuild the DataFrame from the same rows with a schema whose field names have the periods removed:

import org.apache.spark.sql.types.{StructField, StructType}

val newEnrichData = sqlContext.createDataFrame(
  enrichData.rdd,
  StructType(enrichData.schema.fields.map(sf =>
    StructField(sf.name.replace(".", ""), sf.dataType, sf.nullable)
  ))
)
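The renaming step above is just a string transformation over the schema's field names. As a sketch of that mapping in plain Scala (a hypothetical helper for illustration, using the example column names from the question), stripping the dots looks like this:

```scala
// Mirrors the schema rewrite above: map each original field name to a
// dot-free version. Note this can collide if two names differ only in dots.
def dotFreeNames(names: Seq[String]): Seq[String] =
  names.map(_.replace(".", ""))
```

Note that removing dots can produce duplicate names if two columns differ only by their dots, so a replacement character such as "_" may be safer than deleting the dot outright.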

Upvotes: 1
