Reputation: 516
I have a DataFrame with column names that contain a period. When I attempt to use select() on them, Spark cannot resolve them, likely because '.' is used for accessing nested fields.
Here's the error:
enrichData.select("google.com")
org.apache.spark.sql.AnalysisException: cannot resolve 'google.com' given input columns google.com, yahoo.com, ....
Is there a way to access these columns? Or an easy way to change the column names? (Since I can't select them, how would I rename them?)
Upvotes: 2
Views: 2981
Reputation: 812
A period in a column name makes Spark treat it as a nested field, i.e. a field within a field. To counter that, you need to escape the name with backticks (`). This should work:
scala> val df = Seq(("yr", 2000), ("pr", 12341234)).toDF("x.y", "e")
df: org.apache.spark.sql.DataFrame = [x.y: string, e: int]
scala> df.select("`x.y`").show
+---+
|x.y|
+---+
| yr|
| pr|
+---+
The backticks tell Spark to treat the whole quoted string as a single column identifier rather than a nested-field path.
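Since the question also asks how to rename such columns: once a column is addressable via backticks, it can be renamed with alias, or more directly with withColumnRenamed, which takes the name literally. A minimal sketch, reusing the df defined above:

```scala
// Rename via select + alias (backticks needed to resolve the dotted name):
val renamedViaAlias = df.select(df("`x.y`").alias("x_y"), df("e"))

// Or rename in place; withColumnRenamed matches the name literally,
// so no backticks are required here:
val renamedInPlace = df.withColumnRenamed("x.y", "x_y")

renamedInPlace.select("x_y").show
```

After renaming, the column can be selected without any escaping.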
Upvotes: 4
Reputation: 13927
You can recreate the DataFrame with a new schema that has the periods stripped out, like this:
import org.apache.spark.sql.types.{StructField, StructType}

val newEnrichData = sqlContext.createDataFrame(
  enrichData.rdd,
  StructType(enrichData.schema.fields.map(sf =>
    StructField(sf.name.replace(".", ""), sf.dataType, sf.nullable)
  ))
)
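If all you need is the renaming (types and nullability stay the same), a shorter variant of the same idea is toDF, which accepts a full list of replacement column names. A sketch, assuming enrichData is the DataFrame from the question:

```scala
// Replace "." with "_" in every column name in one pass.
val renamed = enrichData.toDF(enrichData.columns.map(_.replace(".", "_")): _*)

// The dotted names are gone, so plain select works:
renamed.select("google_com")
```

This avoids the round trip through the RDD, which would otherwise discard any query-plan optimizations.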
Upvotes: 1