Reputation: 2157
I have a parquet dataset whose column names contain spaces, e.g. BRANCH NAME. Now, when I try to replace the spaces and print the column, it results in an error. Below is my code, with multiple approaches, followed by the error:
Approach 1:
var df = spark.read.parquet("s3://tvsc-lumiq-edl/raw-v2/LMSDB/DESUSR/TBL_DES_SLA_MIS1")
for (c <- df.columns) {
  df = df.withColumnRenamed(c, c.replace(" ", ""))
}
Approach 2:
df = df.columns.foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "")))
Approach 3:
val new_cols = df.columns.map(x => x.replaceAll(" ", ""))
val df2 = df.toDF(new_cols : _*)
Error:
org.apache.spark.sql.AnalysisException: Attribute name "BRANCH NAME" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
Below is the schema:
scala> df.printSchema()
root
 |-- dms_timestamp: string (nullable = true)
 |-- BRANCH NAME: string (nullable = true)
 |-- BRANCH CODE: string (nullable = true)
 |-- DEALER NAME: string (nullable = true)
 |-- DEALER CODE: string (nullable = true)
 |-- DEALER CATEGORY: string (nullable = true)
 |-- PRODUCT: string (nullable = true)
 |-- CREATION DATE: string (nullable = true)
 |-- CHANNEL TYPE: string (nullable = true)
 |-- DELAY DAYS: string (nullable = true)
I have also referred to multiple SO posts, but they didn't help.
Upvotes: 0
Views: 823
Reputation: 748
Thanks to @Andrey's update: we can now correctly load parquet files with column names containing these special characters. Just make sure you're using Spark 3.3.0 or later.
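A minimal sketch of that path, assuming Spark >= 3.3.0 (where the field-name check was removed per SPARK-27442); the input path is the one from the question, and the output path is hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rename-parquet-columns").getOrCreate()

// On Spark 3.3.0+ this read no longer fails the invalid-character check
val df = spark.read.parquet("s3://tvsc-lumiq-edl/raw-v2/LMSDB/DESUSR/TBL_DES_SLA_MIS1")

// Replace every whitespace character in each column name with "_"
val cleaned = df.columns.foldLeft(df) { (cur, c) =>
  cur.withColumnRenamed(c, c.replaceAll("\\s", "_"))
}

// Optionally persist a sanitized copy so that older Spark versions can read it too
cleaned.write.mode("overwrite").parquet("s3://tvsc-lumiq-edl/raw-v2/LMSDB/DESUSR/TBL_DES_SLA_MIS1_CLEAN")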
For Spark versions before 3.3.0: if all the datasets are in parquet files, I'm afraid we're out of luck and you have to load them in Pandas and then do the renaming.
Spark won't read parquet files with column names containing any of the characters " ,;{}()\n\t=" at all. AFAIK, the Spark devs refused to resolve this issue. The root cause lies in the parquet files themselves: at least according to the devs, parquet files should not have these "invalid characters" in their column names in the first place.
See https://issues.apache.org/jira/browse/SPARK-27442. It was marked as "won't fix".
Upvotes: 1
Reputation: 60065
This was fixed in the Spark 3.3.0 release (I tested it), so your only option is to upgrade (or use pandas to rename the fields).
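A quick sanity check before retrying the read (a sketch; spark here is the usual SparkSession provided by spark-shell):

println(spark.version) // should print 3.3.0 or later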
Upvotes: 1
Reputation: 51
This worked for me:
import org.apache.spark.sql.functions.col

// Alias every column to its name with the spaces removed
val dfnew = df.select(df.columns.map(i => col(i).as(i.replaceAll(" ", ""))): _*)
Upvotes: 0
Reputation: 10362
Try the code below.
import org.apache.spark.sql.functions.col

df
  // Backquote each name so columns containing spaces resolve as a single identifier
  .select(df.columns.map(c => col(s"`${c}`").as(c.replace(" ", ""))): _*)
  .show(false)
Upvotes: 0