whatsinthename

Reputation: 2157

Unable to remove the space from column names in spark scala

I have a parquet dataset whose column names contain spaces between the words, e.g. BRANCH NAME. When I remove the space and try to print the column, it results in an error. Below is my code with multiple approaches, followed by the error:

Approach 1:

var df = spark.read.parquet("s3://tvsc-lumiq-edl/raw-v2/LMSDB/DESUSR/TBL_DES_SLA_MIS1")


for (c <- df.columns){
        df = df.withColumnRenamed(c, c.replace(" ", ""))
}

Approach 2:

df = df.columns.foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "")))

Approach 3:

val new_cols =  df.columns.map(x => x.replaceAll(" ", "")) 

val df2 = df.toDF(new_cols : _*)   

Error:

org.apache.spark.sql.AnalysisException: Attribute name "BRANCH NAME" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;  

Below is the schema:

scala> df.printSchema()
root
 |-- dms_timestamp: string (nullable = true)
 |-- BRANCH NAME: string (nullable = true)
 |-- BRANCH CODE: string (nullable = true)
 |-- DEALER NAME: string (nullable = true)
 |-- DEALER CODE: string (nullable = true)
 |-- DEALER CATEGORY: string (nullable = true)
 |-- PRODUCT: string (nullable = true)
 |-- CREATION DATE: string (nullable = true)
 |-- CHANNEL TYPE: string (nullable = true)
 |-- DELAY DAYS: string (nullable = true)

I have also referred to multiple SO posts, but they didn't help.

Upvotes: 0

Views: 823

Answers (4)

Brandon

Reputation: 748

Thanks to @Andrey's update:

Spark can now correctly load parquet files with column names containing these special characters. Just make sure you're using Spark 3.3.0 or later.


If all the datasets are in parquet files, I'm afraid we're out of luck and you have to load them in Pandas and then do the renaming.

Spark won't read parquet files with column names containing characters among " ,;{}()\n\t=" at all. AFAIK, the Spark devs refused to resolve this issue; the root cause lies in the parquet files themselves. At least according to the devs, parquet files should not have these "invalid characters" in their column names in the first place.

See https://issues.apache.org/jira/browse/SPARK-27442 . It was marked as "won't fix".
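The check behind the error is essentially a blacklist over the field name. A minimal, self-contained sketch of that rule (the character set is copied from the error message above; the object and method names are hypothetical, not Spark's internal API):

```scala
object FieldNameCheck {
  // The characters Spark's parquet support rejects in field names,
  // as quoted in the AnalysisException message.
  val invalid: Set[Char] = " ,;{}()\n\t=".toSet

  // A name is acceptable only if it contains none of the invalid characters.
  def isValid(name: String): Boolean = name.forall(c => !invalid.contains(c))

  def main(args: Array[String]): Unit = {
    println(isValid("BRANCH NAME")) // false - contains a space
    println(isValid("BRANCH_NAME")) // true
  }
}
```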

Upvotes: 1

Andrey

Reputation: 60065

This was fixed in the Spark 3.3.0 release (I tested it). So your only option is to upgrade (or use pandas to rename the fields).

Upvotes: 1

Priyanka

Reputation: 51

This worked for me:

val dfnew = df.select(df.columns.map(i => col(i).as(i.replaceAll(" ", ""))): _*)

Upvotes: 0

s.polam

Reputation: 10362

Try the code below. The backticks in col(s"`${c}`") quote the original name so that Spark treats a name with spaces as a single identifier.

df
.select(df.columns.map(c => col(s"`${c}`").as(c.replace(" ",""))):_*)
.show(false)
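The renaming itself is plain string manipulation on the column names. A minimal self-contained sketch (hypothetical object name, no Spark session required) of how the new names are derived from the schema shown in the question:

```scala
object RenameSketch {
  // Strip every whitespace character from a column name, mirroring the
  // replaceAll("\\s", "") / replace(" ", "") calls in the answers above.
  def sanitize(name: String): String = name.replaceAll("\\s", "")

  def main(args: Array[String]): Unit = {
    val cols = Seq("dms_timestamp", "BRANCH NAME", "DEALER CODE")
    println(cols.map(sanitize).mkString(",")) // dms_timestamp,BRANCHNAME,DEALERCODE
  }
}
```

Each sanitized name is then attached via .as(...) in the select, which is the aliasing the error message asks for.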

Upvotes: 0
