Shakile

Reputation: 343

How to remove special characters from Dataset column names

Does anyone know how to remove special characters from Dataset column names in Spark Java?

I would like to replace "_" with " " (see the example below).

Input (df_in):

+--------------+-----------------+------------+------------+
|PARTITION_DATE|       date_start|   dt_con_ID|   dt_fin_ID|
+--------------+-----------------+------------+------------+
|    2020/03/03|2019-10-14 00:00:| 11000295001|   110100156|

Desired output (df_out):

+--------------+-----------------+------------+------------+
|PARTITION DATE|       date start|   dt con ID|   dt fin ID|
+--------------+-----------------+------------+------------+
|    2020/03/03|2019-10-14 00:00:| 11000295001|   110100156|

I tried to get this output with the following code:

String[] colsToRename = df_in.columns();
for (String headerName : df_in.columns()) {
    Dataset<Row> df_out = df_in.withColumnRenamed(headerName, headerName.replaceAll("_", " "));
    df_out.show();
}

But with this, only the last column name is modified:

+--------------+-----------------+------------+------------+
|PARTITION_DATE|       date_start|   dt_con_ID|   dt fin ID|
+--------------+-----------------+------------+------------+
|    2020/03/03|2019-10-14 00:00:| 11000295001|   110100156|

Upvotes: 0

Views: 418

Answers (3)

notNull

Reputation: 31490

Call .toDF() on the DataFrame again, passing the new column names.

Example:

val df=Seq((1,2,3,4)).toDF("PARTITION_DATE","date_start","dt_con_id","dt_fin_id")

df.toDF(df.columns.map(x => x.replaceAll("_"," ")):_*).show()

//+--------------+----------+---------+---------+
//|PARTITION DATE|date start|dt con id|dt fin id|
//+--------------+----------+---------+---------+
//|             1|         2|        3|        4|
//+--------------+----------+---------+---------+
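Since the question is about Spark Java, here is a rough Java sketch of the name-mapping step only. It uses plain strings so it runs without Spark; the class `ColumnNames` and the method `withSpaces` are hypothetical names chosen for illustration, not part of any Spark API.

```java
import java.util.Arrays;

// Hypothetical helper: computes the new column names without touching Spark.
class ColumnNames {
    // Replace every underscore in each column name with a space.
    static String[] withSpaces(String[] columns) {
        return Arrays.stream(columns)
                .map(name -> name.replace("_", " "))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] columns = {"PARTITION_DATE", "date_start", "dt_con_ID", "dt_fin_ID"};
        // prints: PARTITION DATE|date start|dt con ID|dt fin ID
        System.out.println(String.join("|", withSpaces(columns)));
    }
}
```

With a real Dataset<Row>, the resulting array could presumably be passed straight to the varargs overload, as in df_in.toDF(ColumnNames.withSpaces(df_in.columns())), which renames all columns in one call.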

Upvotes: 0

z_1_p

Reputation: 409

scala> val data = Seq((1,2,3),(1,2,3)).toDF("A_a","B_b","C_c")
scala> val renamed = data.columns.foldLeft(data)((df, column) => df.withColumnRenamed(column, column.replaceAll("_", " ")))
scala> renamed.show
+---+---+---+
|A a|B b|C c|
+---+---+---+
|  1|  2|  3|
|  1|  2|  3|
+---+---+---+

Try something like the above.

Upvotes: 0

Carlos Vilchez

Reputation: 2804

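Each time the loop runs, it renames a single column of the original df_in and stores the result in a fresh df_out, so the rename from the previous iteration is thrown away. That is why only the last column ends up changed. To rename all of them, each rename has to be applied to the result of the previous one.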

Try this:

String[] colsToRename = df_in.columns();

Dataset<Row> df_out = df_in;
// Reassign df_out on every iteration so each rename builds on the previous one.
for (String headerName : colsToRename) {
    df_out = df_out.withColumnRenamed(headerName, headerName.replaceAll("_", " "));
}
df_out.show();
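
To see the difference between the two loops without a Spark session, here is a small sketch using a stand-in class. `Frame` and `RenameLoop` are hypothetical and only imitate the one property that matters here: renaming returns a new object and leaves the receiver unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for an immutable Dataset; it only tracks column names.
final class Frame {
    final List<String> columns;

    Frame(List<String> columns) {
        this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
    }

    // Returns a new Frame; the receiver is left untouched.
    Frame withColumnRenamed(String existing, String replacement) {
        List<String> renamed = new ArrayList<>(columns);
        int i = renamed.indexOf(existing);
        if (i >= 0) {
            renamed.set(i, replacement);
        }
        return new Frame(renamed);
    }
}

class RenameLoop {
    // Pattern from the question: every iteration starts again from the original.
    static Frame renameFromOriginal(Frame in) {
        Frame out = in;
        for (String name : in.columns) {
            out = in.withColumnRenamed(name, name.replace("_", " "));
        }
        return out;
    }

    // Corrected pattern: every iteration builds on the previous result.
    static Frame renameAccumulating(Frame in) {
        Frame out = in;
        for (String name : in.columns) {
            out = out.withColumnRenamed(name, name.replace("_", " "));
        }
        return out;
    }

    public static void main(String[] args) {
        Frame in = new Frame(Arrays.asList("PARTITION_DATE", "date_start", "dt_con_ID", "dt_fin_ID"));
        // prints: [PARTITION_DATE, date_start, dt_con_ID, dt fin ID]
        System.out.println(renameFromOriginal(in).columns);
        // prints: [PARTITION DATE, date start, dt con ID, dt fin ID]
        System.out.println(renameAccumulating(in).columns);
    }
}
```

The first line of output matches what the question reports: only the last rename survives.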

Upvotes: 1
