Reputation: 343
Does anyone know how to remove special characters from Dataset column names in Spark Java?
I would like to replace "_" with " " (see the example below).
Input : (df_in)
+--------------+-----------------+------------+------------+
|PARTITION_DATE| date_start| dt_con_ID| dt_fin_ID|
+--------------+-----------------+------------+------------+
| 2020/03/03|2019-10-14 00:00:| 11000295001| 110100156|
Output desired : (df_out)
+--------------+-----------------+------------+------------+
|PARTITION DATE| date start| dt con ID| dt fin ID|
+--------------+-----------------+------------+------------+
| 2020/03/03|2019-10-14 00:00:| 11000295001| 110100156|
I tried to get this output with the following code:
String[] colsToRename = df_in.columns();
for (String headerName : df_in.columns()) {
Dataset<Row> df_out = df_in.withColumnRenamed(headerName, headerName.replaceAll("_", " "));
df_out.show();
}
But with this, only the last column name is modified:
+--------------+-----------------+------------+------------+
|PARTITION_DATE| date_start| dt_con_ID| dt fin ID|
+--------------+-----------------+------------+------------+
| 2020/03/03|2019-10-14 00:00:| 11000295001| 110100156|
Upvotes: 0
Views: 418
Reputation: 31490
Use .toDF() to create the dataframe again with the new column names.
Example:
val df=Seq((1,2,3,4)).toDF("PARTITION_DATE","date_start","dt_con_id","dt_fin_id")
df.toDF(df.columns.map(x => x.replaceAll("_"," ")):_*).show()
//+--------------+----------+---------+---------+
//|PARTITION DATE|date start|dt con id|dt fin id|
//+--------------+----------+---------+---------+
//| 1| 2| 3| 4|
//+--------------+----------+---------+---------+
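Since the question asks about Java, here is a rough sketch of the same idea in Java. The class name ColumnNames and the method withSpaces are made up for illustration. Only the commented-out line at the end touches Spark; it assumes df_in is the Dataset<Row> from the question and relies on the varargs overload toDF(String... colNames).

```java
// Spark-free helper that builds the new header names.
// ColumnNames and withSpaces are hypothetical names, not part of Spark.
class ColumnNames {

    // Returns a copy of the names with every "_" replaced by " ".
    static String[] withSpaces(String[] names) {
        String[] renamed = new String[names.length];
        for (int i = 0; i < names.length; i++) {
            // replace() treats "_" as a literal, so no regex is involved
            renamed[i] = names[i].replace("_", " ");
        }
        return renamed;
    }

    public static void main(String[] args) {
        String[] cols = {"PARTITION_DATE", "date_start", "dt_con_ID", "dt_fin_ID"};
        System.out.println(String.join("|", withSpaces(cols)));
        // prints: PARTITION DATE|date start|dt con ID|dt fin ID

        // With Spark on the classpath (df_in as in the question):
        // Dataset<Row> df_out = df_in.toDF(ColumnNames.withSpaces(df_in.columns()));
    }
}
```

This renames all columns in one call instead of chaining one withColumnRenamed per column.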
Upvotes: 0
Reputation: 409
scala> val data = Seq((1,2,3),(1,2,3)).toDF("A_a","B_b","C_c")
scala> data.columns.foldLeft(data)((df,column)=> df.withColumnRenamed(column, column.replaceAll("_"," ")))
scala> res1.show
+---+---+---+
|A a|B b|C c|
+---+---+---+
| 1| 2| 3|
| 1| 2| 3|
+---+---+---+
Try something like above.
Upvotes: 0
Reputation: 2804
On each iteration, the loop renames a single column of the original df_in and assigns the result to a fresh df_out, so every earlier rename is discarded and only the last one survives. To change all of them, each rename has to be applied to the result of the previous one.
Try this:
String[] colsToRename = df_in.columns();
Dataset<Row> df_out = df_in;
for (String headerName : colsToRename) {
    // Reassign df_out (no new declaration) so each rename builds on the previous one
    df_out = df_out.withColumnRenamed(headerName, headerName.replaceAll("_", " "));
}
df_out.show();
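If it helps to see the difference without a Spark session, here is a small Spark-free model of the two loops. The class RenameLoopModel and the helper renameOne are hypothetical stand-ins: renameOne only imitates the fact that withColumnRenamed returns a new object and leaves its input untouched.

```java
// Plain-Java model of the two loops; no Spark involved.
class RenameLoopModel {

    // Hypothetical stand-in for withColumnRenamed:
    // returns a NEW array and never modifies the one passed in.
    static String[] renameOne(String[] cols, String oldName, String newName) {
        String[] copy = cols.clone();
        for (int i = 0; i < copy.length; i++) {
            if (copy[i].equals(oldName)) {
                copy[i] = newName;
            }
        }
        return copy;
    }

    // Loop from the question: every iteration starts again from the original input.
    static String[] restartEachTime(String[] in) {
        String[] out = in;
        for (String name : in) {
            out = renameOne(in, name, name.replace("_", " "));
        }
        return out;
    }

    // Corrected loop: every iteration continues from the previous result.
    static String[] accumulate(String[] in) {
        String[] out = in;
        for (String name : in) {
            out = renameOne(out, name, name.replace("_", " "));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] cols = {"PARTITION_DATE", "date_start", "dt_con_ID", "dt_fin_ID"};
        System.out.println(String.join("|", restartEachTime(cols)));
        // prints: PARTITION_DATE|date_start|dt_con_ID|dt fin ID
        System.out.println(String.join("|", accumulate(cols)));
        // prints: PARTITION DATE|date start|dt con ID|dt fin ID
    }
}
```

The first line of output matches what the question shows: only the last column is renamed.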
Upvotes: 1