PRIYA M

Reputation: 181

Replace Special characters of column names in Spark dataframe

I have my input spark-dataframe named df as,

+---------------+----------------+-----------------------+
|Main_CustomerID|126+ Concentrate|2.5 Ethylhexyl_Acrylate|
+---------------+----------------+-----------------------+
|         725153|             3.0|                    2.0|
|         873008|             4.0|                    1.0|
|         625109|             1.0|                    0.0|
+---------------+----------------+-----------------------+

I need to remove the special characters from the column names of df.

So my df should look like this:

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|         725153|            3.0|                    2.0|
|         873008|            4.0|                    1.0|
|         625109|            1.0|                    0.0|
+---------------+---------------+-----------------------+

Using Scala, I have achieved this by,

var tableWithColumnsRenamed = df

for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
}
for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}

df = tableWithColumnsRenamed

When I used,

for (field <- tableWithColumnsRenamed.columns) {
  tableWithColumnsRenamed = tableWithColumnsRenamed
    .withColumnRenamed(field, field.replaceAll("\\.", "_"))
    .withColumnRenamed(field, field.replaceAll("\\+", ""))
    .withColumnRenamed(field, field.replaceAll(" ", "_"))
}

I got the first column name as 126 Concentrate instead of 126_Concentrate.

But I would prefer not to use three for loops for this replacement. Is there a better solution?

Upvotes: 3

Views: 28195

Answers (5)

akgarg511

Reputation: 1

We can remove all the special characters by mapping each column name to a new name, using replaceAll for the respective characters. This single expression is tried and tested with Spark Scala:

import org.apache.spark.sql.functions.col

df.select(
    df.columns
      // alias each column, dropping "+" and turning "." and " " into "_"
      .map(colName => col(s"`${colName}`").as(colName.replaceAll("\\+", "").replaceAll("[. ]", "_"))): _*
  ).show(false)
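
A closely related variant (a sketch of my own, not part of the answer above) is to build the cleaned names first and pass them all to toDF, which renames every column in one call; cleanedNames and renamedDF are just illustrative names:

// Sketch: compute the cleaned names up front, then rename all columns at once with toDF
val cleanedNames = df.columns.map(_.replaceAll("\\+", "").replaceAll("[. ]+", "_"))
val renamedDF = df.toDF(cleanedNames: _*)
renamedDF.show(false)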

Upvotes: 0

kevin_theinfinityfund

Reputation: 2157

Piggybacking on Ramesh's answer, here is a reusable function that uses currying syntax with the .transform() method and also lower-cases the column names:

import org.apache.spark.sql.DataFrame

// Format all column names with the given regex and lower-case them
def formatAllColumns(regex_string: String)(df: DataFrame): DataFrame = {
  val replacingColumns = df.columns.map(regex_string.r.replaceAllIn(_, "_"))
  val resultDF: DataFrame = replacingColumns.zip(df.columns).foldLeft(df){
    (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1.toLowerCase())
  }
  resultDF
}

val resultDF = df.transform(formatAllColumns(regex_string = """[+._(), ]+"""))
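
Because formatAllColumns is curried into a DataFrame => DataFrame function, .transform() also lets you chain the rename with any other step; the extra fill step below is only an illustration, not part of the original answer:

// Illustrative chaining: compose the rename with another DataFrame => DataFrame step
val cleaned = df
  .transform(formatAllColumns("""[+._(), ]+"""))
  .transform((d: DataFrame) => d.na.fill(0.0)) // e.g. replace nulls in numeric columns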

Upvotes: 0

NiharGht

Reputation: 161

In Java you can iterate over the column names using df.columns() and fix each header string with String.replaceAll(regexPattern, replacement).

Then use withColumnRenamed(headerName, correctedHeaderName) to rename the DataFrame header.

For example:

for (String headerName : dataset.columns()) {
    // "+" must be escaped in a regex; drop it, then turn dots and spaces into "_"
    String correctedHeaderName = headerName.replaceAll("\\+", "").replaceAll("[. ]", "_");
    dataset = dataset.withColumnRenamed(headerName, correctedHeaderName);
}
dataset.show();

Upvotes: 0

Chandan Ray

Reputation: 2091

You can fold over the column names and rename each one in a single pass:

df
  .columns
  .foldLeft(df){ (newdf, colname) =>
    // drop "+" and turn spaces and dots into "_"
    newdf.withColumnRenamed(colname, colname.replace("+", "").replace(" ", "_").replace(".", "_"))
  }
  .show

Upvotes: 11

Ramesh Maharjan

Reputation: 41957

You can use withColumnRenamed, a regex with replaceAllIn, and foldLeft as below:

val columns = df.columns

val regex = """[+._, ]+"""
val replacingColumns = columns.map(regex.r.replaceAllIn(_, "_"))

val resultDF = replacingColumns.zip(columns).foldLeft(df){
  (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1)
}

resultDF.show(false)

which should give you

+---------------+---------------+-----------------------+
|Main_CustomerID|126_Concentrate|2_5_Ethylhexyl_Acrylate|
+---------------+---------------+-----------------------+
|725153         |3.0            |2.0                    |
|873008         |4.0            |1.0                    |
|625109         |1.0            |0.0                    |
+---------------+---------------+-----------------------+

I hope the answer is helpful.

Upvotes: 8
