Conditional replace of special characters in pyspark dataframe

Question

I have as data frame df in pyspark. It has columns like eng hours, eng_hours, test apt, test.rest and so on...

I want to replace spaces and dot in column names with underscore(_).

If after replace the column if there are any duplicates then return the column names in which we replace the character and concatenate it.

For example: In the above data frame we have two columns eng hours and eng_hours. Now after we replace the space with underscore in the first column we will get eng_hours, which will be a duplicate to the second column eng_hours. When this situation arises then return the first column as enghours.

How can we achieve that in pyspark.

Suresh · Accepted Answer

Did a small change to include column name duplicate part, please try this,

>>> from pyspark.sql.types import *
>>>import re
>>> l=[('val1','val2','val3'),('val4','val5','val6')]
>>> l_schema = StructType([StructField("eng hours",StringType(),True),StructField("eng_hours",StringType(),True),StructField("test.apt",StringType(),True)])
>>> rdd = sc.parallelize(l)
>>> df = sqlContext.createDataFrame(rdd,l_schema)
>>> reps=('.','_'),(' ','_')
>>> df.printSchema()
root
 |-- eng hours: string (nullable = true)
 |-- eng_hours: string (nullable = true)
 |-- test.apt: string (nullable = true)

>>> colnames = df.schema.names

>>> def colrename(x):
...      newcol = reduce(lambda a,kv : a.replace(*kv),reps,x)
...      return re.sub('[. ]','',x) if newcol in colnames else newcol

>>> for i in colnames:
...       df = df.withColumnRenamed(i,colrename(i))
>>> df.printSchema()
root
 |-- enghours: string (nullable = true)
 |-- eng_hours: string (nullable = true)
 |-- test_apt: string (nullable = true)

Conditional replace of special characters in pyspark dataframe

Answers (1)

Related Questions