Patterson

Reputation: 2759

PySpark Replace Characters using regex and remove column on Databricks

I am trying to remove a column and some special characters from the dataframe shown below.

The code used to create the dataframe is as follows:

dt = pd.read_csv(StringIO(response.text), delimiter="|", encoding='utf-8-sig')

The above produces the following output:

[screenshot of the dataframe output]

I need help with a regex to remove the characters, and with deleting the first column.

As regards regex, I have tried the following:

dt.withColumn('COUNTRY ID', regexp_replace('COUNTRY ID', @"[^0-9a-zA-Z_]+"_ ""))

However, I'm getting a syntax error.
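For what it's worth, the character class itself is valid; the syntax error comes from the `@"..."` string literal (C#-style, not Python) and the `_` where the argument separator should be a comma. A quick sketch with Python's built-in `re` module shows what the pattern strips (the sample string here is hypothetical):

```python
import re

# Hypothetical header carrying a UTF-8 BOM and a space
raw = "\ufeffCOUNTRY ID"

# Same character class as above: drop anything that is not
# alphanumeric or an underscore (note this also drops spaces)
cleaned = re.sub(r"[^0-9a-zA-Z_]+", "", raw)
print(cleaned)  # COUNTRYID
```

Note that without `\s` in the class the space is removed as well, which may not be what you want for a header like `COUNTRY ID`.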

Any help much appreciated.

Upvotes: 0

Views: 1077

Answers (2)

wwnde

Reputation: 26676

You have read the data in as a pandas dataframe, but from what I can see you want a Spark dataframe. Convert from pandas to Spark; that will drop the pandas default index column, which in your case is the first column you refer to. You can then rename the columns. Code below:

df = spark.createDataFrame(dt).toDF('COUNTRY', 'COUNTRY NAME')
df.show()

Upvotes: 0

Anjaneya Tripathi

Reputation: 1459

If the position of the incoming column is fixed, you can use a regex to strip the extra characters from the column name, like below:


import re

# Clean the first column's name by stripping everything except
# alphanumerics, underscores, and whitespace
colname = pdf.columns[0]
colt = re.sub(r"[^0-9a-zA-Z_\s]+", "", colname)
print(colname, colt)
pdf.rename(columns={colname: colt}, inplace=True)
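A self-contained run of the same idea, on a hypothetical dataframe whose first header carries a UTF-8 BOM:

```python
import re

import pandas as pd

# Hypothetical dataframe; the first header carries a stray BOM
pdf = pd.DataFrame({"\ufeffCOUNTRY ID": [1, 2], "COUNTRY NAME": ["AU", "NZ"]})

colname = pdf.columns[0]
# \s in the class keeps the space inside "COUNTRY ID"
colt = re.sub(r"[^0-9a-zA-Z_\s]+", "", colname)
pdf.rename(columns={colname: colt}, inplace=True)

print(list(pdf.columns))  # ['COUNTRY ID', 'COUNTRY NAME']
```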

And for dropping the index column, you can refer to this stack answer.
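If you stay in pandas rather than converting to Spark, the first column can also be dropped directly by position (a minimal sketch on hypothetical data):

```python
import pandas as pd

# Hypothetical dataframe where the first column is an unwanted index
pdf = pd.DataFrame({"Unnamed: 0": [0, 1], "COUNTRY ID": [10, 20]})

# Drop the first column, whatever its name is
pdf = pdf.drop(columns=pdf.columns[0])
print(list(pdf.columns))  # ['COUNTRY ID']
```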

Upvotes: 1
