Patterson

Reputation: 2759

PySpark Replace Characters using regex and remove column on Databricks

I am trying to remove a column and some special characters from the dataframe shown below.

The code used to create the dataframe is as follows:

dt = pd.read_csv(StringIO(response.text), delimiter="|", encoding='utf-8-sig')

The above produces the following output:

[screenshot of the dataframe output]

I need help with a regex to remove the characters, and with deleting the first column.

As regards regex, I have tried the following:

dt.withColumn('COUNTRY ID', regexp_replace('COUNTRY ID', @"[^0-9a-zA-Z_]+"_ ""))

However, I'm getting a syntax error.
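For what it's worth, the character class itself is valid; the syntax error comes from the `@"..."` string literal (C#-style, not Python) and the `_` where the argument separator should be a comma. A quick sketch with Python's built-in `re` module shows what the pattern strips (the sample string here is hypothetical):

```python
import re

# Hypothetical header carrying a UTF-8 BOM and a space
raw = "\ufeffCOUNTRY ID"

# Same character class as above: drop anything that is not
# alphanumeric or an underscore (note this also drops spaces)
cleaned = re.sub(r"[^0-9a-zA-Z_]+", "", raw)
print(cleaned)  # COUNTRYID
```

Note that without `\s` in the class the space is removed as well, which may not be what you want for a header like `COUNTRY ID`.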

Any help much appreciated.

Upvotes: 0

Views: 1077

Answers (2)

wwnde

Reputation: 26676

You have read the data in as a pandas dataframe, but from what I can see you want a Spark dataframe. Convert from pandas to Spark; that will drop the pandas default index column, which in your case is the first column you refer to. You can then rename the columns. Code below:

df = spark.createDataFrame(dt).toDF('COUNTRY', 'COUNTRY NAME')
df.show()

Upvotes: 0

Anjaneya Tripathi

Reputation: 1459

If the position of the incoming column is fixed, you can use a regex to strip the extra characters from the column name, like below:


import re

# Clean the first column's name by stripping everything except
# alphanumerics, underscores, and whitespace
colname = pdf.columns[0]
colt = re.sub(r"[^0-9a-zA-Z_\s]+", "", colname)
print(colname, colt)
pdf.rename(columns={colname: colt}, inplace=True)
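A self-contained run of the same idea, on a hypothetical dataframe whose first header carries a UTF-8 BOM:

```python
import re

import pandas as pd

# Hypothetical dataframe; the first header carries a stray BOM
pdf = pd.DataFrame({"\ufeffCOUNTRY ID": [1, 2], "COUNTRY NAME": ["AU", "NZ"]})

colname = pdf.columns[0]
# \s in the class keeps the space inside "COUNTRY ID"
colt = re.sub(r"[^0-9a-zA-Z_\s]+", "", colname)
pdf.rename(columns={colname: colt}, inplace=True)

print(list(pdf.columns))  # ['COUNTRY ID', 'COUNTRY NAME']
```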

And for dropping the index column, you can refer to this stack answer.
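If you stay in pandas rather than converting to Spark, the first column can also be dropped directly by position (a minimal sketch on hypothetical data):

```python
import pandas as pd

# Hypothetical dataframe where the first column is an unwanted index
pdf = pd.DataFrame({"Unnamed: 0": [0, 1], "COUNTRY ID": [10, 20]})

# Drop the first column, whatever its name is
pdf = pdf.drop(columns=pdf.columns[0])
print(list(pdf.columns))  # ['COUNTRY ID']
```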

Upvotes: 1
