Reputation: 1013
I have a large dataset and some columns have String data-type. Because of typo mistakes, some cells contain None values written in different styles (lowercase or capital letters, with or without spaces, with or without brackets, etc.).
I want to find all those values and convert them into null. A sample dataset is below:
data = [
    ("A", "None", 1),
    ("A", "(None)", 2),
    ("A", "none", 3),
    ("A", "[None]", 4),
    ("A", "(none)", 5),
    ("A", "(none", 6),
    ("A", "none ", 7),
    (" NOne ", "B", 8),
]
# Create DataFrame
columns = ["col_1", "col_2", "Number"]
df = spark.createDataFrame(data=data, schema=columns)
Any idea how to do that?
Upvotes: 2
Views: 177
Reputation: 26676
I assume valid values in col_2 contain only alphanumeric characters. The steps are:
1. Upper-case all values in col_2.
2. Replace non-alphanumeric characters with nothing.
3. Remove leading and trailing spaces.
4. Use df.replace to replace NONE with null.
Code below.
from pyspark.sql.functions import regexp_replace, trim, upper

df.withColumn('col_2', trim(regexp_replace(upper('col_2'), r'\W', ''))).na.replace('NONE', None).show()
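For reference, the same normalization logic can be sketched in plain Python with the `re` module (the helper `normalize_none` is my own illustration, not part of the Spark API), which may help verify which inputs collapse to null:

```python
import re


def normalize_none(value):
    """Hypothetical helper: return None if value is any styled spelling
    of "None" (e.g. "(none)", "NOne ", "[None]"), else the value unchanged.

    Mirrors the Spark expression above: upper-case, strip non-word
    characters, compare against "NONE".
    """
    if value is None:
        return None
    # \W removes brackets, parentheses, and spaces in one pass
    cleaned = re.sub(r"\W", "", value.upper())
    return None if cleaned == "NONE" else value


samples = ["None", "(None)", "none", "[None]", "(none)", "(none", "none ", "B"]
print([normalize_none(s) for s in samples])
```

Note that running the same transformation over col_1 as well would also catch the " NOne " value in row 8 of the sample data.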
Upvotes: 1