Mohammad

Reputation: 1013

How to find cells with None values (string data-type) and replace them with Null in Spark DataFrame?

I have a large dataset and some columns have String data-type. Because of typos, some cells contain None values written in different styles (lower or upper case, with or without spaces, with or without brackets, etc.).

I want to find all those values and convert them into null. A sample dataset is below:

data = [
    ("A", "None", 1),
    ("A", "(None)", 2),
    ("A", "none", 3),
    ("A", "[None]", 4),
    ("A", "(none)", 5),
    ("A", "(none", 6),
    ("A", "none  ", 7),
    (" NOne  ", "B", 8),
]

# Create DataFrame
columns = ["col_1", "col_2", "Number"]
df = spark.createDataFrame(data=data, schema=columns)


Any idea how to do that?

Upvotes: 2

Views: 177

Answers (1)

wwnde

Reputation: 26676

Assuming col_2 is not expected to contain any non-alphanumeric characters:

- Make all values in col_2 upper case.
- Replace non-alphanumerics with nothing.
- Remove leading and trailing spaces.
- Use df.replace to replace 'NONE' with null.

Code below.

 from pyspark.sql.functions import trim, regexp_replace, upper

 df.withColumn('col_2', trim(regexp_replace(upper('col_2'), r'\W', ''))).na.replace('NONE', None).show()
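For readers without a Spark session handy, the same normalize-then-null logic can be sketched in plain Python. The regex mirrors the Spark expression above; `normalize_none` is a hypothetical helper name, not part of any library:

```python
import re

def normalize_none(value):
    # Mirror the Spark expression: upper-case, drop non-word
    # characters (\W, which includes spaces and brackets),
    # trim, then map the literal 'NONE' to None (null).
    cleaned = re.sub(r'\W', '', value.upper()).strip()
    return None if cleaned == 'NONE' else cleaned

samples = ["None", "(None)", "none", "[None]", "(none)", "(none", "none  ", "B"]
print([normalize_none(s) for s in samples])
# Every None-style variant collapses to None; "B" is kept unchanged.
```

Note that because `\W` already removes whitespace, the final `strip()` (like `trim` in the Spark version) is a belt-and-braces step.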

Upvotes: 1
