Using dictionary in regexp_replace function in pyspark

Question

I want to perform a regexp_replace operation on a PySpark dataframe column using a dictionary.

Dictionary: {'RD':'ROAD','DR':'DRIVE','AVE':'AVENUE',....} The dictionary will have around 270 key-value pairs.

Input Dataframe:

ID | Address
1  | 22, COLLINS RD
2  | 11, HEMINGWAY DR
3  | AVIATOR BUILDING
4  | 33, PARK AVE MULLOHAND DR

Desired Output Dataframe:

ID | Address                   | Address_Clean
1  | 22, COLLINS RD            | 22, COLLINS ROAD
2  | 11, HEMINGWAY DR          | 11, HEMINGWAY DRIVE
3  | AVIATOR BUILDING          | AVIATOR BUILDING
4  | 33, PARK AVE MULLOHAND DR | 33, PARK AVENUE MULLOHAND DRIVE

I could not find any documentation for this online. If I pass the dictionary directly, as in the code below:

data=data.withColumn('Address_Clean',regexp_replace('Address',dict))

it throws the error "regexp_replace takes 3 arguments, 2 given".

The dataset will have around 20 million rows, so a UDF solution will be slow (because it operates row by row), and we don't have access to Spark 2.3.0, which supports pandas_udf. Is there an efficient way of doing this, other than perhaps using a loop?

Answers (1)
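
One possible approach, offered as a minimal sketch rather than a definitive solution: regexp_replace takes a column, a pattern string and a replacement string, so the dictionary cannot be passed as a single argument, but one call per entry can be nested into a single column expression. The reduce below does loop over the dictionary, but only on the driver while the expression is being built; the 20 million rows are then processed by Spark's built-in function rather than a Python UDF. The names abbreviations, word_pattern and clean_address are hypothetical, and the sketch assumes the keys are plain tokens that should match only as whole words.

```python
import re
from functools import reduce

# Sample of the question's dictionary; the full mapping has around 270 pairs.
abbreviations = {"RD": "ROAD", "DR": "DRIVE", "AVE": "AVENUE"}


def word_pattern(key):
    """Regex matching `key` only as a whole word, so 'DR' does not hit 'DRESDEN'."""
    # Assumption: keys are plain tokens; re.escape guards against stray metacharacters.
    return r"\b" + re.escape(key) + r"\b"


def clean_address(column, mapping):
    """Hypothetical helper: nest one regexp_replace call per dictionary entry."""
    # Imported here so the helper can be defined without a Spark installation.
    from pyspark.sql.functions import regexp_replace

    return reduce(
        lambda acc, item: regexp_replace(acc, word_pattern(item[0]), item[1]),
        mapping.items(),
        column,
    )
```

With col imported from pyspark.sql.functions, it could be applied as data = data.withColumn('Address_Clean', clean_address(col('Address'), abbreviations)). Two caveats: the replacements are applied in dictionary order, so a value produced by one entry could be rewritten by a later key if it matches as a whole word, and a replacement value containing $ or \ would be interpreted by the underlying Java regex engine.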