Luke

Reputation: 7089

Pyspark replace strings in Spark dataframe column

I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example this dataframe:

id     address
1       2 foo lane
2       10 bar lane
3       24 pants ln

Would become

id     address
1       2 foo ln
2       10 bar ln
3       24 pants ln

Upvotes: 83

Views: 302554

Answers (4)

Michał Jabłoński

Reputation: 1277

Spark 3.5 introduced the replace function, which accepts Column arguments and is pretty efficient.

Works like this:

from pyspark.sql.functions import replace

df = spark.createDataFrame([("ABCabc", "abc", "DEF",)], ["a", "b", "c"])
# every occurrence of column b's value inside a is replaced with column c's value -> "ABCDEF"
df.select(replace(df.a, df.b, df.c).alias('r')).show()
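
Applied to the question's addresses, the search and replacement strings have to be wrapped in lit(), since replace treats bare strings as column names (a minimal sketch, assuming Spark 3.5+ and the sample data from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, replace

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "2 foo lane"), (2, "10 bar lane"), (3, "24 pants ln")],
    ["id", "address"],
)
# lit() marks "lane" and "ln" as literal values, not column references
df.withColumn("address", replace("address", lit("lane"), lit("ln"))).show()

Note that replace performs plain substring replacement, not regex matching.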

Upvotes: 0

Daniel de Paula

Reputation: 17872

For Spark 1.5 or later, you can use the functions package:

from pyspark.sql.functions import regexp_replace
newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))

Quick explanation:

  • The function withColumn is called to add a column to the DataFrame (or replace it, if the name already exists).
  • The function regexp_replace generates a new column by replacing every substring that matches the given regular expression.
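
One caveat: regexp_replace matches anywhere in the string, so a bare 'lane' pattern would also rewrite longer words that merely contain it. A minimal sketch with word boundaries, using the sample data from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "2 foo lane"), (2, "10 bar lane"), (3, "24 pants ln")],
    ["id", "address"],
)
# \b anchors the match to whole words, so e.g. "laneway" is left untouched
newDf = df.withColumn("address", regexp_replace("address", r"\blane\b", "ln"))
newDf.show()
# +---+-----------+
# | id|    address|
# +---+-----------+
# |  1|   2 foo ln|
# |  2|  10 bar ln|
# |  3|24 pants ln|
# +---+-----------+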

Upvotes: 189

Aishwarya

Reputation: 11

My suggestion is to import regexp_replace from the SQL functions package and use withColumn to modify the existing column in the DataFrame. In this case we need to replace 'lane' with 'ln' in the address column.

from pyspark.sql.functions import regexp_replace

replacedf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))
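
If several substrings need normalizing at once, the regexp_replace calls can be chained in a loop (a hypothetical sketch; the replacements mapping below is made up for illustration):

from pyspark.sql.functions import regexp_replace

# hypothetical abbreviation map, not from the original answer
replacements = {"lane": "ln", "street": "st", "road": "rd"}
for old, new in replacements.items():
    # df is assumed to be the question's DataFrame with an address column
    df = df.withColumn("address", regexp_replace("address", rf"\b{old}\b", new))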

Upvotes: 0

loneStar

Reputation: 4010

For Scala:

import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.functions.col

// "\\*" escapes the regex metacharacter, so every literal asterisk is removed
data.withColumn("addr_new", regexp_replace(col("addr_line"), "\\*", ""))

Upvotes: 11
