Reputation: 7089
I'd like to perform some basic stemming on a Spark DataFrame column by replacing substrings. What's the quickest way to do this?
In my current use case, I have a list of addresses that I want to normalize. For example, this DataFrame:
id  address
1   2 foo lane
2   10 bar lane
3   24 pants ln
Would become
id  address
1   2 foo ln
2   10 bar ln
3   24 pants ln
Upvotes: 83
Views: 302554
Reputation: 1277
In Spark 3.5 they introduced the replace function, which accepts Column arguments and is pretty efficient. It works like this:
from pyspark.sql.functions import replace
df = spark.createDataFrame([("ABCabc", "abc", "DEF",)], ["a", "b", "c"])
df.select(replace(df.a, df.b, df.c).alias('r')).show()
# r is "ABCDEF": each occurrence of b ("abc") in a is replaced with c ("DEF")
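The arguments can also be literal columns built with lit. A minimal sketch applying it to the question's addresses (assumes Spark 3.5+; note that replace does plain substring replacement, not regex):

from pyspark.sql.functions import replace, lit
df = spark.createDataFrame(
    [(1, "2 foo lane"), (2, "10 bar lane"), (3, "24 pants ln")],
    ["id", "address"])
df.withColumn("address", replace("address", lit("lane"), lit("ln"))).show()
# address becomes: 2 foo ln, 10 bar ln, 24 pants ln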
Upvotes: 0
Reputation: 17872
For Spark 1.5 or later, you can use the functions package:
from pyspark.sql.functions import regexp_replace
newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))
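Applied to the question's sample data, this yields the normalized addresses. A minimal sketch; the word-boundary pattern \blane\b is an extra precaution (not in the original answer) so that 'lane' inside a longer word is left alone:

from pyspark.sql.functions import regexp_replace
df = spark.createDataFrame(
    [(1, "2 foo lane"), (2, "10 bar lane"), (3, "24 pants ln")],
    ["id", "address"])
df.withColumn("address", regexp_replace("address", r"\blane\b", "ln")).show()
# address becomes: 2 foo ln, 10 bar ln, 24 pants ln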
Quick explanation:
withColumn is called to add (or replace, if the name already exists) a column to the data frame.
regexp_replace will generate a new column by replacing all substrings that match the pattern.
Upvotes: 189
Reputation: 11
My suggestion is to import the SQL functions package and use withColumn to modify the existing column in the DataFrame. In this case we need to replace 'lane' with 'ln' in the address column.
from pyspark.sql.functions import regexp_replace
replacedf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))
Upvotes: 0
Reputation: 4010
For Scala:
import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.functions.col
// Remove literal asterisks from addr_line; the * must be escaped since the pattern is a regex
data.withColumn("addr_new", regexp_replace(col("addr_line"), "\\*", ""))
Upvotes: 11