user
user

Reputation: 771

How to search through strings in Pyspark column and selectively replace some strings (containing specific substrings) with a variable?

I have a column in a Spark Dataframe that contains a list of strings. I am hoping to do the following and am not sure how:

  1. Search the column for the presence of a substring, if this substring is present, replace that string with a word.
  2. If one of the desired substrings is not present, then replace the string with 'other'

Sample SDF:

data = [
    [1, ["EQUCAB", "EQUCAS", "CANMA"]],
    [2, ["CANMA", "FELCAT", "SUSDO"]],
    [3, ["BOSTAU", "EQUCAB"]],
]

df = pd.DataFrame(data, columns=["Item", "String"])
df["String"] = [",".join(map(str, l)) for l in df["String"]]

sdf = spark.createDataFrame(df)

Desired output:

data = [
    [1, ["horse", "horse", "other"]],
    [2, ["other", "cat", "other"]],
    [3, ["cow", "horse"]],
]

df = pd.DataFrame(data, columns=["Item", "String"])
df["String"] = [",".join(map(str, l)) for l in df["String"]]

so basically, any element that contains EQU is assigned horse, any element that contains FEL is assigned cat, any element that contains BOS is assigned cow, and the rest are assigned other.

Upvotes: 0

Views: 455

Answers (1)

Robert Kossendey
Robert Kossendey

Reputation: 6998

from pyspark.sql.functions import when, col, lit

df = df.withColum("String",
                    when(col('String').contains('EQU'), lit('horse'))
                    .when(col('String').contains('FEL'), lit('cat'))
                    .when(col('String').contains('BOS'), lit('cow')).otherwise(lit('other')))

Upvotes: 1

Related Questions