Justin C.

Reputation: 382

Searching for substring across multiple columns

I am trying to find a substring across all columns of my spark dataframe using PySpark. I currently know how to search for a substring through one column using filter and contains:

df.filter(df.col_name.contains('substring'))

How do I extend this statement, or utilize another, to search through multiple columns for substring matches?

Upvotes: 1

Views: 1830

Answers (2)

Retko

Reputation: 382

You can search each column for the substring, filter the matching rows, and union the results into a new DataFrame, like this:

columns = ["language", "else"]
data = [
    ("Java", "Python"),
    ("Python", "100000"),
    ("Scala", "3000"),
]
df = spark.createDataFrame(data).toDF(*columns)
df.cache()
df.show()

schema = df.schema
df2 = spark.createDataFrame(data=[], schema=schema)

for col in df.columns:
    df2 = df2.unionByName(df.filter(df[col].like("%Python%")))

df2.show()
+--------+------+
|language|  else|
+--------+------+
|  Python|100000|
|    Java|Python|
+--------+------+

The result contains the first two rows, because each of them has the value 'Python' in at least one of its columns.
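
Note that if a row contained 'Python' in more than one column, this union would include that row once per matching column. A dropDuplicates() call on df2 would collapse such repeats (a minimal sketch building on the df2 above, not part of the original answer):

# Collapse rows that matched in several columns.
df2 = df2.dropDuplicates()
df2.show()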

Upvotes: 1

pissall

Reputation: 7399

You can generalize the statement and apply the filter in one go:

from pyspark.sql.functions import col, when
# Replace values that do not contain the substring with NULL, then drop any row containing a NULL.
df = df.select([when(col(c).contains('substring'), col(c)).alias(c) for c in df.columns]).na.drop()

OR

You can simply loop over the columns and apply the same filter:

# Chaining filters keeps only rows where every column contains the substring.
# The loop variable is renamed so it does not shadow the imported col function.
for column in df.columns:
    df = df.filter(df[column].contains("substring"))
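
If instead a row should be kept when any column contains the substring, one option (a sketch under that assumption, not part of this answer) is to build a single OR condition with functools.reduce; the name df_any below is just illustrative:

from functools import reduce
from pyspark.sql.functions import col

# Build one boolean condition over the original (unfiltered) DataFrame df:
# True if at least one column contains the substring.
condition = reduce(lambda a, b: a | b,
                   [col(c).contains("substring") for c in df.columns])
df_any = df.filter(condition)
df_any.show()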

Upvotes: 3
