Justin C.

Reputation: 382

Searching for substring across multiple columns

I am trying to find a substring across all columns of my spark dataframe using PySpark. I currently know how to search for a substring through one column using filter and contains:

df.filter(df.col_name.contains('substring'))

How do I extend this statement, or utilize another, to search through multiple columns for substring matches?

Upvotes: 1

Views: 1830

Answers (2)

Retko

Reputation: 382

You can search each column for the substring, filter the matching rows, and union the results into a new DataFrame, like this:

columns = ["language", "else"]
data = [
    ("Java", "Python"),
    ("Python", "100000"),
    ("Scala", "3000"),
]
df = spark.createDataFrame(data).toDF(*columns)
df.cache()
df.show()

schema = df.schema
df2 = spark.createDataFrame(data=[], schema=schema)

for col in df.columns:
    df2 = df2.unionByName(df.filter(df[col].like("%Python%")))

df2.show()
+--------+------+
|language|  else|
+--------+------+
|  Python|100000|
|    Java|Python|
+--------+------+

The result contains the first two rows, because each of them has the value 'Python' in at least one of its columns.
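
Note that if a row contained 'Python' in more than one column, this union would include that row once per matching column. A dropDuplicates() call on df2 would collapse such repeats (a minimal sketch building on the df2 above, not part of the original answer):

# Collapse rows that matched in several columns.
df2 = df2.dropDuplicates()
df2.show()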

Upvotes: 1

pissall

Reputation: 7399

You can generalize the statement and apply the filter in one go:

from pyspark.sql.functions import col, when
# Replace values that do not contain the substring with NULL, then drop any row containing a NULL.
df = df.select([when(col(c).contains('substring'), col(c)).alias(c) for c in df.columns]).na.drop()

OR

You can simply loop over the columns and apply the same filter:

# Chaining filters keeps only rows where every column contains the substring.
# The loop variable is renamed so it does not shadow the imported col function.
for column in df.columns:
    df = df.filter(df[column].contains("substring"))
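
If instead a row should be kept when any column contains the substring, one option (a sketch under that assumption, not part of this answer) is to build a single OR condition with functools.reduce; the name df_any below is just illustrative:

from functools import reduce
from pyspark.sql.functions import col

# Build one boolean condition over the original (unfiltered) DataFrame df:
# True if at least one column contains the substring.
condition = reduce(lambda a, b: a | b,
                   [col(c).contains("substring") for c in df.columns])
df_any = df.filter(condition)
df_any.show()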

Upvotes: 3
