John Doe

Reputation: 10203

Search for keyword in all DataFrame columns and filter

I need to search for a value in all Spark DataFrame columns.

I've tried this:

for col_name in df.columns:
  filter_df = df.where(df[col_name].like('%searchterm%'))

But there are no matches in filter_df (filter_df.count() = 0).

If I test it with a column name that I know contains the keyword, I do get results.

Upvotes: 1

Views: 2691

Answers (3)

s.polam

Reputation: 10372

Check the code below: get all columns of type string, then create a filter condition over those columns.

from functools import reduce
from pyspark.sql.functions import col

# Get all columns of type string.
stringCols = [
    f.name for f in df.schema.fields
    if f.dataType.typeName() == 'string'
]

# Build one condition by OR-ing a like filter per string column.
# I have used or; see the sketch after the output for an and variant.
expr = reduce(
    lambda x, y: x | y,
    [col(c).like('%ravi%') for c in stringCols]
)

df = spark.createDataFrame([
    (1, 'srinivas', 'sse'),
    (2, 'ravi', 'engineer'),
    (3, 'Raju', 'ravi se')
  ],
  ['id', 'name', 'designation']
)

df.show()
+---+--------+-----------+
| id|    name|designation|
+---+--------+-----------+
|  1|srinivas|        sse|
|  2|    ravi|   engineer|
|  3|    Raju|    ravi se|
+---+--------+-----------+

df.filter(expr).show() # Filter checks for 'ravi' in all string columns.
+---+----+-----------+
| id|name|designation|
+---+----+-----------+
|  2|ravi|   engineer|
|  3|Raju|    ravi se|
+---+----+-----------+
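
If you instead wanted rows where the keyword appears in every string column, a minimal sketch (reusing stringCols from above) swaps | for & in the reduce:

# Sketch: AND the conditions so 'ravi' must appear in every string column.
expr_all = reduce(lambda x, y: x & y, [col(c).like('%ravi%') for c in stringCols])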

Upvotes: 0

murtihash

Reputation: 8410

You can implement an at-least-one-column match using coalesce over when-statement columns: each when returns 1 only if its column matches (and null otherwise), and coalesce takes the first non-null value, so the filter keeps rows where at least one column matched.

from pyspark.sql import functions as F

df.filter(
    F.coalesce(*[F.when(F.col(x).like('%searchterm%'), F.lit(1)) for x in df.columns]) == 1
)
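
If some of the columns are not strings, an explicit cast keeps the like comparison well-defined; a sketch of the same pattern with a cast:

# Sketch: cast each column to string so non-string columns
# (e.g. integers) are searched by their string form as well.
df.filter(
    F.coalesce(
        *[F.when(F.col(c).cast('string').like('%searchterm%'), F.lit(1)) for c in df.columns]
    ) == 1
)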

Upvotes: 2

ernest_k

Reputation: 45319

The problem is that your loop reassigns filter_df on every iteration, so only the last column's condition survives; and if you chained the filters instead, you would build a data frame whose rows have to meet every condition, as though you combined each column's condition with AND.

You should build the condition separately with OR, as in:

cols = df.columns

# Start with the first column's condition, then OR in the rest.
filt = df[cols[0]].like('%searchterm%')

for col in cols[1:]:
    filt = filt | df[col].like('%searchterm%')

filter_df = df.where(filt)
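
For reuse, a minimal sketch wrapping the same OR-chaining in a helper (the function name and the cast to string are illustrative, not from the original answer):

from functools import reduce

def search_all_columns(df, term):
    # OR together one like condition per column; the cast lets
    # non-string columns be searched by their string form as well.
    conditions = [df[c].cast('string').like('%' + term + '%') for c in df.columns]
    return df.where(reduce(lambda a, b: a | b, conditions))

filter_df = search_all_columns(df, 'searchterm')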

Upvotes: 3
