Reputation: 10203
I need to search for a value in all Spark DataFrame columns.
I've tried this:
for col_name in df.columns:
    filter_df = df.where(df[col_name].like('%searchterm%'))
But there are no matches in filter_df (filter_df.count() is 0).
If I test it with a single column name that I know contains the keyword, I do get results.
Upvotes: 1
Views: 2691
Reputation: 10372
Check the code below: get all columns of type string and build a filter condition over those columns.
from functools import reduce
from pyspark.sql.functions import col

# Get all columns of type string.
stringCols = map(
    lambda x: x.name,
    filter(
        lambda x: x.dataType.typeName() == 'string',
        df.schema.fields
    )
)

# Build a single expression, OR-ing the `like` condition across the string columns.
expr = reduce(
    lambda x, y: (x | y),
    map(lambda x: col(x).like('%ravi%'), stringCols)
)
df = spark.createDataFrame(
    [(1, 'srinivas', 'sse'),
     (2, 'ravi', 'engineer'),
     (3, 'Raju', 'ravi se')],
    ['id', 'name', 'designation']
)
df.show()
+---+--------+-----------+
| id|    name|designation|
+---+--------+-----------+
|  1|srinivas|        sse|
|  2|    ravi|   engineer|
|  3|    Raju|    ravi se|
+---+--------+-----------+
df.filter(expr).show()  # Keep rows where 'ravi' appears in any of the string columns.
+---+----+-----------+
| id|name|designation|
+---+----+-----------+
|  2|ravi|   engineer|
|  3|Raju|    ravi se|
+---+----+-----------+
Upvotes: 0
Reputation: 8410
You can do an "at least one column matches" check by using coalesce over a when expression per column.
from pyspark.sql import functions as F

df.filter(
    F.coalesce(*[F.when(F.col(x).like('%searchterm%'), F.lit(1)) for x in df.columns]) == 1
)
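As a quick sanity check, here is a minimal sketch of the same idea against a toy DataFrame (the sample rows and the existing SparkSession named spark are assumptions for illustration, not from the question). when() returns NULL where its condition is false, so coalesce() only yields 1 for rows where at least one column matched.
from pyspark.sql import functions as F

# Assumed sample data, reusing the example from the other answer.
sample = spark.createDataFrame(
    [(1, 'srinivas', 'sse'), (2, 'ravi', 'engineer'), (3, 'Raju', 'ravi se')],
    ['id', 'name', 'designation']
)

# One when() per column: 1 if the column matches, NULL otherwise.
# coalesce() picks the first non-NULL value, so the row passes if any column matched.
cond = F.coalesce(*[F.when(F.col(c).like('%ravi%'), F.lit(1)) for c in sample.columns]) == 1
sample.filter(cond).show()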
Upvotes: 2
Reputation: 45319
The problem is that when you filter in cascade using your conditions, you end up building a DataFrame whose rows have to satisfy every condition, as though you had combined each column's condition with AND.
You should instead build a single condition with OR, as in:
cols = df.columns
# Start from the first column's condition, then OR in the remaining columns.
filt = df[cols[0]].like('%searchterm%')
for col in cols[1:]:
    filt = filt | df[col].like('%searchterm%')
filter_df = df.where(filt)
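Equivalently, the OR chain can be folded into a single expression with functools.reduce; this is just a compact sketch of the same loop, under the same assumption that df is your DataFrame:
from functools import reduce

# Fold the per-column LIKE conditions together with OR.
filt = reduce(
    lambda acc, c: acc | df[c].like('%searchterm%'),
    df.columns[1:],
    df[df.columns[0]].like('%searchterm%')
)
filter_df = df.where(filt)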
Upvotes: 3