gaatjeniksaan

Reputation: 1431

Filter df when a value matches part of a string in pyspark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e. filter for) all rows where the URL stored in the location column contains a pre-determined string, e.g. 'google.com'.

I have tried:

import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

But this throws:

TypeError: 'Column' object is not callable

How do I go about filtering my df properly?

Upvotes: 81

Views: 255595

Answers (4)

Rakesh Chintha

Reputation: 705

You can try the following expression, which helps you search for multiple strings at the same time:

df.filter(""" location rlike 'google.com|amazon.com|github.com' """)

Upvotes: 1

mrsrinivas

Reputation: 35404

Spark 2.2 onwards

df.filter(df.location.contains('google.com'))

Spark 2.2 documentation link


Spark 2.1 and before

You can use plain SQL in filter

df.filter("location like '%google.com%'")

or with DataFrame column methods

df.filter(df.location.like('%google.com%'))

Spark 2.1 documentation link
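Putting it together, here is a minimal, self-contained sketch; the sample URLs and data are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy DataFrame with a 'location' column, just for illustration
df = spark.createDataFrame(
    [('https://www.google.com/search?q=spark',),
     ('https://example.org/page',)],
    ['location'],
)

df.filter(df.location.contains('google.com')).show()     # Spark 2.2+
df.filter("location like '%google.com%'").show()          # also works on 2.1 and earlier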

Upvotes: 157

joaofbsm

Reputation: 635

pyspark.sql.Column.contains() is only available in pyspark version 2.2 and above.

df.where(df.location.contains('google.com'))
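For what it's worth, where() is simply an alias for filter() on a DataFrame, so this is equivalent to the form in the accepted answer:

# where() is an alias for filter(), so these two calls are interchangeable
df.where(df.location.contains('google.com'))
df.filter(df.location.contains('google.com'))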

Upvotes: 23

caffreyd

Reputation: 1203

When filtering a DataFrame on string values, I find that pyspark.sql.functions.lower and pyspark.sql.functions.upper come in handy if your data could contain column entries like both "foo" and "Foo":

import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
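As a quick usage sketch (the column name and values below are made up for illustration), this keeps matching rows regardless of their original casing:

import pyspark.sql.functions as sql_fun
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy data with mixed-case entries
source_df = spark.createDataFrame([('Foo bar',), ('baz',)], ['col_name'])

result = source_df.filter(sql_fun.lower(source_df.col_name).contains('foo'))
result.show()  # only the 'Foo bar' row survives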

Upvotes: 8
