Reputation: 1601
I am trying to filter my pyspark dataframe based on an OR condition like so:
filtered_df = file_df.filter(file_df.dst_name == "ntp.obspm.fr").filter(file_df.fw == "4940" | file_df.fw == "4960")
I want to return only rows where file_df.fw == "4940" OR file_df.fw == "4960". However, when I try this I get this error:
Py4JError: An error occurred while calling o157.or. Trace:
py4j.Py4JException: Method or([class java.lang.String]) does not exist
What am I doing wrong?
Without the OR condition it works when I filter on only one condition (file_df.fw == "4940").
Upvotes: 2
Views: 8665
Reputation: 19365
The error message is caused by the different precedence of the operators: | (OR) binds more tightly than the comparison operator ==. Spark therefore tries to apply the OR to "4940" and file_df.fw, and not, as you intended, to (file_df.fw == "4940") and (file_df.fw == "4960"). You can fix the precedence by adding parentheses. Have a look at the following example:
columns = ['dst_name', 'fw']
file_df = spark.createDataFrame([('ntp.obspm.fr', '3000'),
                                 ('ntp.obspm.fr', '4940'),
                                 ('ntp.obspm.fr', '4960'),
                                 ('ntp.obspm.de', '4940')],
                                columns)
# here I have added the parentheses
filtered_df = file_df.filter(file_df.dst_name == "ntp.obspm.fr").filter((file_df.fw == "4940") | (file_df.fw == "4960"))
filtered_df.show()
Output:
+------------+----+
| dst_name| fw|
+------------+----+
|ntp.obspm.fr|4940|
|ntp.obspm.fr|4960|
+------------+----+
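As an alternative sketch (assuming the same file_df as above): when you compare one column against several literal values, you can sidestep the precedence issue entirely with Column.isin, which needs no | at all:

# Column.isin tests membership in a set of values,
# so no bitwise | (and no extra parentheses) is needed
filtered_df = (file_df
               .filter(file_df.dst_name == "ntp.obspm.fr")
               .filter(file_df.fw.isin("4940", "4960")))
filtered_df.show()

This produces the same output as the version with parentheses.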
Upvotes: 3