eml

Reputation: 439

Getting error when using where() or filter() on Dataframe

I want to check whether the value in the DataFrame column first_id is in a Python list of ids that I have. If it is, the row should pass the filter.

first_id_list = [1,2,3,4,5,6,7,8,9]

other_ids = id_dataframe.where(ids["first_id"] in first_id_list).select("other_id")

I'm writing in Python. id_dataframe is a PySpark DataFrame and first_id_list is a Python list of integers.

The error I'm getting is:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Upvotes: 0

Views: 384

Answers (1)

Pierre Gourseaud

Reputation: 2477

There's a problem in this expression: ids["first_id"] in first_id_list

ids["first_id"] is a PySpark Column. first_id_list is a Python list.

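The PySpark DataFrame method where() requires a boolean Column to evaluate, but you are giving it a plain Python boolean expression. Python's in operator compares ids["first_id"] with each list element and then tries to convert the resulting Column to a Python bool, which raises the ValueError you are seeing.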

You must use the PySpark Column method isin() (documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin)

Answer:

other_ids = id_dataframe.where(ids["first_id"].isin(first_id_list)).select("other_id")

Now ids["first_id"].isin(first_id_list) is a DataFrame boolean expression returning a boolean column.
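To see why the in operator fails while isin() works, here is a minimal pure-Python sketch. FakeColumn is a hypothetical stand-in written for this illustration, not the real pyspark.sql.Column, and the strings it builds are only meant to resemble a deferred expression. It assumes nothing about Spark internals beyond the two behaviours described above: comparing a column yields another column, and a column cannot be converted to a Python bool.

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column (illustration only)."""

    def __init__(self, expr):
        self.expr = expr  # textual form of the deferred expression

    def __eq__(self, other):
        # Comparison builds a new expression instead of returning True/False.
        return FakeColumn(f"({self.expr} = {other})")

    def __bool__(self):
        # A deferred expression has no single truth value.
        raise ValueError("Cannot convert column into bool")

    def isin(self, values):
        # Builds one boolean expression covering the whole list.
        joined = ", ".join(str(v) for v in values)
        return FakeColumn(f"({self.expr} IN ({joined}))")


first_id_list = [1, 2, 3]
col = FakeColumn("first_id")

# `in` compares col with each element, then asks for the truth value
# of the result, which triggers __bool__ and therefore the ValueError.
try:
    col in first_id_list
    in_raised = False
except ValueError:
    in_raised = True

# isin() never asks for a truth value; it just returns a new expression.
expr = col.isin(first_id_list)
print(in_raised, expr.expr)
```

The same reasoning is why the error message suggests '&', '|' and '~' instead of and, or and not: the keyword forms also force a truth-value conversion, while the operator forms can return a new column expression.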

Upvotes: 2
