Reputation: 439
I want to check whether the value in the DataFrame column first_id
is in a Python list of ids that I have. If it is, the row should pass the filter.
first_id_list = [1,2,3,4,5,6,7,8,9]
other_ids = id_dataframe.where(ids["first_id"] in first_id_list).select("other_id")
I'm writing in Python. id_dataframe
is a PySpark DataFrame and first_id_list
is a Python list of integers.
The error I'm getting is:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Upvotes: 0
Views: 384
Reputation: 2477
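The problem is in this expression: ids["first_id"] in first_id_list
ids["first_id"]
is a PySpark Column. first_id_list
is a Python list.
The PySpark DataFrame method where()
requires a boolean Column to evaluate. Python's in operator instead compares the Column with each list element and tries to reduce each result to a single Python bool. A Column cannot be converted to a bool, which is what raises the ValueError.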
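To see why the in operator fails, here is a minimal sketch in plain Python. FakeColumn is a hypothetical stand-in written for this illustration, not the real PySpark class. It only assumes the two behaviours that matter here: == builds a new expression object, and converting that object to a bool raises a ValueError.

```python
class FakeColumn:
    """Hypothetical stand-in for a PySpark Column (not the real class)."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Assumed behaviour: == builds a new expression instead of returning True/False.
        return FakeColumn(f"({self.name} = {other})")

    def __bool__(self):
        # Assumed behaviour: the expression refuses to collapse into one Python bool.
        raise ValueError("Cannot convert column into bool")


first_id_list = [1, 2, 3]
column = FakeColumn("first_id")

try:
    # "in" compares the column with each list element using ==,
    # then calls bool() on the result, which triggers the error.
    column in first_id_list
    outcome = "no error"
except ValueError as exc:
    outcome = f"ValueError: {exc}"

print(outcome)
```

The membership test never gets the chance to build a per-row expression, because Python itself insists on a plain True or False before where() is even called.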
Use the PySpark Column method isin() instead
(documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin)
Answer:
other_ids = id_dataframe.where(ids["first_id"].isin(first_id_list)).select("other_id")
Now ids["first_id"].isin(first_id_list)
is a Column expression that evaluates to a boolean for each row, which is what where() expects.
Upvotes: 2