Reputation: 1657
I created a dataframe that has the following schema:
In [43]: yelp_df.printSchema()
root
|-- business_id: string (nullable = true)
|-- cool: integer (nullable = true)
|-- date: string (nullable = true)
|-- funny: integer (nullable = true)
|-- id: string (nullable = true)
|-- stars: integer (nullable = true)
|-- text: string (nullable = true)
|-- type: string (nullable = true)
|-- useful: integer (nullable = true)
|-- user_id: string (nullable = true)
|-- name: string (nullable = true)
|-- full_address: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- neighborhoods: string (nullable = true)
|-- open: boolean (nullable = true)
|-- review_count: integer (nullable = true)
|-- state: string (nullable = true)
I want to select only the records whose "open" column is true. The following command, run in PySpark, returns nothing:
yelp_df.filter(yelp_df["open"] == "true").collect()
Upvotes: 20
Views: 87747
Reputation: 2775
For those who wonder how to write the opposite condition (when the column is False), it is done this way:
from pyspark.sql import functions as F
filtered_df = df.filter(~F.col('my_bool_col'))
Upvotes: 8
Reputation: 817
from pyspark.sql import functions as F
filtered_df = df.filter(F.col('my_bool_col'))
Upvotes: 23
Reputation: 2536
Looks like you're on PySpark. From the filter documentation:
filter(condition)
- condition is a Column of types.BooleanType or a string of SQL expression.
Since open is boolean (nullable = true), the following works and avoids Flake8's E712 error:
yelp_df.filter(yelp_df["open"]).collect()
Upvotes: 2
Reputation: 41
In Spark (Scala), I can think of two approaches. Approach 1: use a Spark SQL command to get all the Boolean columns by creating a temporary view and selecting only the Boolean columns from the whole dataframe. However, this requires the Boolean columns to be known in advance, or fetched from the schema based on data type.
// define the Boolean column names (unquoted, so SQL reads them as columns rather than string literals)
val SqlBoolCols = "boolcolumn1, boolcolumn2, boolcolumn3"
dataframe.createOrReplaceTempView("Booltable")
val dfwithboolcolumns = sqlcontext.sql(s"Select ${SqlBoolCols} from Booltable")
Approach 2: if the schema is defined, select columns by data type (shown here for StringType; use BooleanType to pick the Boolean columns)
val strcolnames = rawdata.schema.fields.filter(x=>x.dataType == StringType).map(strtype=>strtype.name)
val strdataframe= rawdata.select(strcolnames.head,strcolnames.tail:_*)
Upvotes: 3
Reputation: 9846
You're comparing data types incorrectly. open is listed as a Boolean value, not a string, so doing yelp_df["open"] == "true" is incorrect: "true" is a string.
Instead you want to do
yelp_df.filter(yelp_df["open"] == True).collect()
This correctly compares the values of open against the Boolean primitive True, rather than the non-Boolean string "true".
Upvotes: 26