Nasreddin

Reputation: 1657

How to filter a Spark dataframe by a boolean column?

I created a dataframe that has the following schema:

In [43]: yelp_df.printSchema()
root
 |-- business_id: string (nullable = true)
 |-- cool: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: integer (nullable = true)
 |-- id: string (nullable = true)
 |-- stars: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- type: string (nullable = true)
 |-- useful: integer (nullable = true)
 |-- user_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- full_address: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- neighborhoods: string (nullable = true)
 |-- open: boolean (nullable = true)
 |-- review_count: integer (nullable = true)
 |-- state: string (nullable = true)

I want to select only the records where the "open" column is true. The following command, run in PySpark, returns nothing:

yelp_df.filter(yelp_df["open"] == "true").collect()

Upvotes: 20

Views: 87747

Answers (5)

Andrey Semakin

Reputation: 2775

For those who wonder how to write the opposite condition (when the column is False), it is done this way:

from pyspark.sql import functions as F

filtered_df = df.filter(~F.col('my_bool_col'))
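
One subtlety, shown in a minimal sketch with hypothetical data: filter() keeps only the rows where the condition evaluates to true, so rows where the column is NULL are dropped by both F.col('my_bool_col') and ~F.col('my_bool_col').

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; the None row is dropped by both filters,
# because a NULL condition never evaluates to true
df = spark.createDataFrame(
    [(1, True), (2, False), (3, None)],
    "id INT, my_bool_col BOOLEAN",
)

df.filter(~F.col('my_bool_col')).show()  # keeps only id 2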

Upvotes: 8

X_Trust

Reputation: 817

from pyspark.sql import functions as F

filtered_df = df.filter(F.col('my_bool_col'))
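
Since filter() also accepts a SQL expression as a string, the same condition can be written as:

filtered_df = df.filter('my_bool_col')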

Upvotes: 23

Luis Meraz

Reputation: 2536

Looks like you're on PySpark. From the filter documentation:

filter(condition) - condition is a Column of types.BooleanType or a string of SQL expression.

Since open: boolean (nullable = true), the following works and avoids Flake8's E712 error:

yelp_df.filter(yelp_df["open"]).collect()

Upvotes: 2

user11428312

Reputation: 41

In Spark with Scala, I can think of two approaches.

Approach 1: Run a Spark SQL command that creates a temporary view and selects only the boolean columns from the whole dataframe. However, this requires the boolean columns to be known in advance, or fetched from the schema based on their data type.

    // define the boolean columns to select (bare names, not quoted string literals)
    val sqlBoolCols = "boolcolumn1, boolcolumn2, boolcolumn3"

    dataframe.createOrReplaceTempView("Booltable")
    val dfWithBoolCols = spark.sql(s"SELECT ${sqlBoolCols} FROM Booltable")

Approach 2: Select columns by data type if the schema is defined (shown here for StringType; use BooleanType for the boolean columns):

import org.apache.spark.sql.types.StringType

// pick the names of all columns whose data type is StringType
val strColNames = rawdata.schema.fields.filter(_.dataType == StringType).map(_.name)
val strDataframe = rawdata.select(strColNames.head, strColNames.tail: _*)
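
For comparison, a minimal PySpark sketch of the same idea, assuming the goal is the boolean columns (so BooleanType rather than StringType):

from pyspark.sql.types import BooleanType

# pick the names of all columns whose data type is BooleanType
bool_cols = [f.name for f in yelp_df.schema.fields if isinstance(f.dataType, BooleanType)]
bool_df = yelp_df.select(*bool_cols)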

Upvotes: 3

Akshat Mahajan

Reputation: 9846

You're comparing data types incorrectly. open is listed as a Boolean value, not a string, so the comparison yelp_df["open"] == "true" is incorrect: "true" is a string.

Instead you want to do

yelp_df.filter(yelp_df["open"] == True).collect()

This correctly compares the values of open against the Boolean primitive True, rather than the non-Boolean string "true".
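
As an end-to-end check, a minimal sketch with hypothetical rows mirroring the question's schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with just the two relevant columns
yelp_df = spark.createDataFrame(
    [("b1", True), ("b2", False), ("b3", True)],
    "business_id STRING, open BOOLEAN",
)

yelp_df.filter(yelp_df["open"] == True).collect()
# [Row(business_id='b1', open=True), Row(business_id='b3', open=True)]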

Upvotes: 26
