iratelilkid

Reputation: 105

Filtering a dataframe in pyspark

I have a dataframe that looks like this,

[Row(cardholder_nbr=u'10', category_nbr=87, club_nbr=6279, membership_nbr=u'780941902', retail_all=51.23, sub_category_nbr=1, system_item_nbr=3572924, unit_qty=1.0, visit_date=u'2016-08-05', visit_nbr=1608057826021405)]

I want to filter it based on a visit date range between, say, 2016-09-15 and 2016-09-23. How do I do it? If I do it like this,

df = sqlContext.sql("SELECT * from df_table WHERE visit_date >= '2016-09-15' AND visit_date <= '2016-09-23'")

I get this error:

"invalid literal for int()"

How do I filter it properly?

Upvotes: 1

Views: 7131

Answers (1)

Prem

Reputation: 11985

Can you try this and let us know if it works?

from pyspark.sql import Row
import pyspark.sql.functions as f

# Sample data matching the question's schema; the second row falls inside the target range
data = [Row(cardholder_nbr=u'10', category_nbr=87, club_nbr=6279, membership_nbr=u'780941902', retail_all=51.23, sub_category_nbr=1, system_item_nbr=3572924, unit_qty=1.0, visit_date=u'2016-08-05', visit_nbr=1608057826021405),
        Row(cardholder_nbr=u'10', category_nbr=87, club_nbr=6279, membership_nbr=u'780941902', retail_all=51.23, sub_category_nbr=1, system_item_nbr=3572924, unit_qty=1.0, visit_date=u'2016-09-16', visit_nbr=1608057826021405)]
df = sc.parallelize(data).toDF()
df.show()
# String comparison is safe here because visit_date is in yyyy-MM-dd format,
# so lexicographic order matches chronological order
df.where((f.col("visit_date") >= '2016-09-15') & (f.col("visit_date") <= '2016-09-23')).show()
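
The same range filter can also be written with Column.between, or after casting visit_date to a proper DateType column. A minimal sketch of both variants, assuming the same df built above (df_dates is just an illustrative name, not anything from your code):

import pyspark.sql.functions as f

# between() is equivalent to the two explicit comparisons above
df.where(f.col("visit_date").between('2016-09-15', '2016-09-23')).show()

# Or cast the string column to DateType first (visit_date is already yyyy-MM-dd),
# which also makes later date arithmetic easier
df_dates = df.withColumn("visit_date", f.to_date(f.col("visit_date")))
df_dates.where(f.col("visit_date").between('2016-09-15', '2016-09-23')).show()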

Upvotes: 3
