Sam Comber

Reputation: 1293

Similar pyspark logic returns different number of rows in dataframe

Why do the following two lines of code produce a different result?

email_response.filter("first_response_date > date'2020-11-2'").count()

The above returns 176671203 rows.

email_response.filter(F.col("first_response_date") > F.lit("2020-11-2")).count()

The above returns 52063066 rows.

The logic appears identical, so why do the results differ?

Upvotes: 0

Views: 90

Answers (1)

mck

Reputation: 42352

The second line compares the column to the string "2020-11-2", not to a date. If you add .cast("date") to the literal in the second line, I'd expect both filters to return the same count.

email_response.filter(F.col("first_response_date") > F.lit("2020-11-2").cast("date")).count()
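
To see why a string comparison can keep fewer rows than a date comparison, here is a minimal sketch in plain Python (no Spark involved). It assumes the comparison ends up being done character by character, which may or may not be what Spark does here depending on the column type and Spark version. The sample values and the `parse` helper are hypothetical, made up only for illustration.

```python
from datetime import date

# Hypothetical sample values, formatted the way a zero-padded date renders as text.
samples = ["2020-10-31", "2020-11-02", "2020-11-10", "2020-11-30", "2020-12-01"]
threshold = "2020-11-2"


def parse(text):
    """Turn 'YYYY-M-D' or 'YYYY-MM-DD' text into a datetime.date (illustrative helper)."""
    year, month, day = (int(part) for part in text.split("-"))
    return date(year, month, day)


# Lexicographic comparison: characters are compared left to right,
# so "2020-11-10" sorts before "2020-11-2" because '1' < '2'.
as_strings = [s for s in samples if s > threshold]

# Date comparison: both sides are real calendar dates.
as_dates = [s for s in samples if parse(s) > parse(threshold)]

print("string comparison keeps:", as_strings)
print("date comparison keeps:  ", as_dates)
```

In this sketch the string comparison drops 2020-11-10 even though it is later than 2 November, which is the kind of mismatch that casting the literal to a date avoids.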

Upvotes: 1
