Reputation: 983
pyspark's 'between' function is not inclusive for timestamp input.
For example, if we want all rows between two dates, say, '2017-04-13' and '2017-04-14', then it performs an "exclusive" search when the dates are passed as strings, i.e., it omits the rows with '2017-04-14 00:00:00'.
However, the documentation seems to hint that it is inclusive (though it makes no mention of timestamps).
Of course, one way is to add a microsecond to the upper bound and pass it to the function. However, that is not a great fix. Is there a clean way of doing an inclusive search?
Example:
import pandas as pd
from pyspark.sql import functions as F
... sql_context creation ...
test_pd=pd.DataFrame([{"start":'2017-04-13 12:00:00', "value":1.0},{"start":'2017-04-14 00:00:00', "value":1.1}])
test_df = sql_context.createDataFrame(test_pd).withColumn("start", F.col("start").cast('timestamp'))
test_df.show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
test_df.filter(F.col("start").between('2017-04-13','2017-04-14')).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
+--------------------+-----+
Upvotes: 29
Views: 116039
Reputation: 171
Just to be clear, if you want to get data for a single date, it's better to specify the exact time.
e.g. retrieve data only for a single day (2017-04-13):
test_df.filter(F.col("start").between('2017-04-13 00:00:00','2017-04-13 23:59:59.59'))
Note: if you set the range as between '2017-04-13' and '2017-04-14', this will also include the 2017-04-14 00:00:00 data, which technically isn't the data you want to pull out, since it belongs to 2017-04-14.
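An upper bound of '23:59:59.59' still leaves out the last fraction of a second of the day. One possible alternative is a half-open range: start of the day inclusive, start of the next day exclusive. The sketch below shows the idea in plain Python, with no Spark involved; `day_bounds` and `in_day` are hypothetical helper names, not part of pyspark.

```python
from datetime import datetime, timedelta

def day_bounds(day):
    """Return (start, end) for a 'YYYY-MM-DD' string; end is exclusive."""
    start = datetime.strptime(day, "%Y-%m-%d")
    return start, start + timedelta(days=1)

def in_day(ts, day):
    """True if the datetime ts falls on the given calendar day."""
    start, end = day_bounds(day)
    # Half-open check: midnight of the next day is NOT part of this day.
    return start <= ts < end

print(in_day(datetime(2017, 4, 13, 23, 59, 59, 999999), "2017-04-13"))  # True
print(in_day(datetime(2017, 4, 14, 0, 0, 0), "2017-04-13"))             # False
```

With the test_df from the question, the same bounds could presumably be used as test_df.filter((F.col("start") >= start) & (F.col("start") < end)) instead of .between(), since .between() includes both ends.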
Upvotes: 4
Reputation: 1530
The .between() method is always inclusive. The problem in your example is that when you pass strings to .between(), it treats your data as strings as well. In a string comparison, '2017-04-14 00:00:00' is strictly greater than '2017-04-14' because the latter is a prefix of the former, and a longer string sorts after its own prefix. This is why the second date is filtered out in your example. To avoid the "inconsistency", pass the arguments to .between() as datetime objects, as follows:
from datetime import datetime as dt
filtered_df = (test_df.filter(F.col("start")
    .between(dt.strptime('2017-04-13 12:00:00', '%Y-%m-%d %H:%M:%S'),
             dt.strptime('2017-04-14 00:00:00', '%Y-%m-%d %H:%M:%S'))))
This will produce the expected result:
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
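The string-versus-datetime difference can be reproduced in plain Python, without Spark. This is only an illustrative sketch: the variable names are made up, and it mirrors the comparison rules described in this answer rather than Spark's internals.

```python
from datetime import datetime

upper_as_string = "2017-04-14"
row_as_string = "2017-04-14 00:00:00"

# Lexicographic comparison: a string that extends its own prefix sorts after it,
# so the row looks "greater" than the upper bound and falls outside the range.
string_in_range = "2017-04-13" <= row_as_string <= upper_as_string

# Comparing real datetime values: midnight on the 14th equals the upper bound,
# so an inclusive range check keeps the row.
lower = datetime(2017, 4, 13)
upper = datetime(2017, 4, 14)
row = datetime.strptime(row_as_string, "%Y-%m-%d %H:%M:%S")
datetime_in_range = lower <= row <= upper

print(string_in_range)    # False
print(datetime_in_range)  # True
```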
Upvotes: 12
Reputation: 983
Found the answer: pyspark's "between" function is inconsistent in how it handles timestamp inputs.
For the above example, passing the bounds through pd.to_datetime instead of as date-only strings returns both rows, i.e. an inclusive search:
test_df.filter(F.col("start").between(pd.to_datetime('2017-04-13'),pd.to_datetime('2017-04-14'))).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
Similarly, if we provide the date AND time in string format, it seems to perform an inclusive search:
test_df.filter(F.col("start").between('2017-04-13 12:00:00','2017-04-14 00:00:00')).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
Upvotes: 32