Reputation: 379
I am attempting to create a new column, Tax_Year, holding the appropriate tax year. The tax year is found by checking whether each datetime in the date column falls within the boundaries of the tuple elements for the individual txYear_
...
import datetime as dt
import numpy as np
import pandas as pd

salesReport = pd.DataFrame({'date': ['2017-07-02 09:00:00', '2017-07-03 15:00:00', '2018-04-05 15:00:00',
                                     '2018-12-20 11:00:00', '2019-01-06 14:00:00', '2020-09-06 17:00:00'],
                            'sales': [100, 339, 98, 1020, 630, 765]})
txYear_0304 = (dt.datetime(2003, 4, 6), dt.datetime(2004, 4, 5))
txYear_0405 = (dt.datetime(2004, 4, 6), dt.datetime(2005, 4, 5))
txYear_0506 = (dt.datetime(2005, 4, 6), dt.datetime(2006, 4, 5))
txYear_0607 = (dt.datetime(2006, 4, 6), dt.datetime(2007, 4, 5))
txYear_0708 = (dt.datetime(2007, 4, 6), dt.datetime(2008, 4, 5))
txYear_0809 = (dt.datetime(2008, 4, 6), dt.datetime(2009, 4, 5))
txYear_0910 = (dt.datetime(2009, 4, 6), dt.datetime(2010, 4, 5))
txYear_1011 = (dt.datetime(2010, 4, 6), dt.datetime(2011, 4, 5))
txYear_1112 = (dt.datetime(2011, 4, 6), dt.datetime(2012, 4, 5))
txYear_1213 = (dt.datetime(2012, 4, 6), dt.datetime(2013, 4, 5))
txYear_1314 = (dt.datetime(2013, 4, 6), dt.datetime(2014, 4, 5))
txYear_1415 = (dt.datetime(2014, 4, 6), dt.datetime(2015, 4, 5))
txYear_1516 = (dt.datetime(2015, 4, 6), dt.datetime(2016, 4, 5))
txYear_1617 = (dt.datetime(2016, 4, 6), dt.datetime(2017, 4, 5))
txYear_1718 = (dt.datetime(2017, 4, 6), dt.datetime(2018, 4, 5))
txYear_1819 = (dt.datetime(2018, 4, 6), dt.datetime(2019, 4, 5))
txYear_1920 = (dt.datetime(2019, 4, 6), dt.datetime(2020, 4, 5))
txYear_2021 = (dt.datetime(2020, 4, 6), dt.datetime(2021, 4, 5))
tax_year = [txYear_0304, txYear_0405, txYear_0506, txYear_0607, txYear_0708, txYear_0809, txYear_0910, txYear_1011, txYear_1112,
txYear_1213, txYear_1314, txYear_1415, txYear_1516, txYear_1617, txYear_1718, txYear_1819, txYear_1920, txYear_2021]
When this condition is met, I would like the variable name to appear in the corresponding row of the new column.
For example:
                  date  sales     Tax_Year
0  2017-07-02 09:00:00    100  txYear_1718
1  2017-07-03 15:00:00    339  txYear_1718
2  2018-04-05 15:00:00     98  txYear_1718
3  2018-12-20 11:00:00   1020  txYear_1819
4  2019-01-06 14:00:00    630  txYear_1819
5  2020-09-06 17:00:00    765  txYear_2021
I have approached this problem using np.where
....
salesReport['Tax_Year'] = np.where(tax_year[0] <= salesReport['date'] and tax_year[1] >= salesReport['date'], tax_year, np.nan)
However, I cannot resolve the error that I receive:
TypeError: '>=' not supported between instances of 'str' and 'tuple'
In addition, I am unsure how to get the variable name. At present I would be returning the actual tuple contents, which is not what I want.
Upvotes: 1
Views: 53
Reputation: 8962
I'm not proficient in pandas, so I wouldn't be surprised if there is a nicer way to do this.
I've converted the list of tax-year tuples into a dictionary (tax_years) that maps each (start, end) pair to a label, and defined a standalone function that returns the tax year of a given datetime object. I'm not 100% sure what time of day the tax year begins or ends, so the comparison uses only the calendar date and ignores the time component of the timestamps in the dataframe.
import pandas as pd
import numpy as np
import datetime
tax_years = {
    (datetime.datetime(2003, 4, 6), datetime.datetime(2004, 4, 5)): "TY0304",
    (datetime.datetime(2004, 4, 6), datetime.datetime(2005, 4, 5)): "TY0405",
    (datetime.datetime(2005, 4, 6), datetime.datetime(2006, 4, 5)): "TY0506",
    (datetime.datetime(2006, 4, 6), datetime.datetime(2007, 4, 5)): "TY0607",
    (datetime.datetime(2007, 4, 6), datetime.datetime(2008, 4, 5)): "TY0708",
    (datetime.datetime(2008, 4, 6), datetime.datetime(2009, 4, 5)): "TY0809",
    (datetime.datetime(2009, 4, 6), datetime.datetime(2010, 4, 5)): "TY0910",
    (datetime.datetime(2010, 4, 6), datetime.datetime(2011, 4, 5)): "TY1011",
    (datetime.datetime(2011, 4, 6), datetime.datetime(2012, 4, 5)): "TY1112",
    (datetime.datetime(2012, 4, 6), datetime.datetime(2013, 4, 5)): "TY1213",
    (datetime.datetime(2013, 4, 6), datetime.datetime(2014, 4, 5)): "TY1314",
    (datetime.datetime(2014, 4, 6), datetime.datetime(2015, 4, 5)): "TY1415",
    (datetime.datetime(2015, 4, 6), datetime.datetime(2016, 4, 5)): "TY1516",
    (datetime.datetime(2016, 4, 6), datetime.datetime(2017, 4, 5)): "TY1617",
    (datetime.datetime(2017, 4, 6), datetime.datetime(2018, 4, 5)): "TY1718",
    (datetime.datetime(2018, 4, 6), datetime.datetime(2019, 4, 5)): "TY1819",
    (datetime.datetime(2019, 4, 6), datetime.datetime(2020, 4, 5)): "TY1920",
    (datetime.datetime(2020, 4, 6), datetime.datetime(2021, 4, 5)): "TY2021",
}
salesReport = pd.DataFrame({'date': ['2017-07-02 09:00:00',
                                     '2017-07-03 15:00:00',
                                     '2018-04-05 15:00:00',
                                     '2018-12-20 11:00:00',
                                     '2019-01-06 14:00:00',
                                     '2020-09-06 17:00:00'],
                            'sales': [100, 339, 98, 1020, 630, 765]})
salesReport["date"] = pd.to_datetime(salesReport["date"])
def get_tax_year(date):
    # Compare calendar dates only, ignoring the time of day.
    for (start, end), tax_year in tax_years.items():
        if start.date() <= date.date() <= end.date():
            return tax_year
    # No tax year in the dictionary covers this date.
    return "null"

salesReport["tax_year"] = [get_tax_year(date) for date in salesReport["date"]]
print(salesReport)
And the output:
                 date  sales tax_year
0 2017-07-02 09:00:00    100   TY1718
1 2017-07-03 15:00:00    339   TY1718
2 2018-04-05 15:00:00     98   TY1718
3 2018-12-20 11:00:00   1020   TY1819
4 2019-01-06 14:00:00    630   TY1819
5 2020-09-06 17:00:00    765   TY2021
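If the row-by-row loop becomes a bottleneck, the same labels can probably be derived without the lookup table. The following is only a minimal sketch, assuming every tax year runs from 6 April to 5 April, as in the tuples above. The helper name tax_year_labels is hypothetical, and unlike the dictionary version it does not return "null" for dates outside 2003 to 2021.

```python
import pandas as pd


def tax_year_labels(dates: pd.Series) -> pd.Series:
    """Label each timestamp with its tax year, e.g. 'TY1718'.

    Assumption: a tax year always starts on 6 April. Dates outside
    the 2003-2021 table are labelled too, not rejected.
    """
    dates = pd.to_datetime(dates)
    # True where the date is on or after 6 April of its calendar year.
    on_or_after_start = (dates.dt.month > 4) | (
        (dates.dt.month == 4) & (dates.dt.day >= 6)
    )
    # Dates before 6 April belong to the tax year that began the previous year.
    start_year = dates.dt.year - (~on_or_after_start).astype(int)
    # Two-digit, zero-padded start and end years, e.g. '09' and '10'.
    start = (start_year % 100).astype(str).str.zfill(2)
    end = ((start_year + 1) % 100).astype(str).str.zfill(2)
    return "TY" + start + end


salesReport = pd.DataFrame({'date': ['2017-07-02 09:00:00',
                                     '2017-07-03 15:00:00',
                                     '2018-04-05 15:00:00',
                                     '2018-12-20 11:00:00',
                                     '2019-01-06 14:00:00',
                                     '2020-09-06 17:00:00'],
                            'sales': [100, 339, 98, 1020, 630, 765]})
salesReport["tax_year"] = tax_year_labels(salesReport["date"])
print(salesReport)
```

This works on whole columns at once through the .dt accessor, so it avoids calling a Python function per row. The trade-off is that the 6 April boundary is hard-coded rather than read from the table.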
Upvotes: 1