Reputation: 49
I am a beginner in python and stuck on translating a piece of code from SAS to Python (see below). How would I recreate the "intnx"-command in line 10 in Python, which introduces a 3-month-lag to the actual date (datadate) in order to create the variable "HouDate".
data data._1_compustat_prep (keep = gvkey tic datadate fyear HouDate HouYear ib at dvc act che
lct dlc txp dp ibc oancf);
retain gvkey tic datadate fyear HouDate HouYear;
set data._0_compustat_students;
if indfmt = 'INDL' and datafmt = 'STD';
format HouDate DDMMYY10.;
format DataDate DDMMYY10.;
HouDate = intnx("month", datadate, +3, "E");
if month(HouDate) >= 7 then HouYear = year(HouDate) + 1; else HouYear = year(HouDate);
if mdy(07,01,1963) <= HouDate <= mdy(06,30,2009);
if exchg in (11,12,14,15,17);
if 6000 <= sich <= 6999 then delete;
run;
Upvotes: 1
Views: 664
Reputation: 1185
You can use the datetime
and [dateutil
] libraries to help you with this. In particular the datetime.datetime
and datetime.timedelta
dateutil.realtivedelta.relativedelta
classes.
The dateutil.relativedelta.relativedelta
class is better than the datetime.timedelta
here, because the latter won't allow you to specify a delta with units bigger than days, when the former does.
Additionally, I've added type hints in the code, to make it easier to understand. I corresponds to the syntax xx:yy
where xx
is the variable name, and yy
its type.
from datetime import datetime
from dateutil.relativedelta import relativedelta
lag:relativedelta = relativedelta(months=3)
datadate:datetime = datetime(year=2022, month=8, day=25, hour=11, minute=37, second=8)
HouDate:datetime = datadate + lag
HouYear:int = HouDate.year
if HouDate.month >= 7:
HouYear += 1
print("{} || {}".format(HouYear, HouDate))
2023 || 2022-11-25 11:37:08
As the author mentioned in one of their comment:
how would you do the commands above for whole columns in a dataframe?
Here is the way I'd do it when using a pandas.DataFrame
:
from datetime import datetime
from dateutil.relativedelta import relativedelta
import pandas as pd
lag:relativedelta = relativedelta(months=3)
df:pd.DataFrame = pd.DataFrame({
"datadate": [
datetime(year=2022, month=1, day=1, hour=1, minute=1, second=1),
datetime(year=2022, month=6, day=2, hour=2, minute=2, second=2),
datetime(year=2022, month=10, day=3, hour=3, minute=3, second=3)
]
})
df["HouDate"] = df.datadate.apply(lambda x: x+lag)
df["HouYear"] = df.HouDate.apply(lambda x: x.year if x.month < 7 else x.year+1)
print(df)
Pre-treatment dataframe:
datadate | |
---|---|
0 | 2022-01-01 01:01:01 |
1 | 2022-06-02 02:02:02 |
2 | 2022-10-03 03:03:03 |
Post-treatment dataframe:
datadate | HouDate | HouYear | |
---|---|---|---|
0 | 2022-01-01 01:01:01 | 2022-04-01 01:01:01 | 2022 |
1 | 2022-06-02 02:02:02 | 2022-09-02 02:02:02 | 2023 |
2 | 2022-10-03 03:03:03 | 2023-01-03 03:03:03 | 2023 |
Upvotes: 2