Reputation: 459
I have an example of input data that can be found here
I need to add 2 columns: "Begin_date" & "End_date" based on the data in each row:
Example of output:
I will appreciate any ideas :) Thank you
Upvotes: 2
Views: 214
Reputation: 862431
Idea is convert non datetimelike columns to MultiIndex
by DataFrame.set_index
and then convert columns to datetimes:
df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')
Then create new columns by DataFrame.assign
, filter January
columns, compare non missing values and get first value by DataFrame.idxmax
, then convert to output format by Series.dt.strftime
for begin
, for end
first swap order with indexing ::-1
and get last non missing values, convert to last days of month and last get default value if last column is not missing value by Series.where
:
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1).dt.strftime('%Y%m%d')
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
end = end1.dt.strftime('%Y%m%d').where(df.iloc[:, -1].isna(), '99991231')
df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
print (df)
id region 201801 201802 201803 201804 201805 201806 201807 \
0 100001 628 NaN NaN NaN NaN NaN NaN NaN
1 100002 1149 27.0 24.0 27.0 25.0 24.0 26.0 27.0
2 100003 1290 26.0 26.0 26.0 26.0 23.0 27.0 27.0
3 100004 955 25.0 26.0 26.0 24.0 24.0 26.0 28.0
4 100005 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
5 100006 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
201808 ... 201811 201812 201901 201902 201903 201904 201905 \
0 NaN ... 24 20.0 26.0 24.0 26.0 26.0 26.0
1 28.0 ... 24 21.0 26.0 25.0 27.0 24.0 26.0
2 NaN ... 28 NaN 28.0 26.0 27.0 27.0 27.0
3 27.0 ... 24 12.0 NaN NaN NaN NaN NaN
4 26.0 ... 25 NaN NaN NaN NaN NaN NaN
5 26.0 ... 25 23.0 25.0 17.0 NaN NaN NaN
201906 date_begin date_end
0 23.0 20190101 99991231
1 24.0 20180101 99991231
2 25.0 20180101 99991231
3 NaN 20180101 20181231
4 NaN 20180101 20181130
5 NaN 20180101 20190228
[6 rows x 22 columns]
Also is possible create valid datatimes in both new column by Timestamp.max
with Timestamp.floor
:
df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1)
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
end = end1.where(df.iloc[:, -1].isna(), pd.Timestamp.max.floor('d'))
df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
print (df)
id region 201801 201802 201803 201804 201805 201806 201807 \
0 100001 628 NaN NaN NaN NaN NaN NaN NaN
1 100002 1149 27.0 24.0 27.0 25.0 24.0 26.0 27.0
2 100003 1290 26.0 26.0 26.0 26.0 23.0 27.0 27.0
3 100004 955 25.0 26.0 26.0 24.0 24.0 26.0 28.0
4 100005 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
5 100006 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
201808 ... 201811 201812 201901 201902 201903 201904 201905 \
0 NaN ... 24 20.0 26.0 24.0 26.0 26.0 26.0
1 28.0 ... 24 21.0 26.0 25.0 27.0 24.0 26.0
2 NaN ... 28 NaN 28.0 26.0 27.0 27.0 27.0
3 27.0 ... 24 12.0 NaN NaN NaN NaN NaN
4 26.0 ... 25 NaN NaN NaN NaN NaN NaN
5 26.0 ... 25 23.0 25.0 17.0 NaN NaN NaN
201906 date_begin date_end
0 23.0 2019-01-01 2262-04-11
1 24.0 2018-01-01 2262-04-11
2 25.0 2018-01-01 2262-04-11
3 NaN 2018-01-01 2018-12-31
4 NaN 2018-01-01 2018-11-30
5 NaN 2018-01-01 2019-02-28
[6 rows x 22 columns]
Upvotes: 2
Reputation: 3739
use pd.melt()
sort data by id and date
import pandas as pd
import numpy as np
from pandas.tseries.offsets import MonthEnd
df = pd.read_excel("input.xlsx")
max_date = df.columns[-1]
res = pd.melt(df, id_vars=['id', 'region'], value_vars=df.columns[2:])
res.dropna(subset=['value'], inplace=True)
res.sort_values(by=['id', 'variable'], ascending=[True, True], inplace=True)
minimum_date = res.drop_duplicates(subset=['id'], keep='first')
maximum_date = res.drop_duplicates(subset=['id'], keep='last')
minimum_date.rename(columns={'variable': 'start_date'}, inplace=True)
maximum_date.rename(columns={'variable': 'end_date'}, inplace=True)
df = pd.merge(df, minimum_date[['id', 'start_date']], on=['id'], how='left')
df = pd.merge(df, maximum_date[['id', 'end_date']], on=['id'], how='left')
df['end_date'] = np.where(df['end_date']==max_date,
"99991231",df['end_date'])
df['start_date'] = (pd.to_datetime(df['start_date'],format="%Y%m",errors='coerce') +MonthEnd(1)).astype(str)
df['end_date'] = (pd.to_datetime(df['end_date'],format="%Y%m",errors='coerce') +MonthEnd(1)).astype(str)
df['end_date'] = np.where(df['end_date']=='NaT',
"99991231",df['end_date'])
print(df)
id region 201801 201802 ... 201905 201906 start_date end_date
0 100001 628 NaN NaN ... 26.0 23.0 2018-09-30 99991231
1 100002 1149 27.0 24.0 ... 26.0 24.0 2018-01-31 99991231
2 100003 1290 26.0 26.0 ... 27.0 25.0 2018-01-31 99991231
3 100004 955 25.0 26.0 ... NaN NaN 2018-01-31 2018-12-31
4 100005 1397 15.0 25.0 ... NaN NaN 2018-01-31 2018-11-30
5 100006 1397 15.0 25.0 ... NaN NaN 2018-01-31 2019-02-28
Upvotes: 3