Vero
Vero

Reputation: 459

"Begin" & "End" date based on values in a row

I have an example of input data that can be found here

input enter image description here

I need to add 2 columns: "Begin_date" & "End_date" based on the data in each row:

Example of output:

enter image description here

I will appreciate any ideas :) Thank you

Upvotes: 2

Views: 214

Answers (2)

jezrael
jezrael

Reputation: 862431

Idea is convert non datetimelike columns to MultiIndex by DataFrame.set_index and then convert columns to datetimes:

df = pd.read_excel('input.xlsx')

df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')

Then create new columns by DataFrame.assign, filter January columns, compare non missing values and get first value by DataFrame.idxmax, then convert to output format by Series.dt.strftime for begin, for end first swap order with indexing ::-1 and get last non missing values, convert to last days of month and last get default value if last column is not missing value by Series.where:

begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1).dt.strftime('%Y%m%d')
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()

end = end1.dt.strftime('%Y%m%d').where(df.iloc[:, -1].isna(), '99991231')

df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end =  end).reset_index()

print (df)
       id  region  201801  201802  201803  201804  201805  201806  201807  \
0  100001     628     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
1  100002    1149    27.0    24.0    27.0    25.0    24.0    26.0    27.0   
2  100003    1290    26.0    26.0    26.0    26.0    23.0    27.0    27.0   
3  100004     955    25.0    26.0    26.0    24.0    24.0    26.0    28.0   
4  100005    1397    15.0    25.0    26.0    24.0    21.0    27.0    27.0   
5  100006    1397    15.0    25.0    26.0    24.0    21.0    27.0    27.0   

   201808  ...  201811  201812  201901  201902  201903  201904  201905  \
0     NaN  ...      24    20.0    26.0    24.0    26.0    26.0    26.0   
1    28.0  ...      24    21.0    26.0    25.0    27.0    24.0    26.0   
2     NaN  ...      28     NaN    28.0    26.0    27.0    27.0    27.0   
3    27.0  ...      24    12.0     NaN     NaN     NaN     NaN     NaN   
4    26.0  ...      25     NaN     NaN     NaN     NaN     NaN     NaN   
5    26.0  ...      25    23.0    25.0    17.0     NaN     NaN     NaN   

   201906  date_begin  date_end  
0    23.0    20190101  99991231  
1    24.0    20180101  99991231  
2    25.0    20180101  99991231  
3     NaN    20180101  20181231  
4     NaN    20180101  20181130  
5     NaN    20180101  20190228  

[6 rows x 22 columns]

Also is possible create valid datatimes in both new column by Timestamp.max with Timestamp.floor:

df = pd.read_excel('input.xlsx')

df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')

begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1)
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()

end = end1.where(df.iloc[:, -1].isna(), pd.Timestamp.max.floor('d'))

df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()

print (df)
       id  region  201801  201802  201803  201804  201805  201806  201807  \
0  100001     628     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
1  100002    1149    27.0    24.0    27.0    25.0    24.0    26.0    27.0   
2  100003    1290    26.0    26.0    26.0    26.0    23.0    27.0    27.0   
3  100004     955    25.0    26.0    26.0    24.0    24.0    26.0    28.0   
4  100005    1397    15.0    25.0    26.0    24.0    21.0    27.0    27.0   
5  100006    1397    15.0    25.0    26.0    24.0    21.0    27.0    27.0   

   201808  ...  201811  201812  201901  201902  201903  201904  201905  \
0     NaN  ...      24    20.0    26.0    24.0    26.0    26.0    26.0   
1    28.0  ...      24    21.0    26.0    25.0    27.0    24.0    26.0   
2     NaN  ...      28     NaN    28.0    26.0    27.0    27.0    27.0   
3    27.0  ...      24    12.0     NaN     NaN     NaN     NaN     NaN   
4    26.0  ...      25     NaN     NaN     NaN     NaN     NaN     NaN   
5    26.0  ...      25    23.0    25.0    17.0     NaN     NaN     NaN   

   201906  date_begin   date_end  
0    23.0  2019-01-01 2262-04-11  
1    24.0  2018-01-01 2262-04-11  
2    25.0  2018-01-01 2262-04-11  
3     NaN  2018-01-01 2018-12-31  
4     NaN  2018-01-01 2018-11-30  
5     NaN  2018-01-01 2019-02-28  

[6 rows x 22 columns]

Upvotes: 2

tawab_shakeel
tawab_shakeel

Reputation: 3739

use pd.melt()
sort data by id and date

import pandas as pd
import numpy as np
from pandas.tseries.offsets import MonthEnd

df = pd.read_excel("input.xlsx")
max_date = df.columns[-1]

res = pd.melt(df, id_vars=['id', 'region'], value_vars=df.columns[2:])
res.dropna(subset=['value'], inplace=True)

res.sort_values(by=['id', 'variable'], ascending=[True, True], inplace=True)

minimum_date = res.drop_duplicates(subset=['id'], keep='first')
maximum_date = res.drop_duplicates(subset=['id'], keep='last')

minimum_date.rename(columns={'variable': 'start_date'}, inplace=True)
maximum_date.rename(columns={'variable': 'end_date'}, inplace=True)

df = pd.merge(df, minimum_date[['id', 'start_date']], on=['id'], how='left')
df = pd.merge(df, maximum_date[['id', 'end_date']], on=['id'], how='left')

df['end_date'] = np.where(df['end_date']==max_date,
                          "99991231",df['end_date'])

df['start_date'] = (pd.to_datetime(df['start_date'],format="%Y%m",errors='coerce') +MonthEnd(1)).astype(str)
df['end_date'] = (pd.to_datetime(df['end_date'],format="%Y%m",errors='coerce') +MonthEnd(1)).astype(str)

df['end_date'] = np.where(df['end_date']=='NaT',
                          "99991231",df['end_date'])
print(df)

      id  region  201801  201802  ...  201905  201906  start_date    end_date
0  100001     628     NaN     NaN  ...    26.0    23.0  2018-09-30    99991231
1  100002    1149    27.0    24.0  ...    26.0    24.0  2018-01-31    99991231
2  100003    1290    26.0    26.0  ...    27.0    25.0  2018-01-31    99991231
3  100004     955    25.0    26.0  ...     NaN     NaN  2018-01-31  2018-12-31
4  100005    1397    15.0    25.0  ...     NaN     NaN  2018-01-31  2018-11-30
5  100006    1397    15.0    25.0  ...     NaN     NaN  2018-01-31  2019-02-28

Upvotes: 3

Related Questions