Reputation: 13
This is my dataframe:
ID number | Date purchase |
---|---|
1 | 2022-05-01 |
1 | 2021-03-03 |
1 | 2020-01-03 |
2 | 2019-01-03 |
2 | 2018-01-03 |
I want to get a horizontal dataframe with alle the dates in seperate columns per ID number.
So like this:
ID number | Date 1 | Date 2 | Date 3 |
---|---|---|---|
1 | 2022-05-01 | 2021-03-03 | 2020-01-03 |
2 | 2019-01-03 | 2018-01-03 |
After I did this I want to calculate the difference between these dates.
Upvotes: 1
Views: 153
Reputation: 1
initial situation:
d = {'IdNumber': [1,1,1,2,2], 'Date': ['2022-05-01', '2021-03-03','2020-01-03','2019-01-03','2018-01-03']}
df = pd.DataFrame(data=d)
date conversion:
df['Date'] = pd.to_datetime(df['Date'])
creating new column:
df1=df.assign(Col=lambda x: x.groupby('IdNumber').cumcount())
pivoting:
df1=df1.pivot(index=["IdNumber"],columns=["Col"],values="Date")
reset index:
df1 = df1.reset_index(level=0)
rename column:
for i in range(1,len(df1.columns)):
df1.columns.values[i]='Date{0}'.format(i)
final result:
Col IdNumber Date1 Date2 Date3
0 1 2022-05-01 2021-03-03 2020-01-03
1 2 2019-01-03 2018-01-03 NaT
Upvotes: 0
Reputation: 862471
First step is GroupBy.cumcount
with DataFrame.pivot
:
df['Date purchase'] = pd.to_datetime(df['Date purchase'])
df1 = (df.sort_values(by=['ID number', 'Date purchase'], ascending=[True, False])
.assign(g=lambda x: x.groupby('ID number').cumcount())
.pivot('ID number','g','Date purchase')
.rename(columns = lambda x: f'Date {x + 1}'))
print (df1)
g Date 1 Date 2 Date 3
ID number
1 2022-05-01 2021-03-03 2020-01-03
2 2019-01-03 2018-01-03 NaT
Then for differencies between columns use DataFrame.diff
:
df2 = df1.diff(-1, axis=1)
print (df2)
g Date 1 Date 2 Date 3
ID number
1 424 days 425 days NaT
2 365 days NaT NaT
If need averages:
df3 = df1.apply(pd.Series.mean, axis=1).reset_index(name='Avg Dates').rename_axis(None, axis=1)
print (df3)
ID number Avg Dates
0 1 2021-03-02 16:00:00
1 2 2018-07-04 12:00:00
Upvotes: 1
Reputation: 2120
Could you do something like this?
def format_dataframe(df):
"""
Function formats the dataframe to the following:
| ID number| Date 1 | Date 2 | Date 3 |
| -------- | -------------- | -------------- | -------------- |
| 1 | 2022-05-01 | 2021-03-03 | 2020-01-03 |
| 2 | 2019-01-03 | 2018-01-03 | |
"""
df = df.sort_values(by=['ID number', 'Date purchase'])
df = df.drop_duplicates(subset=['ID number'], keep='first')
df = df.drop(columns=['Date purchase'])
df = df.rename(columns={'ID number': 'ID number', 'Date 1': 'Date 1', 'Date 2': 'Date 2', 'Date 3': 'Date 3'})
return df
Upvotes: 0