Reputation: 460
I have data frame like
ORDER STATUS DATE
23412 200 7-2-2020
23412 300 8-2-2020
23412 400 10-2-2020
91234 300 8-2-2020
91234 400 9-2-2020
671234 200 10-3-2020
I want add static row for each distinct order
with status =600 and date =31-12-9999
Expected output
ORDER STATUS DATE
23412 200 7-2-2020
23412 300 8-2-2020
23412 400 10-2-2020
23412 600 31-12-9999
91234 300 8-2-2020
91234 400 9-2-2020
91234 600 31-12-9999
671234 200 10-3-2020
671234 600 31-12-9999
How can this be done in pandas ?
Upvotes: 1
Views: 63
Reputation: 862671
Use DataFrame.drop_duplicates
with DataFrame.assign
for new DataFrame
, add to original by concat
, sort index values by DataFrame.sort_index
with only stable algo mergesort
and last convert index to default RangeIndex
by DataFrame.reset_index
with drop=True
:
df1 = df.drop_duplicates('ORDER', keep='last').assign(STATUS=600, DATE='31-12-9999')
df = pd.concat([df, df1]).sort_index(kind='mergesort').reset_index(drop=True)
print (df)
ORDER STATUS DATE
0 23412 200 7-2-2020
1 23412 300 8-2-2020
2 23412 400 10-2-2020
3 23412 600 31-12-9999
4 91234 300 8-2-2020
5 91234 400 9-2-2020
6 91234 600 31-12-9999
7 671234 200 10-3-2020
8 671234 600 31-12-9999
There is more solutions, each is different - @Quang Hoang sorting data (maybe problem, maybe not), @sammywemmy and my solution not sorting data. Also groupby
is obviously slow, so if performance is important better is avoid (if possible):
#some sample data, 100krows, 10k groups
np.random.seed(123)
N = 100000
L = ['7-2-2020', '8-2-2020', '10-2-2020', '8-2-2020', '9-2-2020', '10-3-2020']
df = pd.DataFrame({'ORDER': np.random.randint(10000, size=N),
'STATUS': np.random.randint(500, size=N),
'DATE':np.random.choice(L, N)}).sort_values('ORDER').reset_index(drop=True)
print (df)
In [391]: %timeit pd.concat([df, pd.DataFrame({'ORDER':df.ORDER.unique(), 'STATUS':600,'DATE':'31-12-9999'})],ignore_index=True).sort_values(['ORDER','STATUS'])
47.9 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [392]: %timeit pd.concat([df, df.drop_duplicates('ORDER', keep='last').assign(STATUS=600, DATE='31-12-9999')]).sort_index(kind='mergesort').reset_index(drop=True)
34.1 ms ± 543 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [393]: %timeit pd.concat([group.append({'ORDER':name,'STATUS':600, 'DATE':'31-12-9999'}, ignore_index=True) for name,group in df.groupby('ORDER')],ignore_index=True )
24 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#some sample data, 100krows, 100 groups
np.random.seed(123)
N = 100000
L = ['7-2-2020', '8-2-2020', '10-2-2020', '8-2-2020', '9-2-2020', '10-3-2020']
df = pd.DataFrame({'ORDER': np.random.randint(100, size=N),
'STATUS': np.random.randint(500, size=N),
'DATE':np.random.choice(L, N)}).sort_values('ORDER').reset_index(drop=True)
print (df)
In [398]: %timeit pd.concat([df, pd.DataFrame({'ORDER':df.ORDER.unique(), 'STATUS':600,'DATE':'31-12-9999'})],ignore_index=True).sort_values(['ORDER','STATUS'])
31 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [399]: %timeit pd.concat([df, df.drop_duplicates('ORDER', keep='last').assign(STATUS=600, DATE='31-12-9999')]).sort_index(kind='mergesort').reset_index(drop=True)
28 ms ± 354 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [400]: %timeit pd.concat([group.append({'ORDER':name,'STATUS':600, 'DATE':'31-12-9999'}, ignore_index=True) for name,group in df.groupby('ORDER')],ignore_index=True )
290 ms ± 46.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 2
Reputation: 28644
Another option is to iterate through the groups, append the name of each grouping, and concatenate :
pd.concat([group.append({'ORDER':name,
'STATUS':600,
'DATE':'31-12-9999'},
ignore_index=True)
for name,group in df.groupby('ORDER')],
ignore_index=True
)
ORDER STATUS DATE
0 23412 200 7-2-2020
1 23412 300 8-2-2020
2 23412 400 10-2-2020
3 23412 600 31-12-9999
4 91234 300 8-2-2020
5 91234 400 9-2-2020
6 91234 600 31-12-9999
7 671234 200 10-3-2020
8 671234 600 31-12-9999
Upvotes: 0
Reputation: 150745
Similar to jezrael's answer, but using pd.Series.unique()
and sort_values
:
(pd.concat([df, pd.DataFrame({'ORDER':df.ORDER.unique(),
'STATUS':600,
'DATE':'31-12-9999'})],
ignore_index=True)
.sort_values(['ORDER','STATUS'])
)
Output:
ORDER STATUS DATE
0 23412 200 7-2-2020
1 23412 300 8-2-2020
2 23412 400 10-2-2020
6 23412 600 31-12-9999
3 91234 300 8-2-2020
4 91234 400 9-2-2020
7 91234 600 31-12-9999
5 671234 200 10-3-2020
8 671234 600 31-12-9999
Upvotes: 0