pankaj
pankaj

Reputation: 460

How to add new rows of values for a distinct column value in pandas

I have data frame like

 ORDER    STATUS   DATE
23412      200      7-2-2020
23412      300      8-2-2020
23412      400      10-2-2020
91234      300      8-2-2020
91234      400      9-2-2020
671234     200      10-3-2020

I want add static row for each distinct order with status =600 and date =31-12-9999

Expected output

ORDER    STATUS   DATE
23412      200      7-2-2020
23412      300      8-2-2020
23412      400      10-2-2020
23412      600      31-12-9999
91234      300      8-2-2020
91234      400      9-2-2020
91234      600      31-12-9999
671234     200      10-3-2020
671234    600      31-12-9999

How can this be done in pandas ?

Upvotes: 1

Views: 63

Answers (3)

jezrael
jezrael

Reputation: 862671

Use DataFrame.drop_duplicates with DataFrame.assign for new DataFrame, add to original by concat, sort index values by DataFrame.sort_index with only stable algo mergesort and last convert index to default RangeIndex by DataFrame.reset_index with drop=True:

df1 = df.drop_duplicates('ORDER', keep='last').assign(STATUS=600, DATE='31-12-9999')
df = pd.concat([df, df1]).sort_index(kind='mergesort').reset_index(drop=True)
print (df)
    ORDER  STATUS        DATE
0   23412     200    7-2-2020
1   23412     300    8-2-2020
2   23412     400   10-2-2020
3   23412     600  31-12-9999
4   91234     300    8-2-2020
5   91234     400    9-2-2020
6   91234     600  31-12-9999
7  671234     200   10-3-2020
8  671234     600  31-12-9999

There is more solutions, each is different - @Quang Hoang sorting data (maybe problem, maybe not), @sammywemmy and my solution not sorting data. Also groupby is obviously slow, so if performance is important better is avoid (if possible):

#some sample data, 100krows, 10k groups
np.random.seed(123)
N = 100000
L = ['7-2-2020', '8-2-2020', '10-2-2020', '8-2-2020', '9-2-2020', '10-3-2020']

df = pd.DataFrame({'ORDER': np.random.randint(10000, size=N),
                   'STATUS': np.random.randint(500, size=N),
                   'DATE':np.random.choice(L, N)}).sort_values('ORDER').reset_index(drop=True)
print (df)

In [391]: %timeit pd.concat([df, pd.DataFrame({'ORDER':df.ORDER.unique(), 'STATUS':600,'DATE':'31-12-9999'})],ignore_index=True).sort_values(['ORDER','STATUS'])
47.9 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [392]: %timeit pd.concat([df, df.drop_duplicates('ORDER', keep='last').assign(STATUS=600, DATE='31-12-9999')]).sort_index(kind='mergesort').reset_index(drop=True)
34.1 ms ± 543 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [393]: %timeit pd.concat([group.append({'ORDER':name,'STATUS':600, 'DATE':'31-12-9999'}, ignore_index=True)  for name,group in df.groupby('ORDER')],ignore_index=True )
24 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#some sample data, 100krows, 100 groups
np.random.seed(123)
N = 100000
L = ['7-2-2020', '8-2-2020', '10-2-2020', '8-2-2020', '9-2-2020', '10-3-2020']

df = pd.DataFrame({'ORDER': np.random.randint(100, size=N),
                   'STATUS': np.random.randint(500, size=N),
                   'DATE':np.random.choice(L, N)}).sort_values('ORDER').reset_index(drop=True)
print (df)

In [398]: %timeit pd.concat([df, pd.DataFrame({'ORDER':df.ORDER.unique(), 'STATUS':600,'DATE':'31-12-9999'})],ignore_index=True).sort_values(['ORDER','STATUS'])
31 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [399]: %timeit pd.concat([df, df.drop_duplicates('ORDER', keep='last').assign(STATUS=600, DATE='31-12-9999')]).sort_index(kind='mergesort').reset_index(drop=True)
28 ms ± 354 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [400]: %timeit pd.concat([group.append({'ORDER':name,'STATUS':600, 'DATE':'31-12-9999'}, ignore_index=True)  for name,group in df.groupby('ORDER')],ignore_index=True )
290 ms ± 46.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 2

sammywemmy
sammywemmy

Reputation: 28644

Another option is to iterate through the groups, append the name of each grouping, and concatenate :

pd.concat([group.append({'ORDER':name,
                         'STATUS':600,
                         'DATE':'31-12-9999'},
                ignore_index=True) 
           for name,group in df.groupby('ORDER')],
          ignore_index=True
          )


    ORDER   STATUS  DATE
0   23412   200     7-2-2020
1   23412   300     8-2-2020
2   23412   400     10-2-2020
3   23412   600     31-12-9999
4   91234   300     8-2-2020
5   91234   400     9-2-2020
6   91234   600     31-12-9999
7   671234  200     10-3-2020
8   671234  600     31-12-9999

Upvotes: 0

Quang Hoang
Quang Hoang

Reputation: 150745

Similar to jezrael's answer, but using pd.Series.unique() and sort_values:

(pd.concat([df, pd.DataFrame({'ORDER':df.ORDER.unique(), 
                             'STATUS':600,
                             'DATE':'31-12-9999'})],
          ignore_index=True)
  .sort_values(['ORDER','STATUS'])
)

Output:

    ORDER  STATUS        DATE
0   23412     200    7-2-2020
1   23412     300    8-2-2020
2   23412     400   10-2-2020
6   23412     600  31-12-9999
3   91234     300    8-2-2020
4   91234     400    9-2-2020
7   91234     600  31-12-9999
5  671234     200   10-3-2020
8  671234     600  31-12-9999

Upvotes: 0

Related Questions