XaviorL
XaviorL

Reputation: 373

Pandas: Create a new row within each group with conditions

I have a dateframe (df),

df = pd.DataFrame({
    'ID': ['James', 'James', 'James','Max', 'Max', 'Max', 'Max','Park','Tom', 'Tom', 'Tom', 'Tom','Wong'],
    'From_num': [78, 420, 'Started', 298, 36, 298, 'Started', 'Started', 60, 520, 99, 'Started', 'Started'],
    'To_num': [96, 78, 420, 36, 78, 36, 298, 311, 150, 520, 78, 99, 39],
    'Date': ['2020-05-12', '2020-02-02', '2019-06-18',
             '2019-06-20', '2019-01-30', '2018-10-23',
             '2018-08-29', '2020-05-21', '2019-11-22',
             '2019-08-26', '2018-12-11', '2018-10-09', '2019-02-01']})

And it is like this:

       ID From_num  To_num        Date
0   James       78      96  2020-05-12
1   James      420      78  2020-02-02
2   James  Started     420  2019-06-18
3     Max      298      36  2019-06-20
4     Max       36      78  2019-01-30
5     Max      298      36  2018-10-23
6     Max  Started     298  2018-08-29
7    Park  Started     311  2020-05-21
8     Tom       60     150  2019-11-22
9     Tom      520     520  2019-08-26
10    Tom       99      78  2018-12-11
11    Tom  Started      99  2018-10-09
12   Wong  Started      39  2019-02-01

For each person (group), I wish to create a new duplicate row on the first row within each group ('ID'), the values for the created row in column'ID', 'From_num' and 'To_num' should be the same as the previous first row, but the 'Date' value is the old 1st row's Date plus one day e.g. for James, the newly created row values is: 'James' '78' '96' '2020-05-13', same as the rest data, so my expected result is:

       ID From_num  To_num        Date
0   James       78      96  2020-05-13  # row added, Date + 1
1   James       78      96  2020-05-12
2   James      420      78  2020-02-02
3   James  Started     420  2019-06-18
4     Max      298      36  2019-06-21  # row added, Date + 1
5     Max      298      36  2019-06-20
6     Max       36      78  2019-01-30
7     Max      298      36  2018-10-23
8     Max  Started     298  2018-08-29
9    Park  Started     311  2020-05-22  # Row added, Date + 1
10   Park  Started     311  2020-05-21
11    Tom       60     150  2019-11-23  # Row added, Date + 1
12    Tom       60     150  2019-11-22
13    Tom      520     520  2019-08-26
14    Tom       99      78  2018-12-11
15    Tom  Started      99  2018-10-09
16   Wong  Started      39  2019-02-02  # Row added Date + 1
17   Wong  Started      39  2019-02-01

I wish to the order/sequence is same as my expected result. If you have any good ideas, please help. Many thanks

Upvotes: 3

Views: 125

Answers (1)

Shubham Sharma
Shubham Sharma

Reputation: 71689

Use:

df['Date'] = pd.to_datetime(df['Date'])
df['order'] = df.groupby('ID').cumcount().add(1)

df1 = (
    df.groupby('ID', as_index=False).first()
    .assign(Date=lambda x: x['Date'] + pd.Timedelta(days=1), order=0)
)

df1 = pd.concat([df, df1]).sort_values(['ID', 'order'], ignore_index=True).drop('order', 1)

Details:

Convert the Date column to pandas datetime series and use DataFrame.groupby on column ID and groupby.cumcount to impose the total ordering in each groups in the dataframe.

print(df)
       ID From_num  To_num       Date  order
0   James       78      96 2020-05-13      1
1   James       78      96 2020-05-12      2
2   James      420      78 2020-02-02      3
3   James  Started     420 2019-06-18      4
4     Max      298      36 2019-06-21      1
5     Max      298      36 2019-06-20      2
6     Max       36      78 2019-01-30      3
7     Max      298      36 2018-10-23      4
8     Max  Started     298 2018-08-29      5
9    Park  Started     311 2020-05-22      1
10   Park  Started     311 2020-05-21      2
11    Tom       60     150 2019-11-23      1
12    Tom       60     150 2019-11-22      2
13    Tom      520     520 2019-08-26      3
14    Tom       99      78 2018-12-11      4
15    Tom  Started      99 2018-10-09      5
16   Wong  Started      39 2019-02-02      1
17   Wong  Started      39 2019-02-01      2

Create a new dataframe df1 by using DataFrame.groupby on column ID and aggregate using groupby.first and assigning order=0 and incrementing Date by pd.Timedelta of 1 days.

print(df1)
      ID From_num  To_num       Date  order
0  James       78      96 2020-05-14      0 # Date incremented by 1 days
1    Max      298      36 2019-06-22      0 # and ordering added
2   Park  Started     311 2020-05-23      0
3    Tom       60     150 2019-11-24      0
4   Wong  Started      39 2019-02-03      0

Using pd.concat concat the dataframes df and df1 and use DataFrame.sort_values to sort the dataframe on columns ID and order.

print(df1)
       ID From_num  To_num       Date
0   James       78      96 2020-05-14
1   James       78      96 2020-05-13
2   James       78      96 2020-05-12
3   James      420      78 2020-02-02
4   James  Started     420 2019-06-18
5     Max      298      36 2019-06-22
6     Max      298      36 2019-06-21
7     Max      298      36 2019-06-20
8     Max       36      78 2019-01-30
9     Max      298      36 2018-10-23
10    Max  Started     298 2018-08-29
11   Park  Started     311 2020-05-23
12   Park  Started     311 2020-05-22
13   Park  Started     311 2020-05-21
14    Tom       60     150 2019-11-24
15    Tom       60     150 2019-11-23
16    Tom       60     150 2019-11-22
17    Tom      520     520 2019-08-26
18    Tom       99      78 2018-12-11
19    Tom  Started      99 2018-10-09
20   Wong  Started      39 2019-02-03
21   Wong  Started      39 2019-02-02
22   Wong  Started      39 2019-02-01

Upvotes: 1

Related Questions