chris_olv
chris_olv

Reputation: 122

Duplicate row in Pandas dataframe based on condition, then update a new column based on condition

I have a dataframe that looks like :

enter image description here

df = pd.DataFrame({'qty': [10,7,2,1],
                   'status 1': [5,2,2,0],
                   'status 2': [3,2,0,1],
                   'status 3': [2,3,0,0]
                   })

Each row has a qty of items. These items have one status (1,2 or 3).

So qty = sum of values of status 1,2,3.

I would like to :

  1. Duplicate each row by the "qty" column

  2. Then edit 3 status (or update a new column), to get just 1 status.

The output should look like this:

enter image description here

Edit: the order is not important, but I will need to keep other columns of my initial df.

My (incomplete) solution so far - I found a way to duplicate the rows using this :

df2= df2.loc[df2.index.repeat(df2['qty'])].reset_index(drop=True)

But I can't find a way to fill the status.

Do I need to use a for loop approach to fill the status?

Should I do this transform in 1 or 2 steps?

Something like: for each initial row, the n first rows take the first status, where n is the value of status 2....

The output could maybe looks like :

enter image description here

Edit1 : Thank you for your answers !

Last question : now I'm trying to integrate this to my actual df. What is the best approach to apply these methods to my df who contains many other column ?

I will answer my last question :

  1. Split df in 2: dfstatus and dfwithoutstatus, keeping the qty column in both

  2. Apply one of your method on the dfstatus

  3. Apply my method on the dfwithoutstatus (a simple duplication)

  4. Merge on index

Thank you all for your answers.

Best

Upvotes: 4

Views: 1854

Answers (5)

rhug123
rhug123

Reputation: 8768

Here is a way:

(df[['qty']].join(df.iloc[:,1:].stack()
.map(lambda x: list(range(1,x+1)))
.explode()
.dropna()
.fillna(0)
.to_frame()
.assign(cc = lambda x: x.groupby(level=0).cumcount())
.set_index('cc',append=True)[0]
.unstack(level=1)
.droplevel(1)
.fillna(0)
.astype(bool)
.astype(int))
.reset_index(drop=True))

or with using np.identity()

cols = ['status 1','status 2','status 3']

(df[['qty']].join(
    df[cols]
    .stack()
    .groupby(level=0)
    .apply(lambda x: pd.DataFrame(np.repeat(np.identity(len(x)),x,axis=0)))
    .droplevel(1)
    .set_axis(cols,axis=1)))

or

l = df[cols].apply(lambda x: np.repeat(np.identity(len(x)),x,axis=0),axis=1).rename('t').explode()
df[['qty']].join(pd.DataFrame(l.tolist(),index = l.index))

or

df2 = df.filter(like = 's')

df[['qty']].join(pd.DataFrame(np.repeat(np.tile(np.eye(df2.shape[-1]),(len(df2),1)),df2.stack(),axis=0),index = np.repeat(np.arange(len(df2)),df2.sum(axis=1)))).reset_index(drop=True)

Output:

    qty  status 1  status 2  status 3
0    10         1         0         0
1    10         1         0         0
2    10         1         0         0
3    10         1         0         0
4    10         1         0         0
5    10         0         1         0
6    10         0         1         0
7    10         0         1         0
8    10         0         0         1
9    10         0         0         1
10    7         1         0         0
11    7         1         0         0
12    7         0         1         0
13    7         0         1         0
14    7         0         0         1
15    7         0         0         1
16    7         0         0         1
17    2         1         0         0
18    2         1         0         0
19    1         0         1         0

Upvotes: 2

Laurent B.
Laurent B.

Reputation: 2263

Proposed code :

This approach uses a trick (matrix identity)

import pandas as pd
import numpy as np

df = pd.DataFrame({'qty': [10,7,2,1],
                   'status 1': [5,2,2,0],
                   'status 2': [3,2,0,1],
                   'status 3': [2,3,0,0]
                   })

cols = df.columns[1:4]
# Define a boolean identity matrix (Trick 1)
ident = pd.DataFrame(np.eye(len(cols)).astype(bool))

# Trick 2 is to repeat rows 3 times before calculation
rep = df.loc[df.index.repeat(3)]

def func(g):
   ident.index, ident.columns = g.index, g.columns
   return (g.where(ident, 0)
            .applymap(lambda e: int(e)*[1] if e>0 else e))

# Break in matrix groups 3X3
rep[cols] = rep.groupby(rep.index).apply(lambda g: func(g[cols]))

# Explode lists
for c in rep.columns:
   rep = rep.explode(c)

# Deletes 0-valued rows
rep = rep[rep[cols].sum(axis=1) > 0].reset_index(drop=True)

print(rep)

Results :

    qty status 1 status 2 status 3
0    10        1        0        0
1    10        1        0        0
2    10        1        0        0
3    10        1        0        0
4    10        1        0        0
5    10        0        1        0
6    10        0        1        0
7    10        0        1        0
8    10        0        0        1
9    10        0        0        1
10    7        1        0        0
11    7        1        0        0
12    7        0        1        0
13    7        0        1        0
14    7        0        0        1
15    7        0        0        1
16    7        0        0        1
17    2        1        0        0
18    2        1        0        0
19    1        0        1        0

Upvotes: 1

Riccardo Bucco
Riccardo Bucco

Reputation: 15364

Here is a possible solution:

import numpy as np
import pandas as pd

E = pd.DataFrame(np.eye(df.shape[1] - 1, dtype=int))
result = pd.DataFrame(
    df['qty'].reindex(df.index.repeat(df['qty'])).reset_index(drop=True),
)
result[df.columns[1:]] = pd.concat(
    [E.reindex(E.index.repeat(df.iloc[i, 1:]))
     for i in range(len(df))],
).reset_index(
    drop=True,
)

Here is the result:

>>> result
    qty  status 1  status 2  status 3
0    10         1         0         0
1    10         1         0         0
2    10         1         0         0
3    10         1         0         0
4    10         1         0         0
5    10         0         1         0
6    10         0         1         0
7    10         0         1         0
8    10         0         0         1
9    10         0         0         1
10    7         1         0         0
11    7         1         0         0
12    7         0         1         0
13    7         0         1         0
14    7         0         0         1
15    7         0         0         1
16    7         0         0         1
17    2         1         0         0
18    2         1         0         0
19    1         0         1         0

Upvotes: 3

Scott Boston
Scott Boston

Reputation: 153460

You can using this.
Instead of repeating on df['qty'], repeat on the status themselves, the concatenate the results and sort:

df = pd.DataFrame({'qty': [10,7,2,1],
                   'status 1': [5,2,2,0],
                   'status 2': [3,2,0,1],
                   'status 3': [2,3,0,0]
                   })
cols = ['status 1', 'status 2', 'status 3']

df_out = pd.concat([df.loc[df.index.repeat(df[col]), [col]+['qty']].reset_index()
 for col in cols], ignore_index=True).sort_values(['index']+cols)

df_out[cols] = df_out[cols].notna().astype(int)

df_out[['qty']+cols]

Output:

    qty  status 1  status 2  status 3
0    10         1         0         0
1    10         1         0         0
2    10         1         0         0
3    10         1         0         0
4    10         1         0         0
9    10         0         1         0
10   10         0         1         0
11   10         0         1         0
15   10         0         0         1
16   10         0         0         1
5     7         1         0         0
6     7         1         0         0
12    7         0         1         0
13    7         0         1         0
17    7         0         0         1
18    7         0         0         1
19    7         0         0         1
7     2         1         0         0
8     2         1         0         0
14    1         0         1         0

Upvotes: 2

Jacob Kearney
Jacob Kearney

Reputation: 431

hardcoded, but a more human readable approach:

import pandas as pd

df = pd.DataFrame({'qty': [10,7,2,1],
                   'status 1': [5,2,2,0],
                   'status 2': [3,2,0,1],
                   'status 3': [2,3,0,0]
                   })
df2 = pd.DataFrame(data=None, columns=df.columns)

cnt = 0
for idx, row in df.iterrows():
    s_one = row['status 1']
    s_two = row['status 2']
    s_three = row['status 3']
    while s_one > 0:
        df2.loc[cnt] = [row['qty'],1,0,0]
        s_one-=1
        cnt+=1
    while s_two > 0:
        df2.loc[cnt] = [row['qty'],0,1,0]
        s_two-=1
        cnt+=1
    while s_three > 0:
        df2.loc[cnt] = [row['qty'],0,0,1]
        s_three-=1
        cnt+=1

print(df2)

same output.

Upvotes: 1

Related Questions