Reputation: 122
I have a dataframe that looks like:
df = pd.DataFrame({'qty': [10, 7, 2, 1],
                   'status 1': [5, 2, 2, 0],
                   'status 2': [3, 2, 0, 1],
                   'status 3': [2, 3, 0, 0]
                   })
Each row has a qty of items, and each item has exactly one status (1, 2 or 3),
so qty = sum of the values of status 1, 2 and 3.
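For instance, this invariant can be checked on the sample frame above:

cols = ['status 1', 'status 2', 'status 3']
# each row's qty equals the sum of its three status counts (10 = 5+3+2, 7 = 2+2+3, ...)
assert (df['qty'] == df[cols].sum(axis=1)).all()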
I would like to:
Duplicate each row by the "qty" column
Then replace the 3 status columns (or add a new column) so that each duplicated row carries exactly 1 status.
The output should look like this:
Edit: the order is not important, but I will need to keep the other columns of my initial df.
My (incomplete) solution so far - I found a way to duplicate the rows using this:
df2 = df2.loc[df2.index.repeat(df2['qty'])].reset_index(drop=True)
But I can't find a way to fill the status columns.
Do I need a for-loop approach to fill the status?
Should I do this transform in 1 or 2 steps?
Something like: for each initial row, the first n duplicated rows take status 1 (where n is the value of status 1), the next rows take status 2, and so on.
The intermediate output could maybe look like the result of the rough draft below:
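A rough, hardcoded draft of that idea, using the df above (the column names are written out by hand, and this is surely not the idiomatic way):

cols = ['status 1', 'status 2', 'status 3']

# duplicate each row qty times, then overwrite the counts with one-hot flags
df2 = df.loc[df.index.repeat(df['qty'])].reset_index(drop=True)
df2[cols] = 0

i = 0
for _, row in df.iterrows():              # walk the original rows
    for col in cols:                      # each status block in turn
        for _ in range(int(row[col])):    # that many duplicated rows get this status
            df2.loc[i, col] = 1
            i += 1

print(df2)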
Edit1: Thank you for your answers!
Last question: now I'm trying to integrate this into my actual df. What is the best approach for applying these methods to my df, which contains many other columns?
I will answer my last question myself:
Split df in 2: dfstatus and dfwithoutstatus, keeping the qty column in both
Apply one of your methods to dfstatus
Apply my method (a simple duplication) to dfwithoutstatus
Merge on index (a sketch follows below)
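A minimal sketch of that split / duplicate / merge idea. The extra column other and the one_hot_status helper are purely illustrative: other stands in for the many real columns, and one_hot_status stands in for any of the one-hot methods from the answers (here a simple cumcount comparison):

import numpy as np
import pandas as pd

# 'other' is an illustrative stand-in for the many extra columns of the real df
df = pd.DataFrame({'other': ['a', 'b', 'c', 'd'],
                   'qty': [10, 7, 2, 1],
                   'status 1': [5, 2, 2, 0],
                   'status 2': [3, 2, 0, 1],
                   'status 3': [2, 3, 0, 0]})

cols = ['status 1', 'status 2', 'status 3']

def one_hot_status(d):
    # stand-in for any of the answers: one row per item, one-hot status flags
    out = d.loc[d.index.repeat(d['qty'])].copy()
    pos = out.groupby(level=0).cumcount()                        # position inside each original row
    status = out[cols].cumsum(axis=1).gt(pos, axis=0).idxmax(axis=1)
    out[cols] = (np.array(cols) == status.to_numpy()[:, None]).astype(int)
    return out.reset_index(drop=True)

# dfstatus: qty + the status columns, expanded to one row per item
dfstatus = one_hot_status(df[['qty'] + cols])

# dfwithoutstatus: every other column (including qty), simply duplicated qty times
dfwithoutstatus = (df.drop(columns=cols)
                     .loc[df.index.repeat(df['qty'])]
                     .reset_index(drop=True))

# merge on the index: both frames now have one row per item, in the same order
result = dfwithoutstatus.join(dfstatus.drop(columns='qty'))
print(result)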
Thank you all for your answers.
Best
Upvotes: 4
Views: 1854
Reputation: 8768
Here is a way:
(df[['qty']].join(df.iloc[:, 1:].stack()                   # long Series of the status counts
                    .map(lambda x: list(range(1, x + 1)))  # a list with one element per counted item
                    .explode()                             # one row per item
                    .dropna()                              # zero counts give empty lists -> NaN, dropped here
                    .fillna(0)
                    .to_frame()
                    .assign(cc=lambda x: x.groupby(level=0).cumcount())  # position within the original row
                    .set_index('cc', append=True)[0]
                    .unstack(level=1)                      # statuses back as columns
                    .droplevel(1)
                    .fillna(0)
                    .astype(bool)
                    .astype(int))                          # one-hot flags
   .reset_index(drop=True))
Or, using np.identity():
cols = ['status 1', 'status 2', 'status 3']

(df[['qty']].join(
    df[cols]
      .stack()
      .groupby(level=0)
      .apply(lambda x: pd.DataFrame(np.repeat(np.identity(len(x)), x, axis=0)))
      .droplevel(1)
      .set_axis(cols, axis=1)))
Or, applying np.repeat/np.identity row-wise and exploding:
l = (df[cols].apply(lambda x: np.repeat(np.identity(len(x)), x, axis=0), axis=1)
             .rename('t')
             .explode())

df[['qty']].join(pd.DataFrame(l.tolist(), index=l.index))
Or, fully vectorized with np.tile and np.repeat:
df2 = df.filter(like='s')

(df[['qty']].join(pd.DataFrame(np.repeat(np.tile(np.eye(df2.shape[-1]), (len(df2), 1)),
                                         df2.stack(), axis=0),
                               index=np.repeat(np.arange(len(df2)), df2.sum(axis=1))))
            .reset_index(drop=True))
Output:
qty status 1 status 2 status 3
0 10 1 0 0
1 10 1 0 0
2 10 1 0 0
3 10 1 0 0
4 10 1 0 0
5 10 0 1 0
6 10 0 1 0
7 10 0 1 0
8 10 0 0 1
9 10 0 0 1
10 7 1 0 0
11 7 1 0 0
12 7 0 1 0
13 7 0 1 0
14 7 0 0 1
15 7 0 0 1
16 7 0 0 1
17 2 1 0 0
18 2 1 0 0
19 1 0 1 0
Upvotes: 2
Reputation: 2263
Proposed code:
This approach uses an identity-matrix trick.
import pandas as pd
import numpy as np

df = pd.DataFrame({'qty': [10, 7, 2, 1],
                   'status 1': [5, 2, 2, 0],
                   'status 2': [3, 2, 0, 1],
                   'status 3': [2, 3, 0, 0]
                   })

cols = df.columns[1:4]

# Define a boolean identity matrix (Trick 1)
ident = pd.DataFrame(np.eye(len(cols)).astype(bool))

# Trick 2 is to repeat each row 3 times before the calculation
rep = df.loc[df.index.repeat(3)]

def func(g):
    ident.index, ident.columns = g.index, g.columns
    return (g.where(ident, 0)
             .applymap(lambda e: int(e) * [1] if e > 0 else e))

# Break into 3x3 matrix groups
rep[cols] = rep.groupby(rep.index).apply(lambda g: func(g[cols]))

# Explode the lists
for c in rep.columns:
    rep = rep.explode(c)

# Delete 0-valued rows
rep = rep[rep[cols].sum(axis=1) > 0].reset_index(drop=True)

print(rep)
Results:
qty status 1 status 2 status 3
0 10 1 0 0
1 10 1 0 0
2 10 1 0 0
3 10 1 0 0
4 10 1 0 0
5 10 0 1 0
6 10 0 1 0
7 10 0 1 0
8 10 0 0 1
9 10 0 0 1
10 7 1 0 0
11 7 1 0 0
12 7 0 1 0
13 7 0 1 0
14 7 0 0 1
15 7 0 0 1
16 7 0 0 1
17 2 1 0 0
18 2 1 0 0
19 1 0 1 0
Upvotes: 1
Reputation: 15364
Here is a possible solution:
import numpy as np
import pandas as pd

E = pd.DataFrame(np.eye(df.shape[1] - 1, dtype=int))

result = pd.DataFrame(
    df['qty'].reindex(df.index.repeat(df['qty'])).reset_index(drop=True),
)
result[df.columns[1:]] = pd.concat(
    [E.reindex(E.index.repeat(df.iloc[i, 1:]))
     for i in range(len(df))],
).reset_index(
    drop=True,
)
Here is the result:
>>> result
qty status 1 status 2 status 3
0 10 1 0 0
1 10 1 0 0
2 10 1 0 0
3 10 1 0 0
4 10 1 0 0
5 10 0 1 0
6 10 0 1 0
7 10 0 1 0
8 10 0 0 1
9 10 0 0 1
10 7 1 0 0
11 7 1 0 0
12 7 0 1 0
13 7 0 1 0
14 7 0 0 1
15 7 0 0 1
16 7 0 0 1
17 2 1 0 0
18 2 1 0 0
19 1 0 1 0
Upvotes: 3
Reputation: 153460
You can use this approach: instead of repeating on df['qty'], repeat on the status columns themselves, then concatenate the results and sort:
df = pd.DataFrame({'qty': [10, 7, 2, 1],
                   'status 1': [5, 2, 2, 0],
                   'status 2': [3, 2, 0, 1],
                   'status 3': [2, 3, 0, 0]
                   })

cols = ['status 1', 'status 2', 'status 3']

df_out = pd.concat([df.loc[df.index.repeat(df[col]), [col] + ['qty']].reset_index()
                    for col in cols], ignore_index=True).sort_values(['index'] + cols)

df_out[cols] = df_out[cols].notna().astype(int)

df_out[['qty'] + cols]
Output:
qty status 1 status 2 status 3
0 10 1 0 0
1 10 1 0 0
2 10 1 0 0
3 10 1 0 0
4 10 1 0 0
9 10 0 1 0
10 10 0 1 0
11 10 0 1 0
15 10 0 0 1
16 10 0 0 1
5 7 1 0 0
6 7 1 0 0
12 7 0 1 0
13 7 0 1 0
17 7 0 0 1
18 7 0 0 1
19 7 0 0 1
7 2 1 0 0
8 2 1 0 0
14 1 0 1 0
Upvotes: 2
Reputation: 431
Hardcoded, but a more human-readable approach:
import pandas as pd

df = pd.DataFrame({'qty': [10, 7, 2, 1],
                   'status 1': [5, 2, 2, 0],
                   'status 2': [3, 2, 0, 1],
                   'status 3': [2, 3, 0, 0]
                   })

df2 = pd.DataFrame(data=None, columns=df.columns)

cnt = 0
for idx, row in df.iterrows():
    s_one = row['status 1']
    s_two = row['status 2']
    s_three = row['status 3']
    while s_one > 0:
        df2.loc[cnt] = [row['qty'], 1, 0, 0]
        s_one -= 1
        cnt += 1
    while s_two > 0:
        df2.loc[cnt] = [row['qty'], 0, 1, 0]
        s_two -= 1
        cnt += 1
    while s_three > 0:
        df2.loc[cnt] = [row['qty'], 0, 0, 1]
        s_three -= 1
        cnt += 1

print(df2)
Same output as the other answers.
Upvotes: 1