Reputation: 259
My dataframe:
df
Object quantity
A 3
B 4
C 10
D 11
My desired result:
df
Object quantity
A 3
B 4
C 4
C 4
C 2
D 4
D 4
D 3
My goal here is to split the values stored in the second column, "quantity", so that each resulting row has a quantity of 4 or less.
Which method can I use to solve this problem? Any suggestion would be appreciated.
Upvotes: 2
Views: 131
Reputation: 10150
Something like this could work. For each group where quantity is greater than 4, apply a function that does the row splitting and stores the result in a temp dataframe, then combine everything to get your desired output:
import pandas as pd

df = pd.DataFrame({'idx': ['A', 'B', 'C', 'D'],
                   'quantity': [3, 4, 10, 11]})

def split_quant(df):
    # Build a list of 4s plus the remainder for this group's quantity
    quantities = ([4] * (df['quantity'].iat[0] // 4)) + [df['quantity'].iat[0] % 4]
    temp = pd.DataFrame({'idx': df['idx'].iat[0],
                         'quantity': quantities},
                        index=range(len(quantities)))
    # Drop the remainder row if the quantity divided evenly
    temp = temp[temp['quantity'] != 0]
    return temp

df_split = df[df['quantity'] > 4].groupby('idx').apply(split_quant)
output = df[df['quantity'] <= 4].append(df_split).reset_index(drop=True)

writer = pd.ExcelWriter('output.xlsx')
output.to_excel(writer, 'Sheet1', index=False)
writer.save()
The above will give you the following output dataframe:
idx quantity
0 A 3
1 B 4
2 C 4
3 C 4
4 C 2
5 D 4
6 D 4
7 D 3
EDIT:
I took the liberty of running some timing tests of the various methods. Using Pandas' groupby and apply saves a lot of time and avoids the nested loops over the input data (although I'm sure there's an even faster way that avoids the apply as well...)
Mine:
5.49 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@Iqbal Basyar:
22.8 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
@sobek:
17.7 ms ± 922 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
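For reference, an apply-free version might look like the following. This is only an illustrative sketch I'm adding here, not one of the timed methods: it uses Index.repeat to expand each row ceil(quantity / 4) times, fills every chunk with 4, and then overwrites the last chunk of each group with the remainder.

import pandas as pd

df = pd.DataFrame({'idx': ['A', 'B', 'C', 'D'],
                   'quantity': [3, 4, 10, 11]})

# Rows needed per input row: ceil(quantity / 4)
n_rows = -(-df['quantity'] // 4)

# Repeat each 'idx' label that many times and default every chunk to 4
out = df.loc[df.index.repeat(n_rows), ['idx']].reset_index(drop=True)
out['quantity'] = 4

# The last chunk of each group gets the remainder, unless it divides evenly
last = n_rows.cumsum() - 1
remainder = df['quantity'] % 4
out.loc[last[remainder > 0].values, 'quantity'] = remainder[remainder > 0].values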
Upvotes: 1
Reputation: 1426
This works, but as far as pandas is concerned it's neither pretty nor fast:
import pandas as pd

df = pd.DataFrame({'idx': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   'quantity': [1., 2., 3., 4., 5., 6., 7.]})

df['factor'] = df.quantity // 4.
df['modulo'] = df.quantity % 4.

res = pd.DataFrame({'idx': [], 'quantity': []})
for idx, row in df.iterrows():
    # One row with quantity 4 for each full multiple of 4
    for idxx in range(int(row.factor)):
        res = res.append({'idx': row.idx, 'quantity': 4.},
                         ignore_index=True)
    # Plus one row carrying the remainder, if any
    if row.modulo > 0:
        res = res.append({'idx': row.idx, 'quantity': row.modulo},
                         ignore_index=True)
In [24]: df
Out[24]:
idx quantity
0 A 1.0
1 B 2.0
2 C 3.0
3 D 4.0
4 E 5.0
5 F 6.0
6 G 7.0
In [22]: res
Out[22]:
idx quantity
0 A 1.0
1 B 2.0
2 C 3.0
3 D 4.0
4 E 4.0
5 E 1.0
6 F 4.0
7 F 2.0
8 G 4.0
9 G 3.0
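(Side note, not part of the original answer: on pandas 2.0+ DataFrame.append no longer exists, and growing a frame row by row is a large part of what makes this slow. A close variant of the same loop that collects plain dicts in a list and builds the frame once at the end might look like the sketch below; divmod is just a stand-in for the factor/modulo columns above.)

import pandas as pd

df = pd.DataFrame({'idx': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   'quantity': [1., 2., 3., 4., 5., 6., 7.]})

rows = []
for _, row in df.iterrows():
    full, rest = divmod(row.quantity, 4.)
    # One dict per full chunk of 4
    rows.extend({'idx': row.idx, 'quantity': 4.} for _ in range(int(full)))
    # Plus the remainder, if any
    if rest > 0:
        rows.append({'idx': row.idx, 'quantity': rest})

res = pd.DataFrame(rows)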
Upvotes: 1
Reputation: 167
Unfortunately, Pandas did not support this feature. So you have to create a new dataframe based on your old dataframe.
For each item in old dataframe, calculate
old_quantity = n * 4 + rest_quantitity
So in the new dataframe, you will add n item(s) with quantity of 4 , plus one with quantity of rest_quantity (if rest_quantity is not zero)
import pandas as pd

df = pd.DataFrame({'item': ["A", "B", "C"], 'qty': [3, 8, 11]})
new_df = pd.DataFrame({'Item': [], 'qty': []})

for idx, item in df.iterrows():
    if item['qty'] > 4:
        n = item['qty'] // 4   # number of full rows of 4
        r = item['qty'] % 4    # remaining quantity
        for _ in range(n):
            new_df.loc[len(new_df)] = [item['item'], 4]
        if r > 0:
            new_df.loc[len(new_df)] = [item['item'], r]
    else:
        new_df.loc[len(new_df)] = [item['item'], item['qty']]
df
item qty
0 A 3
1 B 8
2 C 11
new_df
Item qty
0 A 3.0
1 B 4.0
2 B 4.0
3 C 4.0
4 C 4.0
5 C 3.0
Upvotes: 1