Reputation: 748
Lets say I start with a dataframe that has some data and a column of quantities:
In: df=pd.DataFrame({'first-name':['Jan','Leilani'],'Qty':[2,4]})
Out: Qty first-name
2 Jan
4 Leilani
I want to create a dataframe that copies and labels the data into new lines a number of times equal to the quantity on each line. Here is what the output should look like:
Qty first-name position
2 Jan 1
2 Jan 2
4 Leilani 1
4 Leilani 2
4 Leilani 3
4 Leilani 4
I can do this using python like so:
l=[]
x=0
for idx in df.index:
x=0
for _ in range(df.loc[idx]['Qty']):
x+=1
tempSrs=df.loc[idx]
tempSrs['position']=x
l.append(tempSrs)
outDf=pd.DataFrame(l)
This is very slow. Is there a way to do this using pandas functions? This is effectively an "unpivot", which in pandas is "melt", but I wasn't able to figure out how to use the melt function to accomplish this.
Thanks,
Upvotes: 5
Views: 4506
Reputation: 294308
The differences are:
loc
instead of reindex
(same thing)assign
instead of =
assignment (assign
produces a copy)lambda
to assign
to embed groupby
logicdf.loc[df.index.repeat(df.Qty)].assign(
position=lambda d: d.groupby('first-name').cumcount() + 1
)
Qty first-name position
0 2 jan 1
0 2 jan 2
1 4 jay 1
1 4 jay 2
1 4 jay 3
1 4 jay 4
np.arange
q = df.Qty.values
r = np.arange(q.sum()) - np.append(0, q[:-1]).cumsum().repeat(q) + 1
df.loc[df.index.repeat(q)].assign(position=r)
Qty first-name position
0 2 jan 1
0 2 jan 2
1 4 jay 1
1 4 jay 2
1 4 jay 3
1 4 jay 4
Upvotes: 4
Reputation: 323276
With repeat
and cumcount
Newdf=df.reindex(df.index.repeat(df.Qty))
Newdf['position']=Newdf.groupby(level=0).cumcount()+1
Newdf
Out[931]:
Qty first-name position
0 2 jan 1
0 2 jan 2
1 4 jay 1
1 4 jay 2
1 4 jay 3
1 4 jay 4
Upvotes: 7
Reputation: 164693
Here is an intuitive way using numpy.repeat
and itertools.chain
.
For larger dataframes, this is likely to be more efficient than a pandorable
method.
import pandas as pd
import numpy as np
from itertools import chain
df = pd.DataFrame({'first-name':['jan','jay'],'Qty':[2,4]})
lens = df['Qty'].values
res = pd.DataFrame({'Qty': np.repeat(df['Qty'], lens),
'first-name': np.repeat(df['first-name'], lens),
'Count': list(chain.from_iterable(range(1, i+1) for i in lens))})
print(res)
Count Qty first-name
0 1 2 jan
0 2 2 jan
1 1 4 jay
1 2 4 jay
1 3 4 jay
1 4 4 jay
Upvotes: 2