Reputation: 97
I need to duplicate a row based on a column (a number).
I have this dataframe:
Kids City Preferred Transport Count
0 Sofia Boston Bus 1.0
1 Claire Ann NaN NaN 1.0
2 Joe Detroit Train 3.0
3 Betty NaN Car 2.0
4 Archie Bruxelles NaN 1.0
5 Joe NaN Airplane 3.0
6 Phil Berlin Ship 1.0
7 Luke NaN Airplane 1.0
Each kid has a count number. It should appear in the dataframe as much as the number from the Count column shows. (Ex: it should only be 1 Sofia, 2 Bettys, and 3 Joes). But only have their names copied, like this:
Kids City Preferred Transport Count
0 Sofia Boston Bus 1.0
1 Claire Ann NaN NaN 1.0
2 Joe Detroit Train 3.0
3 Betty NaN Car 2.0
4 Archie Bruxelles NaN 1.0
5 Joe NaN Airplane 3.0
6 Phil Berlin Ship 1.0
7 Luke NaN Airplane 1.0
8 Joe NaN NaN 3.0
9 Betty NaN NaN 2.0
I tried this code, using Pandas:
kid = nyc_trip_df.loc[nyc_trip_df['Kids'].count() > nyc_trip_df['Count'], 'Kids']
Because I thought that I would be a good idea to put all the kids that have their names appear less then their corresponding number, but somewhere I failed. And if it was successfull, then I would just append the kids to the dataframe, but I am not sure that this is the best approach though....
Can you help me please?
How can I make this work ?
Upvotes: 3
Views: 72
Reputation: 71689
s1 = df['Kids'].value_counts()
s2 = df.groupby('Kids')['Count'].first()
df = df.append(s2.loc[s2.index.repeat(s2 - s1)].reset_index(), ignore_index=True)
Calculate the count
of unique values in Kids
column using value_counts
>>> s1
Joe 2
Sofia 1
Betty 1
Claire Ann 1
Phil 1
Luke 1
Archie 1
Name: Kids, dtype: int64
Group
the dataframe by Kids
column and aggregate the Count
column using first
>>> s2
Kids
Archie 1.0
Betty 2.0
Claire Ann 1.0
Joe 3.0
Luke 1.0
Phil 1.0
Sofia 1.0
Name: Count, dtype: float64
Subtract s2
from s1
to create a series which contains the number of times the name of each Kid should be repeated
>>> s2 - s1
Archie 0.0
Betty 1.0
Claire Ann 0.0
Joe 1.0
Luke 0.0
Phil 0.0
Sofia 0.0
dtype: float64
Now, repeat the index of s2
using the above calculated repetitions
>>> s2.loc[s2.index.repeat(s2 - s1)]
Kids
Betty 2.0
Joe 3.0
Name: Count, dtype: float64
Append
the above repetitions to the original dataframe df
>>> df
Kids City Preferred Transport Count
0 Sofia Boston Bus 1.0
1 Claire Ann NaN NaN 1.0
2 Joe Detroit Train 3.0
3 Betty NaN Car 2.0
4 Archie Bruxelles NaN 1.0
5 Joe NaN Airplane 3.0
6 Phil Berlin Ship 1.0
7 Luke NaN Airplane 1.0
8 Betty NaN NaN 2.0
9 Joe NaN NaN 3.0
Upvotes: 2