D. F. R. 123
D. F. R. 123

Reputation: 97

How to duplicate rows from a DataFrame based on a column in Python?

I need to duplicate a row based on a column (a number).

I have this dataframe:

         Kids       City Preferred Transport  Count
0       Sofia     Boston                 Bus    1.0
1  Claire Ann        NaN                 NaN    1.0
2         Joe    Detroit               Train    3.0
3       Betty        NaN                 Car    2.0
4      Archie  Bruxelles                 NaN    1.0
5         Joe        NaN            Airplane    3.0
6        Phil     Berlin                Ship    1.0
7        Luke        NaN            Airplane    1.0

Each kid has a count number. It should appear in the dataframe as much as the number from the Count column shows. (Ex: it should only be 1 Sofia, 2 Bettys, and 3 Joes). But only have their names copied, like this:

         Kids       City Preferred Transport  Count
0       Sofia     Boston                 Bus    1.0
1  Claire Ann        NaN                 NaN    1.0
2         Joe    Detroit               Train    3.0
3       Betty        NaN                 Car    2.0
4      Archie  Bruxelles                 NaN    1.0
5         Joe        NaN            Airplane    3.0
6        Phil     Berlin                Ship    1.0
7        Luke        NaN            Airplane    1.0
8        Joe         NaN                NaN     3.0
9        Betty       NaN                NaN     2.0

I tried this code, using Pandas:

kid = nyc_trip_df.loc[nyc_trip_df['Kids'].count() > nyc_trip_df['Count'], 'Kids']

Because I thought that I would be a good idea to put all the kids that have their names appear less then their corresponding number, but somewhere I failed. And if it was successfull, then I would just append the kids to the dataframe, but I am not sure that this is the best approach though....

Can you help me please?

How can I make this work ?

Upvotes: 3

Views: 72

Answers (1)

Shubham Sharma
Shubham Sharma

Reputation: 71689

Solution

s1 = df['Kids'].value_counts()
s2 = df.groupby('Kids')['Count'].first()   

df = df.append(s2.loc[s2.index.repeat(s2 - s1)].reset_index(), ignore_index=True)

Explanations

Calculate the count of unique values in Kids column using value_counts

>>> s1

Joe           2
Sofia         1
Betty         1
Claire Ann    1
Phil          1
Luke          1
Archie        1
Name: Kids, dtype: int64

Group the dataframe by Kids column and aggregate the Count column using first

>>> s2

Kids
Archie        1.0
Betty         2.0
Claire Ann    1.0
Joe           3.0
Luke          1.0
Phil          1.0
Sofia         1.0
Name: Count, dtype: float64

Subtract s2 from s1 to create a series which contains the number of times the name of each Kid should be repeated

>>> s2 - s1

Archie        0.0
Betty         1.0
Claire Ann    0.0
Joe           1.0
Luke          0.0
Phil          0.0
Sofia         0.0
dtype: float64

Now, repeat the index of s2 using the above calculated repetitions

>>> s2.loc[s2.index.repeat(s2 - s1)]

Kids
Betty    2.0
Joe      3.0
Name: Count, dtype: float64

Append the above repetitions to the original dataframe df

>>> df

         Kids       City Preferred Transport  Count
0       Sofia     Boston                 Bus    1.0
1  Claire Ann        NaN                 NaN    1.0
2         Joe    Detroit               Train    3.0
3       Betty        NaN                 Car    2.0
4      Archie  Bruxelles                 NaN    1.0
5         Joe        NaN            Airplane    3.0
6        Phil     Berlin                Ship    1.0
7        Luke        NaN            Airplane    1.0
8       Betty        NaN                 NaN    2.0
9         Joe        NaN                 NaN    3.0

Upvotes: 2

Related Questions