Justin CR
Justin CR

Reputation: 199

How to explode pandas dataframe based on number value in a specific column

I am trying to explode an existing dataframe based on a numeric value in a column. For example, if the column has a numeric value of 3, I want to have 3 of those rows, so on and so on.

Assuming we start with this dataframe:

inventory_partner inventory_partner2  calc
0              A1                 aa     1
1              A2                 bb     2
2              A3                 cc     5
3              A4                 dd     4
4              A5                 ee     5
5              A6                 ff     3

How do we get to this dataframe?

  inventory_partner inventory_partner2  calc
0                A1                 aa     1
1                A2                 bb     2
1                A2                 bb     2
2                A3                 cc     5
2                A3                 cc     5
2                A3                 cc     5
2                A3                 cc     5
2                A3                 cc     5
3                A4                 dd     4
3                A4                 dd     4
3                A4                 dd     4
3                A4                 dd     4
4                A5                 ee     5
4                A5                 ee     5
4                A5                 ee     5
4                A5                 ee     5
4                A5                 ee     5
5                A6                 ff     3
5                A6                 ff     3
5                A6                 ff     3

I have gotten this to work by using the below code, but I was wondering if there is an easier way to accomplish this without having to manually create the comma separated lists to feed into the explode method.

import pandas as pd

#create dataframe
d = {'inventory_partner': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6'], 'inventory_partner2': ['aa', 'bb', 'cc', 'dd', 'ee', 'ff'], 'calc': [1, 2, 5, 4, 5, 3]}
df1 = pd.DataFrame(data=d)

print(df1) #print original dataframe

#create my_comma_list column based on number values in calc column
df1.insert(3, 'my_comma_list', '')
df1.loc[df1['calc'] == 1, 'my_comma_list'] = '1'
df1.loc[df1['calc'] == 2, 'my_comma_list'] = '1, 2'
df1.loc[df1['calc'] == 3, 'my_comma_list'] = '1, 2, 3'
df1.loc[df1['calc'] == 4, 'my_comma_list'] = '1, 2, 3, 4'
df1.loc[df1['calc'] == 5, 'my_comma_list'] = '1, 2, 3, 4, 5'

print(df1) #print before row explosion

#explode the rows using the my_comma_list column to get desired number of rows
df1 = df1.assign(my_comma_list=df1['my_comma_list'].str.split(',')).explode('my_comma_list')
#drop the my_comma_list column since we no longer need it
del df1['my_comma_list']

print(df1) #print after row explosion

Upvotes: 4

Views: 1136

Answers (2)

Joe Ferndz
Joe Ferndz

Reputation: 8508

You can use Index.repeat and DataFrame.loc to repeat rows.

import pandas as pd

#create dataframe
d = {'inventory_partner': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
     'inventory_partner2': ['aa', 'bb', 'cc', 'dd', 'ee', 'ff'],
     'calc': [1, 2, 5, 4, 5, 3]}
df1 = pd.DataFrame(data=d)
print (df1)
df1 = df1.loc[df1.index.repeat(df1['calc'])]
print (df1)

Output is:

Original DataFrame:

  inventory_partner inventory_partner2  calc
0                A1                 aa     1
1                A2                 bb     2
2                A3                 cc     5
3                A4                 dd     4
4                A5                 ee     5
5                A6                 ff     3

Updated DataFrame with repeated rows:

  inventory_partner inventory_partner2  calc
0                A1                 aa     1
1                A2                 bb     2
1                A2                 bb     2
2                A3                 cc     5
2                A3                 cc     5
2                A3                 cc     5
2                A3                 cc     5
2                A3                 cc     5
3                A4                 dd     4
3                A4                 dd     4
3                A4                 dd     4
3                A4                 dd     4
4                A5                 ee     5
4                A5                 ee     5
4                A5                 ee     5
4                A5                 ee     5
4                A5                 ee     5
5                A6                 ff     3
5                A6                 ff     3
5                A6                 ff     3

If you want to repeat rows based on a column value with a reference lookup, you can create a dictionary and identify how many times you want it to repeat, then use map to pass the value.

Let's say, you want to repeat based on the value in inventory_partner. Then you can do this:

import pandas as pd

inv_partner_dict = {'A1':1, 'A2':2, 'A3':5, 'A4':4,'A5':5,'A6':3}

#create dataframe
d = {'inventory_partner': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
     'inventory_partner2': ['aa', 'bb', 'cc', 'dd', 'ee', 'ff'],
     'calc': [1, 2, 5, 4, 5, 3]}
df1 = pd.DataFrame(data=d)


print (df1)
df1 = df1.loc[df1.index.repeat(df1['inventory_partner2'].map(inv_partner_dict))]
print (df1)

This will do the same thing.

The output of this will be:

Original DataFrame:

  inventory_partner inventory_partner2  calc
0                A1                 aa     1
1                A2                 bb     2
2                A3                 cc     5
3                A4                 dd     4
4                A5                 ee     5
5                A6                 ff     3

Updated DataFrame with repeated rows:

  inventory_partner inventory_partner2  calc
0                A1                 aa     1
1                A2                 bb     2
1                A2                 bb     2
2                A3                 cc     5
2                A3                 cc     5
2                A3                 cc     5
2                A3                 cc     5
2                A3                 cc     5
3                A4                 dd     4
3                A4                 dd     4
3                A4                 dd     4
3                A4                 dd     4
4                A5                 ee     5
4                A5                 ee     5
4                A5                 ee     5
4                A5                 ee     5
4                A5                 ee     5
5                A6                 ff     3
5                A6                 ff     3
5                A6                 ff     3

Upvotes: 5

sammywemmy
sammywemmy

Reputation: 28644

Use pd.Series.repeat to get the index and then reindex:

df.reindex(df.inventory_partner2.repeat(df.calc).index)

Upvotes: 1

Related Questions