Reputation: 61
I have a function that is getting passed a pandas dataframe, and for each row in that dataframe I'd like to create N other rows, each of which are equivalent to the original row except for 2 column values.
What's the right way to do this - especially in a RAM-effective manner?
My attempts so far have been to run pd.apply
, and then for each of those rows call a function that returns a list of pd.Series
objects that I would then call append
on to add them to the original DataFrame
. This hasn't worked out, though.
Here is an example I tried with some dummy code to replicate:
students = [ ('Jack', 34, 'Sydney' , 'Australia') ,
('Jill', 30, 'New York' , 'USA' ) ]
# Create a DataFrame object
df = pd.DataFrame(students, columns = ['Name' , 'Age', 'City' , 'Country'], index=['a', 'b', 'c' , 'd' , 'e' , 'f'])
# function I will use to explode a single row into 10 new rows
def replicate(x):
new_rows = []
i = 0
for j in range(3):
y = x.copy(deep=True)
y.Age = i
i += 1
new_rows.append(y)
return new_rows
# Iterate over each row and append the results
df.apply(lambda x: df.append(replicate(x))
For the above, I'd expect output like the following:
Jack, 34, Sydney, Australia
Jack, 0, Sydney, Australia
Jack, 1, Sydney, Australia
Jack, 2, Sydney, Australia
Jill, 30, New York, USA
Jill, 0, New York, USA
Jill, 1, New York, USA
Jill, 2, New York, USA
In the end, I'd like my dataframe to have N times as many rows, where I can compute derived rows from the original rows. I'd like to do this in a space effective manner, and this isn't happening right now. Any help is appreciated!
Upvotes: 6
Views: 2324
Reputation: 323236
IIUC
d={x : y.set_index('Age').reindex(range(y['Age'].iloc[0]+1),method='bfill') for x , y in df.groupby(level=0)}
newdf=pd.concat(d).reset_index(level=1)
newdf
Out[220]:
Age Name City Country
a 0 Jack Sydney Australia
a 1 Jack Sydney Australia
a 2 Jack Sydney Australia
a 3 Jack Sydney Australia
a 4 Jack Sydney Australia
a 5 Jack Sydney Australia
a 6 Jack Sydney Australia
a 7 Jack Sydney Australia
a 8 Jack Sydney Australia
a 9 Jack Sydney Australia
a 10 Jack Sydney Australia
a 11 Jack Sydney Australia
a 12 Jack Sydney Australia
a 13 Jack Sydney Australia
a 14 Jack Sydney Australia
a 15 Jack Sydney Australia
a 16 Jack Sydney Australia
a 17 Jack Sydney Australia
a 18 Jack Sydney Australia
a 19 Jack Sydney Australia
a 20 Jack Sydney Australia
a 21 Jack Sydney Australia
a 22 Jack Sydney Australia
a 23 Jack Sydney Australia
a 24 Jack Sydney Australia
a 25 Jack Sydney Australia
a 26 Jack Sydney Australia
a 27 Jack Sydney Australia
a 28 Jack Sydney Australia
a 29 Jack Sydney Australia
.. ... ... ... ...
b 1 Jill New York USA
b 2 Jill New York USA
b 3 Jill New York USA
b 4 Jill New York USA
b 5 Jill New York USA
b 6 Jill New York USA
b 7 Jill New York USA
b 8 Jill New York USA
b 9 Jill New York USA
b 10 Jill New York USA
b 11 Jill New York USA
b 12 Jill New York USA
b 13 Jill New York USA
b 14 Jill New York USA
b 15 Jill New York USA
b 16 Jill New York USA
b 17 Jill New York USA
b 18 Jill New York USA
b 19 Jill New York USA
b 20 Jill New York USA
b 21 Jill New York USA
b 22 Jill New York USA
b 23 Jill New York USA
b 24 Jill New York USA
b 25 Jill New York USA
b 26 Jill New York USA
b 27 Jill New York USA
b 28 Jill New York USA
b 29 Jill New York USA
b 30 Jill New York USA
[66 rows x 4 columns]
Upvotes: 1
Reputation: 59549
IIUC, you want np.repeat
, using the Age
column to specify the number of repeats, then fix the age column after the fact.
import pandas as pd
df1 = pd.DataFrame(df.values.repeat(df.Age+1, axis=0),
columns=['Name', 'Age', 'City', 'Country'])
df1['Age'] = (df1.groupby([*df1]).cumcount()-1).where(df1.duplicated(), df1['Age'])
df1
: Name Age City Country
0 Jack 34 Sydney Australia
1 Jack 0 Sydney Australia
2 Jack 1 Sydney Australia
3 Jack 2 Sydney Australia
4 Jack 3 Sydney Australia
...
34 Jack 33 Sydney Australia
35 Jill 30 New York USA
...
63 Jill 27 New York USA
64 Jill 28 New York USA
65 Jill 29 New York USA
[66 rows x 4 columns]
df
: Name Age City Country
a Jack 34 Sydney Australia
b Jill 30 New York USA
Upvotes: 1
Reputation: 365
You can put you dataframe inside a list and then do whatever you want:
# x5 row duplicate
df = df.append([df]*5, ignore_index=True)
df.sort_values(by='Name').head(15)
# Result
Name Age City Country
28 John 16 New York US
4 John 16 New York US
22 John 16 New York US
34 John 16 New York US
16 John 16 New York US
10 John 16 New York US
17 Mike 17 las vegas US
29 Mike 17 las vegas US
23 Mike 17 las vegas US
11 Mike 17 las vegas US
35 Mike 17 las vegas US
5 Mike 17 las vegas US
3 Neelu 32 Bangalore India
33 Neelu 32 Bangalore India
15 Neelu 32 Bangalore India
Upvotes: 0