Reputation: 607
What is the best way to create a new DataFrame from a function applied to each row of a DataFrame? The ultimate goal is to concatenate (rbind) all the resulting DataFrames.
Input:
Name Age
0 tom 10
1 nick 15
2 juli 14
Example:
import pandas as pd
import pdb
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
def foo(row):
    #pdb.set_trace()
    new_df = row.to_frame(name='Values')
    new_df.loc[new_df.index=='Name','New_column'] = 'Surname'
    new_df.loc[new_df.index=='Age','New_column'] = '+5 months'
    return new_df
df.apply(foo, axis=1)
Output:
data = {'Values': ['tom', '10', 'nick', '15', 'juli', '14'],
        'New_column': ['Surname', '+5 months', 'Surname', '+5 months', 'Surname',
                       '+5 months']}
output = pd.DataFrame(data)
  Values New_column
0    tom    Surname
1     10  +5 months
2   nick    Surname
3     15  +5 months
4   juli    Surname
5     14  +5 months
If .apply() is not the best option, I would appreciate an alternative.
For R users: I am looking for the equivalent of do.call(rbind, sapply()).
Thanks.
Upvotes: 2
Views: 1240
Reputation: 42916
Without using apply, which is pretty slow, we can use pandas and numpy methods here: DataFrame.T (transpose), melt and numpy.tile:
import numpy as np
df = df.T.melt().drop(columns='variable')
# after the melt, len(df) is twice the original row count, hence // 2
df['New_column'] = np.tile(['Surname', '+5 months'], len(df)//2)
  value New_column
0   tom    Surname
1    10  +5 months
2  nick    Surname
3    15  +5 months
4  juli    Surname
5    14  +5 months
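If you would rather not overwrite df or hardcode the // 2, here is a minimal sketch of the same idea that flattens the values row by row and tiles one label per column. The names labels and out are my own, not from the question, and it assumes the labels are listed in the same order as the columns of df.

```python
import numpy as np
import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Assumption: one label per column of df, in column order (hypothetical name)
labels = ['Surname', '+5 months']

# Row-major flattening gives tom, 10, nick, 15, juli, 14
values = df.to_numpy().ravel()

# Repeat the label pattern once per original row
out = pd.DataFrame({'value': values,
                    'New_column': np.tile(labels, len(df))})
print(out)
```

Because df is left untouched, len(df) here is still the original number of rows.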
Upvotes: 1
Reputation: 187
Here is a different approach that uses built-in functions of pandas and numpy.
import pandas as pd
import numpy as np
# create df
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
# provide unique ids for each row
df['id']=df.index
# Unpivot DataFrame using unique id as reference
n = df.melt(id_vars=['id'], value_vars=['Name', 'Age'])
# add 'new_column' and fill its values with np.where
n['new_column'] = np.where(n['variable'] == 'Name', 'Surname', '+5 months')
# sort df to pair name and age (a stable sort keeps Name before Age within each id)
n = n.sort_values('id', kind='stable')
# assign row names
n.index = n['variable']
# drop unnecessary columns (drop returns a new DataFrame, so reassign)
n = n.drop(columns=['id', 'variable'])
print(n)
Output:
value new_column
variable
Name tom Surname
Age 10 +5 months
Name nick Surname
Age 15 +5 months
Name juli Surname
Age 14 +5 months
Upvotes: 0
Reputation: 30971
Start from one improvement in your function:
def foo(row):
    new_df = row.to_frame(name='Values')
    new_df.loc['Name', 'New_column'] = 'Surname'
    new_df.loc['Age', 'New_column'] = '+5 months'
    return new_df
(The new_df.index== comparison is not needed; .loc accepts the row label directly.)
To get your output, convert the Series of DataFrames (resulting from apply) into an ordinary list (of DataFrames) and concatenate them.
The code to do this is:
pd.concat(df.apply(foo, axis=1).tolist())
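For reference, here is a self-contained sketch of the same "build one frame per row, then concat" pattern. It swaps apply for a plain list comprehension over iterrows, so it does not depend on how apply wraps DataFrame return values. The labels dict is my own helper, not part of the question.

```python
import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Hypothetical lookup (not from the question): label for each source column
labels = {'Name': 'Surname', 'Age': '+5 months'}

def foo(row):
    # One small frame per input row, indexed by the original column names
    new_df = row.to_frame(name='Values')
    new_df['New_column'] = [labels[col] for col in new_df.index]
    return new_df

# Build the per-row frames explicitly, then stack them (the rbind step)
frames = [foo(row) for _, row in df.iterrows()]
result = pd.concat(frames)
print(result)
```

This keeps Name/Age as the index of the result. Passing ignore_index=True to pd.concat should give the 0..5 index shown in the question instead.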
Upvotes: 2