Reputation: 33970
How to do this in pandas:
I have a function extract_text_features that operates on a single text column and returns multiple output columns. Specifically, the function returns 6 values.
The function works; however, there doesn't seem to be any proper return type (pandas DataFrame / numpy array / Python list) such that the output can be correctly assigned with df.ix[: ,10:16] = df.textcol.map(extract_text_features)
So I think I need to drop back to iterating with df.iterrows(), as per this?
UPDATE:
Iterating with df.iterrows() is at least 20x slower, so I surrendered and split out the function into six distinct .map(lambda ...) calls.
UPDATE 2: this question was asked back around v0.11.0, before the usability of df.apply was improved and before df.assign() was added in v0.16. Hence much of the question and many of the answers are no longer very relevant.
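As a minimal sketch of the df.assign() route mentioned in the update: when each feature can be computed on its own, the new columns can be created in one call. The DataFrame contents and the feature names n_chars and n_words are assumptions made for illustration; they are not the six features from the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame and text column.
df = pd.DataFrame({"textcol": ["he jumped", "she ran", "they hiked"]})

# assign() returns a new DataFrame with the extra columns; df itself is unchanged.
out = df.assign(
    n_chars=df["textcol"].str.len(),              # characters per row
    n_words=df["textcol"].str.split().str.len(),  # whitespace-separated words per row
)
print(out)
```

This only fits the case where the features are independent of each other; a function that must compute all values in a single pass still needs one of the apply-based approaches in the answers.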
Upvotes: 389
Views: 378138
Reputation: 1075
Just to add to this, for me it was also necessary in some cases to use the unstack() method, because otherwise I'd just get a new column that contained a dictionary.
It works like this:
df.groupby('variable')['value'].apply(lambda grp: {
    'Min': grp.min(),
    'Median': grp.median(),
    'Max': grp.max()
}).unstack()
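For this particular min/median/max case, a possible alternative that skips the dictionary and the unstack() step is named aggregation, assuming pandas 0.25 or later. This is a different technique from the one in the answer, shown only as a sketch; the sample data is made up.

```python
import pandas as pd

# Made-up data with the same column names as the snippet above.
df = pd.DataFrame({
    "variable": ["a", "a", "a", "b", "b"],
    "value": [1.0, 2.0, 6.0, 10.0, 20.0],
})

# Each keyword becomes an output column; the string names a built-in reduction.
summary = df.groupby("variable")["value"].agg(Min="min", Median="median", Max="max")
print(summary)
```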
Upvotes: 0
Reputation: 4414
In 2020, I use apply() with argument result_type='expand':
applied_df = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')
fn() should return a dict; its keys will be the new column names.
Alternatively you can do a one-liner by also specifying the column names:
df[["col1", "col2", ...]] = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
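A self-contained sketch of the first variant may help. The body of fn and the sample data are assumptions for illustration; only the apply and concat calls come from the answer.

```python
import pandas as pd

def fn(text):
    # Hypothetical feature extractor: the dict keys become the column names.
    return {"n_chars": len(text), "n_words": len(text.split())}

df = pd.DataFrame({"text": ["he jumped", "she ran", "they hiked"]})

# One dict per row, expanded into one column per key.
applied_df = df.apply(lambda row: fn(row.text), axis="columns", result_type="expand")
df = pd.concat([df, applied_df], axis="columns")
print(df)
```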
Upvotes: 277
Reputation: 8152
Although the question specifies that the function should be applied to a Series, most of the answers seem to be applying the function to a DataFrame, with the function getting the relevant column from each row. This seems somewhat inelegant and potentially slow.
Say the function f takes a value in column df["argument"] and returns two values. The nicest way I've found to do it by applying to the column Series is this:
df[["value_1", "value_2"]] = df["argument"].apply(f).to_list()
Unfortunately, unlike DataFrame.apply, Series.apply has no result_type parameter to expand the result into a DataFrame to assign from. But pandas understands it just as well if you assign a list of tuples.
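A small runnable sketch of this pattern, with a made-up f that returns a tuple of two values:

```python
import pandas as pd

def f(x):
    # Hypothetical function: returns two values per input value.
    return x * 2, x ** 2

df = pd.DataFrame({"argument": [1, 2, 3]})

# apply() yields a Series of tuples; to_list() turns it into a list of tuples,
# which pandas spreads across the two target columns.
df[["value_1", "value_2"]] = df["argument"].apply(f).to_list()
print(df)
```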
Upvotes: 1
Reputation: 258
This works for me:
import pandas as pd
import numpy as np
future = pd.DataFrame(
    pd.date_range('2022-09-01', periods=360),
    columns=['date']
)
def featurize(datetime):
    return pd.Series({
        'month': datetime.month,
        'year': datetime.year,
        'dayofweek': datetime.dayofweek,
        'dayofyear': datetime.dayofyear
    })
future.loc[
    :, ['month', 'year', 'dayofweek', 'dayofyear']
] = future.date.apply(featurize)
future.head()
Output:
date month year dayofweek dayofyear
0 2022-09-01 9 2022 3 244
1 2022-09-02 9 2022 4 245
2 2022-09-03 9 2022 5 246
3 2022-09-04 9 2022 6 247
4 2022-09-05 9 2022 0 248
Upvotes: 3
Reputation: 12417
For me this worked:
Input df
df = pd.DataFrame({'col x': [1,2,3]})
col x
0 1
1 2
2 3
Function
def f(x):
    return pd.Series([x*x, x*x*x])
Create 2 new columns:
df[['square x', 'cube x']] = df['col x'].apply(f)
Output:
col x square x cube x
0 1 1 1
1 2 4 8
2 3 9 27
Upvotes: 44
Reputation: 41
def extract_text_features(feature):
    ...
    ...
    return pd.Series((feature1, feature2))
df[['NewFeature1', 'NewFeature2']] = df[['feature']].apply(extract_text_features, axis=1)
Here a dataframe with a single feature is being converted into two new features. Give this a try too.
Upvotes: 4
Reputation: 1074
I have a more complicated situation, the dataset has a nested structure:
import json
import pandas as pd
data = '{"TextID":{"0":"0038f0569e","1":"003eb6998d","2":"006da49ea0"},"Summary":{"0":{"Crisis_Level":["c"],"Type":["d"],"Special_Date":["a"]},"1":{"Crisis_Level":["d"],"Type":["a","d"],"Special_Date":["a"]},"2":{"Crisis_Level":["d"],"Type":["a"],"Special_Date":["a"]}}}'
df = pd.DataFrame.from_dict(json.loads(data))
print(df)
output:
TextID Summary
0 0038f0569e {'Crisis_Level': ['c'], 'Type': ['d'], 'Specia...
1 003eb6998d {'Crisis_Level': ['d'], 'Type': ['a', 'd'], 'S...
2 006da49ea0 {'Crisis_Level': ['d'], 'Type': ['a'], 'Specia...
The Summary column contains dict objects, so I use apply with from_dict and stack to extract each row of dict:
df2 = df.apply(
    lambda x: pd.DataFrame.from_dict(x[1], orient='index').stack(), axis=1)
print(df2)
output:
Crisis_Level Special_Date Type
0 0 0 1
0 c a d NaN
1 d a a d
2 d a a NaN
Looks good, but the TextID column is missing. To get the TextID column back, I've tried three approaches:
Modify apply to return multiple columns:
df_tmp = df.copy()
df_tmp[['TextID', 'Summary']] = df.apply(
    lambda x: pd.Series([x[0], pd.DataFrame.from_dict(x[1], orient='index').stack()]), axis=1)
print(df_tmp)
output:
TextID Summary
0 0038f0569e Crisis_Level 0 c
Type 0 d
Spec...
1 003eb6998d Crisis_Level 0 d
Type 0 a
...
2 006da49ea0 Crisis_Level 0 d
Type 0 a
Spec...
But this is not what I want: the Summary structure is flattened.
Use pd.concat:
df_tmp2 = pd.concat([df['TextID'], df2], axis=1)
print(df_tmp2)
output:
TextID (Crisis_Level, 0) (Special_Date, 0) (Type, 0) (Type, 1)
0 0038f0569e c a d NaN
1 003eb6998d d a a d
2 006da49ea0 d a a NaN
Looks fine; the MultiIndex column structure is preserved as tuples. But check the type of the columns:
df_tmp2.columns
output:
Index(['TextID', ('Crisis_Level', 0), ('Special_Date', 0), ('Type', 0),
('Type', 1)],
dtype='object')
It is just a regular Index, not a MultiIndex.
Use set_index:
Turn all the columns you want to preserve into the row index, run the apply function, and then call reset_index to get the columns back:
df_tmp3 = df.set_index('TextID')
df_tmp3 = df_tmp3.apply(
    lambda x: pd.DataFrame.from_dict(x[0], orient='index').stack(), axis=1)
df_tmp3 = df_tmp3.reset_index(level=0)
print(df_tmp3)
output:
TextID Crisis_Level Special_Date Type
0 0 0 1
0 0038f0569e c a d NaN
1 003eb6998d d a a d
2 006da49ea0 d a a NaN
Check the type of columns
df_tmp3.columns
output:
MultiIndex(levels=[['Crisis_Level', 'Special_Date', 'Type', 'TextID'], [0, 1, '']],
codes=[[3, 0, 1, 2, 2], [2, 0, 0, 0, 1]])
So, if your apply function returns MultiIndex columns and you want to preserve them, you may want to try the third method.
Upvotes: 0
Reputation: 40638
Building off of user1827356's answer, you can do the assignment in one pass using df.merge:
df.merge(df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1})),
         left_index=True, right_index=True)
textcol feature1 feature2
0 0.772692 1.772692 -0.227308
1 0.857210 1.857210 -0.142790
2 0.065639 1.065639 -0.934361
3 0.819160 1.819160 -0.180840
4 0.088212 1.088212 -0.911788
EDIT: Please be aware of the huge memory consumption and low speed: https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/ !
Upvotes: 136
Reputation: 3417
Just use result_type="expand"
df = pd.DataFrame(np.random.randint(0,10,(10,2)), columns=["random", "a"])
df[["sq_a","cube_a"]] = df.apply(lambda x: [x.a**2, x.a**3], axis=1, result_type="expand")
Upvotes: 64
Reputation: 686
Have posted the same answer in two other similar questions. The way I prefer to do this is to wrap up the return values of the function in a series:
def f(x):
    return pd.Series([x**2, x**3])
And then use apply as follows to create separate columns:
df[['x**2','x**3']] = df.apply(lambda row: f(row['x']), axis=1)
Upvotes: 11
Reputation: 11
You can return the entire row instead of values:
df = df.apply(extract_text_features, axis=1)
where the function returns the row:
def extract_text_features(row):
    row['new_col1'] = value1
    row['new_col2'] = value2
    return row
Upvotes: 1
Reputation: 342
Summary: If you only want to create a few columns, use df[['new_col1','new_col2']] = df[['data1','data2']].apply(function_of_your_choosing, axis=1)
For this solution, the number of new columns you are creating must be equal to the number of columns you use as input to the .apply() function. If you want to do something else, have a look at the other answers.
Details: Let's say you have a two-column dataframe. The first column is a person's height when they are 10; the second is said person's height when they are 20.
Suppose you need to calculate both the mean of each person's heights and the sum of each person's heights. That's two values per row.
You could do this via the following, soon-to-be-applied function:
def mean_and_sum(x):
    """
    Calculates the mean and sum of two heights.
    Parameters:
    :x -- the values in the row this function is applied to. Could also work on a list or a tuple.
    """
    total = x[0] + x[1]
    mean = total / 2
    return [mean, total]
You might use this function like so (note that the function itself is passed to apply, not the result of calling it):
df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)
(To be clear: this apply function takes in the values from each row in the subsetted dataframe and returns a list.)
However, if you do this:
df['Mean_&_Sum'] = df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)
you'll create 1 new column that contains the [mean, total] lists, which you'd presumably want to avoid, because that would require another Lambda/Apply.
Instead, you want to break out each value into its own column. To do this, you can create two columns at once:
df[['Mean','Sum']] = df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)
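How a list returned from apply is treated has varied between pandas versions. One way to make the expansion explicit, assuming pandas 0.23 or later, is to pass result_type="expand". The sketch below uses made-up heights and positional access via .iloc; both are choices made for this example rather than part of the answer.

```python
import pandas as pd

def mean_and_sum(row):
    # row is one row of the two-column subset; use positional access.
    total = row.iloc[0] + row.iloc[1]
    return [total / 2, total]

# Made-up heights in centimetres.
df = pd.DataFrame({
    "height_at_age_10": [140, 150],
    "height_at_age_20": [180, 176],
})

# result_type="expand" turns each returned list into separate columns.
df[["Mean", "Sum"]] = df[["height_at_age_10", "height_at_age_20"]].apply(
    mean_and_sum, axis=1, result_type="expand"
)
print(df)
```

With result_type="expand", the number of returned values no longer has to match the number of input columns.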
Upvotes: 23
Reputation: 3071
This is the correct and easiest way to accomplish this for 95% of use cases:
>>> df = pd.DataFrame(zip(*[range(10)]), columns=['num'])
>>> df
num
0 0
1 1
2 2
3 3
4 4
5 5
>>> def example(x):
...     x['p1'] = x['num']**2
...     x['p2'] = x['num']**3
...     x['p3'] = x['num']**4
...     return x
>>> df = df.apply(example, axis=1)
>>> df
num p1 p2 p3
0 0 0 0 0
1 1 1 1 1
2 2 4 8 16
3 3 9 27 81
4 4 16 64 256
Upvotes: 90
Reputation: 62017
The accepted solution is going to be extremely slow for lots of data. The solution with the greatest number of upvotes is a little difficult to read and also slow with numeric data. If each new column can be calculated independently of the others, I would just assign each of them directly without using apply.
Create 100,000 strings in a DataFrame
df = pd.DataFrame(np.random.choice(['he jumped', 'she ran', 'they hiked'],
size=100000, replace=True),
columns=['words'])
df.head()
words
0 she ran
1 she ran
2 they hiked
3 they hiked
4 they hiked
Let's say we wanted to extract some text features as done in the original question. For instance, let's extract the first character, count the occurrence of the letter 'e' and capitalize the phrase.
df['first'] = df['words'].str[0]
df['count_e'] = df['words'].str.count('e')
df['cap'] = df['words'].str.capitalize()
df.head()
words first count_e cap
0 she ran s 1 She ran
1 she ran s 1 She ran
2 they hiked t 2 They hiked
3 they hiked t 2 They hiked
4 they hiked t 2 They hiked
Timings
%%timeit
df['first'] = df['words'].str[0]
df['count_e'] = df['words'].str.count('e')
df['cap'] = df['words'].str.capitalize()
127 ms ± 585 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def extract_text_features(x):
    return x[0], x.count('e'), x.capitalize()
%timeit df['first'], df['count_e'], df['cap'] = zip(*df['words'].apply(extract_text_features))
101 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Surprisingly, you can get better performance by looping through each value
%%timeit
a,b,c = [], [], []
for s in df['words']:
    a.append(s[0]), b.append(s.count('e')), c.append(s.capitalize())
df['first'] = a
df['count_e'] = b
df['cap'] = c
79.1 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Create 1 million random numbers and test the powers function from above.
df = pd.DataFrame(np.random.rand(1000000), columns=['num'])
def powers(x):
    return x, x**2, x**3, x**4, x**5, x**6
%%timeit
df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
zip(*df['num'].map(powers))
1.35 s ± 83.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Assigning each column is 25x faster and very readable:
%%timeit
df['p1'] = df['num'] ** 1
df['p2'] = df['num'] ** 2
df['p3'] = df['num'] ** 3
df['p4'] = df['num'] ** 4
df['p5'] = df['num'] ** 5
df['p6'] = df['num'] ** 6
51.6 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I made a similar response with more details here on why apply is typically not the way to go.
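To check that the direct assignments and the apply-then-zip route produce the same columns, a small self-contained comparison can be run. This sketch uses 1,000 rows and a seeded generator, both chosen for this example, and makes no timing claims.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the example is repeatable
df = pd.DataFrame({"words": rng.choice(["he jumped", "she ran", "they hiked"], size=1000)})

def extract_text_features(x):
    return x[0], x.count("e"), x.capitalize()

# Approach 1: one vectorized .str call per new column.
direct = pd.DataFrame({
    "first": df["words"].str[0],
    "count_e": df["words"].str.count("e"),
    "cap": df["words"].str.capitalize(),
})

# Approach 2: apply once per row, then unzip the tuples into three sequences.
first, count_e, cap = zip(*df["words"].apply(extract_text_features))
unzipped = pd.DataFrame(
    {"first": list(first), "count_e": list(count_e), "cap": list(cap)},
    index=df.index,
)

# Compare values column by column.
same = all(direct[c].tolist() == unzipped[c].tolist() for c in direct.columns)
print(same)
```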
Upvotes: 13
Reputation: 19992
I usually do this using zip:
>>> df = pd.DataFrame([[i] for i in range(10)], columns=['num'])
>>> df
num
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
>>> def powers(x):
...     return x, x**2, x**3, x**4, x**5, x**6
>>> df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
...     zip(*df['num'].map(powers))
>>> df
num p1 p2 p3 p4 p5 p6
0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1
2 2 2 4 8 16 32 64
3 3 3 9 27 81 243 729
4 4 4 16 64 256 1024 4096
5 5 5 25 125 625 3125 15625
6 6 6 36 216 1296 7776 46656
7 7 7 49 343 2401 16807 117649
8 8 8 64 512 4096 32768 262144
9 9 9 81 729 6561 59049 531441
Upvotes: 300
Reputation: 308
I've looked at several ways of doing this, and the method shown here (returning a pandas Series) doesn't seem to be the most efficient.
If we start with a largeish dataframe of random data:
# Setup a dataframe of random numbers and create a
df = pd.DataFrame(np.random.randn(10000,3),columns=list('ABC'))
df['D'] = df.apply(lambda r: ':'.join(map(str, (r.A, r.B, r.C))), axis=1)
columns = 'new_a', 'new_b', 'new_c'
The example shown here:
# Create the dataframe by returning a series
def method_b(v):
    return pd.Series({k: v for k, v in zip(columns, v.split(':'))})
%timeit -n10 -r3 df.D.apply(method_b)
10 loops, best of 3: 2.77 s per loop
An alternative method:
# Create a dataframe from a series of tuples
def method_a(v):
    return v.split(':')
%timeit -n10 -r3 pd.DataFrame(df.D.apply(method_a).tolist(), columns=columns)
10 loops, best of 3: 8.85 ms per loop
By my reckoning it's far more efficient to take a series of tuples and then convert that to a DataFrame. I'd be interested to hear people's thinking though if there's an error in my working.
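One detail worth noting when using the list-based construction: the new DataFrame gets a default 0..n-1 index unless one is passed. The sketch below passes index=df.index before joining; that argument and the non-default index in the sample data are additions for this example, not part of the timing test above.

```python
import pandas as pd

columns = ["new_a", "new_b", "new_c"]

# Made-up data with a non-default index to show why alignment matters.
df = pd.DataFrame({"D": ["1:2:3", "4:5:6"]}, index=[10, 20])

# Build the new columns from a plain list of lists, reusing the original index.
parts = pd.DataFrame(
    df["D"].apply(lambda v: v.split(":")).tolist(),
    columns=columns,
    index=df.index,
)

# join() aligns on the index, so rows line up with the source rows.
result = df.join(parts)
print(result)
```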
Upvotes: 14
Reputation: 7022
This is what I've done in the past
df = pd.DataFrame({'textcol' : np.random.rand(5)})
df
textcol
0 0.626524
1 0.119967
2 0.803650
3 0.100880
4 0.017859
df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1}))
feature1 feature2
0 1.626524 -0.373476
1 1.119967 -0.880033
2 1.803650 -0.196350
3 1.100880 -0.899120
4 1.017859 -0.982141
Editing for completeness
pd.concat([df, df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1}))], axis=1)
textcol feature1 feature2
0 0.626524 1.626524 -0.373476
1 0.119967 1.119967 -0.880033
2 0.803650 1.803650 -0.196350
3 0.100880 1.100880 -0.899120
4 0.017859 1.017859 -0.982141
Upvotes: 95