Reputation: 4827
I would like to add a column to a pandas DataFrame based on the value in one of the other columns.
import pandas as pd
import numpy as np
Records = 100
df = pd.DataFrame (
{'ID' : range(1, Records + 1),
'Group' : np.random.choice(range(1, 41), Records, replace = True)
}
)
def Age(x):
a = list()
for i in x:
if (i >= 14 and i <= 20) or (i >= 34 and i <= 40):
a.append('65+')
else:
a.append('65-')
return a
df['Age'] = Age(df.Group)
print(df.head(10))
Group ID Age
0 11 1 65-
1 1 2 65-
2 6 3 65-
3 32 4 65-
4 31 5 65-
5 39 6 65+
6 26 7 65-
7 38 8 65+
8 26 9 65-
9 31 10 65-
This does the job, but I prefer using a lambda function if possible but can't get it working. Or if possible create the Age column at creation of the dataframe. Any suggestions?
Upvotes: 1
Views: 3664
Reputation: 294258
Option 1
Rethink the condition.
Notice that both intervals are width 6
.
Notice that the midpoint between intervals is 27
.
cats = np.array(['65-', '65+'])
cond = df.Group.sub(27).abs().pipe(lambda x: x.ge(7) & x.le(13)).astype(int)
df.assign(Age=cats[cond])
Group ID Age
0 11 1 65-
1 1 2 65-
2 6 3 65-
3 32 4 65-
4 31 5 65-
5 39 6 65+
6 26 7 65-
7 38 8 65+
8 26 9 65-
9 31 10 65-
We can quicken this up by using the underlying arrays
cats = np.array(['65-', '65+'])
arr1 = np.abs(df.Group.values - 27)
cond = ((arr1 >= 7) & (arr1 <= 13)).astype(int)
df.assign(Age=cats[cond])
Group ID Age
0 11 1 65-
1 1 2 65-
2 6 3 65-
3 32 4 65-
4 31 5 65-
5 39 6 65+
6 26 7 65-
7 38 8 65+
8 26 9 65-
9 31 10 65-
Option 2
Use np.searchsorted
Use the integer breakpoints of [13, 20, 33, 40]
. searchsorted
will tell us where each Group
value falls and then we slice an array of labels to give us what we want.
b = np.array([13, 20, 33, 40])
c = np.array(['65-', '65+', '65-', '65+', '65-'])
df.assign(Age=c[np.searchsorted(b, df.Group.values)])
Group ID Age
0 11 1 65-
1 1 2 65-
2 6 3 65-
3 32 4 65-
4 31 5 65-
5 39 6 65+
6 26 7 65-
7 38 8 65+
8 26 9 65-
9 31 10 65-
Upvotes: 2
Reputation: 862661
Use numpy.where
what is very fast vectorized function:
m = ((df['Group'] >= 14) & (df['Group'] <= 20)) | ((df['Group'] >= 34) & (df['Group'] <= 40))
df['new'] = np.where(m, '65+','65-')
print (df)
Group ID Age new
0 11 1 65- 65-
1 1 2 65- 65-
2 6 3 65- 65-
3 32 4 65- 65-
4 31 5 65- 65-
5 39 6 65+ 65+
6 26 7 65- 65-
7 38 8 65+ 65+
8 26 9 65- 65-
9 31 10 65- 65-
Timings:
Records = 1000000
In [94]: %timeit df['Age1'] = np.where((df['Group'] >= 14) & (df['Group'] <= 20) | (df['Group'] >= 34) & (df['Group'] <= 40), '65+','65-')
10 loops, best of 3: 123 ms per loop
In [95]: %timeit df['Age2'] = df['Group'].apply(lambda x: '65+' if ((x >= 14 and x <= 20) or (x >= 34 and x <= 40)) else '65-')
1 loop, best of 3: 253 ms per loop
Upvotes: 2
Reputation: 1017
With apply on df.Group
series
Records = 100
df = pd.DataFrame (
{'ID' : range(1, Records + 1),
'Group' : np.random.choice(range(1, 41), Records, replace = True)
}
)
#Here is the apply:
df['Age'] = df['Group'].apply(lambda x: '65+' if ((x >= 14 and x <= 20) or
(x >= 34 and x <= 40)) else '65-')
print(df.head())
Result :
Group ID Age
0 3 1 65-
1 25 2 65-
2 6 3 65-
3 23 4 65-
4 20 5 65+
...
Upvotes: 1