René
René

Reputation: 4827

Add column to a Python pandas DataFrame based on values in an other column

I would like to add a column to a pandas DataFrame based on the value in one of the other columns.

import pandas as pd
import numpy as np

Records = 100

df = pd.DataFrame (
        {'ID' : range(1, Records + 1),
         'Group' : np.random.choice(range(1, 41), Records, replace = True)
         }
        )

def Age(x):
    a = list()
    for i in x:
        if (i >= 14 and i <= 20) or (i >= 34 and i <= 40):
            a.append('65+')
        else:
            a.append('65-')
    return a

df['Age'] = Age(df.Group)

print(df.head(10))

    Group  ID  Age
0      11   1  65-
1       1   2  65-
2       6   3  65-
3      32   4  65-
4      31   5  65-
5      39   6  65+
6      26   7  65-
7      38   8  65+
8      26   9  65-
9      31  10  65-

This does the job, but I prefer using a lambda function if possible but can't get it working. Or if possible create the Age column at creation of the dataframe. Any suggestions?

Upvotes: 1

Views: 3664

Answers (3)

piRSquared
piRSquared

Reputation: 294258

Option 1
Rethink the condition.
Notice that both intervals are width 6.
Notice that the midpoint between intervals is 27.

cats = np.array(['65-', '65+'])
cond = df.Group.sub(27).abs().pipe(lambda x: x.ge(7) & x.le(13)).astype(int)
df.assign(Age=cats[cond])

   Group  ID  Age
0     11   1  65-
1      1   2  65-
2      6   3  65-
3     32   4  65-
4     31   5  65-
5     39   6  65+
6     26   7  65-
7     38   8  65+
8     26   9  65-
9     31  10  65-

We can quicken this up by using the underlying arrays

cats = np.array(['65-', '65+'])
arr1 = np.abs(df.Group.values - 27)
cond = ((arr1 >= 7) & (arr1 <= 13)).astype(int)
df.assign(Age=cats[cond])

   Group  ID  Age
0     11   1  65-
1      1   2  65-
2      6   3  65-
3     32   4  65-
4     31   5  65-
5     39   6  65+
6     26   7  65-
7     38   8  65+
8     26   9  65-
9     31  10  65-

Option 2
Use np.searchsorted
Use the integer breakpoints of [13, 20, 33, 40]. searchsorted will tell us where each Group value falls and then we slice an array of labels to give us what we want.

b = np.array([13, 20, 33, 40])
c = np.array(['65-', '65+', '65-', '65+', '65-'])
df.assign(Age=c[np.searchsorted(b, df.Group.values)])

   Group  ID  Age
0     11   1  65-
1      1   2  65-
2      6   3  65-
3     32   4  65-
4     31   5  65-
5     39   6  65+
6     26   7  65-
7     38   8  65+
8     26   9  65-
9     31  10  65-

Upvotes: 2

jezrael
jezrael

Reputation: 862661

Use numpy.where what is very fast vectorized function:

m = ((df['Group'] >= 14) & (df['Group'] <= 20)) | ((df['Group'] >= 34) & (df['Group'] <= 40))
df['new'] = np.where(m, '65+','65-')
print (df)
   Group  ID  Age  new
0     11   1  65-  65-
1      1   2  65-  65-
2      6   3  65-  65-
3     32   4  65-  65-
4     31   5  65-  65-
5     39   6  65+  65+
6     26   7  65-  65-
7     38   8  65+  65+
8     26   9  65-  65-
9     31  10  65-  65-

Timings:

Records = 1000000

In [94]: %timeit df['Age1'] = np.where((df['Group'] >= 14) & (df['Group'] <= 20) | (df['Group'] >= 34) & (df['Group'] <= 40), '65+','65-')
10 loops, best of 3: 123 ms per loop

In [95]: %timeit df['Age2'] = df['Group'].apply(lambda x: '65+' if ((x >= 14 and x <= 20) or (x >= 34 and x <= 40)) else '65-')
1 loop, best of 3: 253 ms per loop

Upvotes: 2

David Leon
David Leon

Reputation: 1017

With apply on df.Group series

Records = 100

df = pd.DataFrame (
        {'ID' : range(1, Records + 1),
         'Group' : np.random.choice(range(1, 41), Records, replace = True)
         }
        )

#Here is the apply:
df['Age'] = df['Group'].apply(lambda x: '65+' if ((x >= 14 and x <= 20) or
                                                  (x >= 34 and x <= 40)) else '65-')
print(df.head())

Result :

    Group ID  Age
0      3   1  65-
1     25   2  65-
2      6   3  65-
3     23   4  65-
4     20   5  65+
...

Upvotes: 1

Related Questions