codelearner0504
codelearner0504

Reputation: 71

Generate a new categorical variable using count()

I have a dataframe similar to this one:

col_a col_b
A         1
B         6
B         3
C         2
C         3
D         6
E         7
F         8
E         8

I want to create a column c and do a cumulative count of everything larger than 1, 2, 3, 4, 5, 6, 7, and 8.

for example, for row 1, the number is 2, so I count total number larger or equal to 2 in col b for row 2, the number is 6, so I count total number larger than or equal to 6 in col b

the returned col should be something like this 
col_c
8 (total count of col_b value that is larger or equal to 1)
5 (total count of col_b value that is larger or equal to 6)
6 (total count of col_b value that is larger or equal to 3)
7 (total count of col_b value that is larger or equal to 2)

My code looks like this :

df.loc[df['col_b'] >= 1, 'group'] =  df[df['col_b'] >=8].count() 
df.loc[df['col_b'] >= 2, 'group'] =  df[df['col_b'] >=8].count() 
df.loc[df['col_b'] >= 3, 'group'] =  df[df['col_b'] >=8].count() 
df.loc[df['col_b'] >= 4, 'group'] =  df[df['col_b'] >=8].count() 
df.loc[df['col_b'] >= 5, 'group'] =  df[df['col_b'] >=8].count() 

Is there anyway to make this easier? Also, my return is NA, instead of an actual count?

Upvotes: 1

Views: 77

Answers (3)

Inputvector
Inputvector

Reputation: 1093

You should use sum instead of count and lambda function:

df = pd.DataFrame(
{
    'col_a': ['A','B','B','C','C','D','E','F','E'],
    'col_b': [1,6,3,2,3,6,7,8,8]
})
    

df['group'] = df.apply(lambda row : (df['col_b'] >= row['col_b']).sum() ,axis = 1)

Result:


  col_a col_b group
0   A   1   9
1   B   6   5
2   B   3   7
3   C   2   8
4   C   3   7
5   D   6   5
6   E   7   3
7   F   8   2
8   E   8   2

Upvotes: 0

RJ Adriaansen
RJ Adriaansen

Reputation: 9619

df['col_c'] = df['col_b'].apply(lambda x: sum(i >= x for i in df['col_b'].tolist()))

Output:

|    | col_a   |   col_b |   col_c |
|---:|:--------|--------:|--------:|
|  0 | A       |       1 |       9 |
|  1 | B       |       6 |       5 |
|  2 | B       |       3 |       7 |
|  3 | C       |       2 |       8 |
|  4 | C       |       3 |       7 |
|  5 | D       |       6 |       5 |
|  6 | E       |       7 |       3 |
|  7 | F       |       8 |       2 |
|  8 | E       |       8 |       2 |

Upvotes: 1

Burrito
Burrito

Reputation: 1624

Looks like you what you want is essentially a rank:

df['col_c'] = df['col_b'].rank(ascending=False, method='max')

    col_a   col_b   col_c
0   A       1       9.0
1   B       6       5.0
2   B       3       7.0
3   C       2       8.0
4   C       3       7.0
5   D       6       5.0
6   E       7       3.0
7   F       8       2.0
8   E       8       2.0

You can subtract 1 to exclude the row itself.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rank.html

Upvotes: 0

Related Questions