Reputation: 393
I am trying to bin categorical columns programmatically - any idea on how I can achieve this without manually hard-coding each value in that column?
Essentially, what I would like is a function that keeps a city name as is while the cumulative count is within 80% of the column, and replaces the remaining 20% of city names with the word 'Other'.
I.e.: if the first 17 city names make up 80% of that column, keep the city name as is, else return 'Other'.
E.g.:
0 Brighton
1 Yokohama
2 Levin
3 Melbourne
4 Coffeyville
5 Whakatane
6 Melbourne
7 Melbourne
8 Levin
9 Ashburn
10 Te Awamutu
11 Bishkek
12 Melbourne
13 Whanganui
14 Coffeyville
15 New York
16 Brisbane
17 Greymouth
18 Brisbane
19 Chuo City
20 Accra
21 Levin
22 Waiouru
23 Brisbane
24 New York
25 Chuo City
26 Lucerne
27 Whanganui
28 Los Angeles
29 Melbourne
df['city'].head(30).value_counts(ascending=False, normalize=True)*100
Melbourne 16.666667
Levin 10.000000
Brisbane 10.000000
Whanganui 6.666667
Coffeyville 6.666667
New York 6.666667
Chuo City 6.666667
Waiouru 3.333333
Greymouth 3.333333
Te Awamutu 3.333333
Bishkek 3.333333
Lucerne 3.333333
Ashburn 3.333333
Yokohama 3.333333
Whakatane 3.333333
Accra 3.333333
Brighton 3.333333
Los Angeles 3.333333
From Ashburn down, everything should be renamed to 'Other'.
I have tried the below, which is a start, but not exactly what I want:
city_map = dict(df['city'].value_counts(ascending=False, normalize=True) * 100)
df['city_count'] = df['city'].map(city_map)

def count(df):
    if df["city_count"] > 10:
        return "High"
    elif df["city_count"] < 0:
        return "Medium"
    else:
        return "Low"

df.apply(count, axis=1)
I'm not expecting any code - just some guidance on where to start or ideas on how I can achieve this.
Upvotes: 0
Views: 453
Reputation: 42926
We can groupby on city and get the size of each city group. We divide those values by the length of our dataframe with len and calculate the cumsum. The last step is to check from which point we exceed the threshold, so we can broadcast the boolean series back to your dataframe with map.
import numpy as np

threshold = 0.7
# keep cities whose cumulative share of rows stays within the threshold
m = df['city'].map(df.groupby('city')['city'].size().sort_values(ascending=False).div(len(df)).cumsum().le(threshold))
df['city'] = np.where(m, df['city'], 'Other')
city
0 Other
1 Other
2 Levin
3 Melbourne
4 Coffeyville
5 Other
6 Melbourne
7 Melbourne
8 Levin
9 Ashburn
10 Other
11 Bishkek
12 Melbourne
13 Other
14 Coffeyville
15 New York
16 Brisbane
17 Other
18 Brisbane
19 Chuo City
20 Other
21 Levin
22 Other
23 Brisbane
24 New York
25 Chuo City
26 Other
27 Other
28 Other
29 Melbourne
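To see what the chained expression is doing, here is a minimal sketch that breaks it into named intermediates. The tiny df, the variable names shares and keep, and the threshold value are assumptions for illustration only, not part of the original answer:

import numpy as np
import pandas as pd

# small assumed frame, just for illustration
df = pd.DataFrame({'city': ['Melbourne', 'Melbourne', 'Melbourne', 'Levin',
                            'Levin', 'Brisbane', 'New York', 'Accra',
                            'Lucerne', 'Bishkek']})
threshold = 0.7

# count rows per city, most frequent first, convert to shares, then running total
shares = (df.groupby('city')['city'].size()
            .sort_values(ascending=False)
            .div(len(df))
            .cumsum())

# cities whose cumulative share is still within the threshold
keep = shares.le(threshold)

# broadcast back to every row and replace the rest with 'Other'
df['city'] = np.where(df['city'].map(keep), df['city'], 'Other')

Each intermediate can be inspected on its own. Note that cities with equal counts end up in an arbitrary order after sort_values, so which of the tied cities lands just inside or just outside the threshold can vary.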
Old method
If I understand you correctly, you want to calculate a cumulative sum with .cumsum and check when it exceeds your set threshold. Then we use np.where to conditionally fill in the City name or 'Other'.
threshold = 80
m = df['Normalized'].cumsum().le(threshold)
df['City'] = np.where(m, df['City'], 'Other')
City Normalized
0 Auckland 40.399513
1 Christchurch 13.130783
2 Wellington 12.267604
3 Hamilton 4.026242
4 Tauranga 3.867353
5 (not set) 3.540075
6 Dunedin 2.044508
7 Other 1.717975
8 Other 1.632849
9 Other 1.520342
10 Other 1.255651
11 Other 1.173878
12 Other 1.040508
13 Other 0.988166
14 Other 0.880502
15 Other 0.766877
16 Other 0.601468
17 Other 0.539067
18 Other 0.471824
19 Other 0.440903
20 Other 0.440344
21 Other 0.405884
22 Other 0.365836
23 Other 0.321131
24 Other 0.306602
25 Other 0.280524
26 Other 0.237123
27 Other 0.207878
28 Other 0.186084
29 Other 0.167085
30 Other 0.163732
31 Other 0.154977
Note: this method assumes that your Normalized column is sorted in descending order.
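If the Normalized column might not be sorted, a quick sort beforehand keeps the cumulative sum meaningful; a minimal sketch, assuming the same City / Normalized frame as above:

# sort by the normalized share, largest first, so the running total is monotone
df = df.sort_values('Normalized', ascending=False).reset_index(drop=True)

threshold = 80
m = df['Normalized'].cumsum().le(threshold)
df['City'] = np.where(m, df['City'], 'Other')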
Upvotes: 3