Reputation: 4840
I have a list of citie names Which I want to categorize based on their frequencies. I first wanted to to use binning, but as this required monotone spacing I gave up on that. A next, maybe even better, approach would be create categories based on quantiles based on the frequencies using pandas.qcut. But having the quantiles, I have no idea how to create an additional column based on their quantile. Example:
import numpy as np
import pandas as pd
np.random.seed(0)
cities = np.random.choice(['Ontario', 'Ottawa', 'Vancouver','Edmonton',
'Winnipeg', 'Churchill'], 500)
# Create fake data and their frequencies
df = pd.DataFrame (cities, columns=['City'])
freq = df['City'].value_counts()
print (freq)
# Create quantiles
qc = pd.qcut (freq, 3)
print (qc)
# And now? I have the quantiles but how to assign a categorie to each City?
category_for_each_city = df['City'] in qc # does not work, but many other things neither
I tried many things but none of it worked. I should be able to write a loop for this but I cannot imagine that's the Python way. I tried looking for some sklearn Transformers but could not find any with this specific solution. Any help would be greatly appreciated.
Additionally, I have many skewed distributions, a solution that can be extended to for example a log transformation would be of great help.
Upvotes: 1
Views: 220
Reputation: 210832
You were almost there...
In [106]: category_for_each_city = df['City'].map(qc)
In [107]: category_for_each_city
Out[107]:
0 (77.333, 84.667]
1 (72.999, 77.333]
2 (84.667, 100.0]
3 (84.667, 100.0]
4 (84.667, 100.0]
5 (84.667, 100.0]
6 (77.333, 84.667]
...
493 (84.667, 100.0]
494 (72.999, 77.333]
495 (77.333, 84.667]
496 (84.667, 100.0]
497 (77.333, 84.667]
498 (77.333, 84.667]
499 (77.333, 84.667]
Name: City, Length: 500, dtype: category
Categories (3, interval[float64]): [(72.999, 77.333] < (77.333, 84.667] < (84.667, 100.0]]
UPDATE:
In [114]: qc = pd.qcut (freq, 3, labels=[0,1,2])
In [115]: category_for_each_city = df['City'].map(qc)
In [116]: category_for_each_city
Out[116]:
0 1
1 0
2 2
3 2
4 2
5 2
6 1
..
493 2
494 0
495 1
496 2
497 1
498 1
499 1
Name: City, Length: 500, dtype: category
Categories (3, int64): [0 < 1 < 2]
Upvotes: 1