Reputation: 1440
I have a pandas series like this:
0 1787
1 4789
2 1350
3 1476
4 0
5 747
6 307
7 147
8 221
9 -88
10 9374
11 264
12 1109
13 502
14 360
15 194
16 4073
17 2317
18 -221
20 0
21 16
22 106
29 105
30 4189
31 171
32 42
I want to create 4 one hot encoded variables that indicates which value per row is on which quartile, dividing the series into 4 quartiles. It would be something like this:
0 1787 Q1 Q2 Q3 Q4
1 4789 0 0 0 0
2 1350 0 0 0 1
3 1476 1 0 0 0
4 0 0 1 0 0
5 747 0 0 1 0
6 307 1 0 1 0
7 147 0 1 0 1
I know the numbers do not exactly match, this is is just to give a visual example of the desired output.
I have tried this:
series.quantile[0.25, 0.5, 0.75, 1]
But this only yells this four values:
0.25 67
0.50 442.5
0.75 1477.75
1.00 71188
I have also tried this:
series.apply(lambda x : series.quantile(x, 'lower'))
However this is giving the following error:
ValueError: percentiles should all be in the interval [0, 1]. Try 17.87 instead
.
What's the best way to accomplisg my goal??
Thank you very much in advance
Upvotes: 1
Views: 1544
Reputation: 3902
The following code featuring pandas.qcut and pandas.get_dummies should do
quantiles = pd.qcut(series,
[0, 0.25, 0.5, 0.75, 1],
labels=['Q1', 'Q2', 'Q3', 'Q4'])
dummies = pd.get_dummies(quantiles)
pd.concat([df, dummies], axis=1)
resuling in
Series Q1 Q2 Q3 Q4
0 1787 0 0 0 1
1 4789 0 0 0 1
2 1350 0 0 1 0
3 1476 0 0 0 1
4 0 1 0 0 0
5 747 0 0 1 0
6 307 0 0 1 0
7 147 0 1 0 0
8 221 0 1 0 0
9 -88 1 0 0 0
10 9374 0 0 0 1
11 264 0 1 0 0
12 1109 0 0 1 0
13 502 0 0 1 0
14 360 0 0 1 0
15 194 0 1 0 0
16 4073 0 0 0 1
17 2317 0 0 0 1
18 -221 1 0 0 0
20 0 1 0 0 0
21 16 1 0 0 0
22 106 0 1 0 0
29 105 1 0 0 0
30 4189 0 0 0 1
31 171 0 1 0 0
32 42 1 0 0 0
Upvotes: 1
Reputation: 2239
I think you can try this.
df = pd.DataFrame({'Series': series})
quantiles = df['Series'].quantile([0, 0.25, 0.5, 0.75, 1]).to_frame('quantiles').reset_index(drop = True)
for quant, Q in enumerate(['Q1', 'Q2', 'Q3', 'Q4']):
quant = quant + 1
df.loc[:,Q] = np.where((df.Series > quantiles.quantiles[quant-1]) & (df.Series <= quantiles.quantiles[quant]), 1, 0)
that should give you this:
Series Q1 Q2 Q3 Q4
0 1787 0 0 0 1
1 4789 0 0 0 1
2 1350 0 0 1 0
3 1476 0 0 0 1
4 0 1 0 0 0
5 747 0 0 1 0
6 307 0 0 1 0
7 147 0 1 0 0
8 221 0 1 0 0
9 -88 1 0 0 0
10 9374 0 0 0 1
11 264 0 1 0 0
12 1109 0 0 1 0
13 502 0 0 1 0
14 360 0 0 1 0
15 194 0 1 0 0
16 4073 0 0 0 1
17 2317 0 0 0 1
18 -221 0 0 0 0
19 0 1 0 0 0
20 16 1 0 0 0
21 106 0 1 0 0
22 105 1 0 0 0
23 4189 0 0 0 1
24 171 0 1 0 0
25 42 1 0 0 0
Upvotes: 1