Reputation: 131
Let's say I have a sample of N individuals and a random variable X which represents their annual income in a foreign currency. An example of X could be the following:
15000
11000
9000
4000
4000
3900
3800
3600
3400
1000
900
800
700
700
400
300
300
300
200
100
Now I need to divide the 20 components of X into 3 "ordered" sub-groups (not necessarily with the same number of components) so that they have (approximately) the same Gini coefficient.
As a reminder, here is how the Gini coefficient is calculated. With the incomes sorted in decreasing order, first calculate the share of each income in the total income (e.g. p1 = 15000/(15000 + 11000 + ... + 100), p2 = 11000/(15000 + 11000 + ... + 100), ..., p20 = 100/(15000 + 11000 + ... + 100)), then the cumulative shares (e.g. c1 = 0 + p1, c2 = p1 + p2, ..., c20 = p1 + ... + p20 = 1), then the area between the cumulative curve and the diagonal (A = (c1 + ... + c20 - 0.5)/20 - 0.5), and finally the Gini coefficient G = 2*A.
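To make the recipe concrete, here is a minimal sketch of that calculation in R (gini_cum is just an illustrative name, not a library function; it assumes nothing beyond base R):
# Gini via cumulative income shares, following the recipe above
gini_cum <- function(x){
  x <- sort(x, decreasing = TRUE)   # largest incomes first, as in the sample
  p <- x / sum(x)                   # share of each income in the total
  cum <- cumsum(p)                  # c1, c2, ..., cn (cn = 1)
  A <- (sum(cum) - 0.5) / length(x) - 0.5   # area between curve and diagonal
  2 * A
}
gini_cum(c(15000, 11000, 9000, 4000, 4000, 3900, 3800, 3600, 3400,
           1000, 900, 800, 700, 700, 400, 300, 300, 300, 200, 100))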
This can easily be done by brute force: divide the sample in 3, calculate the Gini for the three samples, and try moving the upper and lower components in and out of the middle sample to see whether the differences in Gini improve or worsen. However, this is very time consuming to do manually (in Excel, for example), especially when I have a very big data set.
I suspect there is a more elegant solution. I'm open to both Python and R.
ADDITIONAL DETAILS
The output would be something like this: for X,
1       2      3
15000   4000   400
11000   3900   300
 9000   3800   300
 4000   3600   300
        3400   200
        1000   100
         900
         800
         700
         700
and for G, the actual Gini coefficients of the three subgroups:
1      2      3
0.40   0.41   0.39
Upvotes: 1
Views: 188
Reputation: 131
It's not very polite to answer one's own question, but I think it's worth sharing. This is what I wrote in R, taking inspiration from Peter Ellis's answer below. Any comments/improvement ideas are welcome:
library(ineq)
x <- c(15000, 11000, 9000, 4000, 4000, 3900, 3800, 3600, 3400,
       1000, 900, 800, 700, 700, 400, 300, 300, 300, 200, 100)
n <- length(x)
best_sd <- 1
# d is the last index of group 1 and u the last index of group 2.
# Note the explicit parentheses: in R, 2:n-3 means (2:n)-3 and
# x[d+1:u] means x[d + (1:u)], which is not what we want here.
for(d in 2:(n - 3)) for(u in (d + 1):(n - 2)){
  g <- c(Gini(x[1:d]), Gini(x[(d + 1):u]), Gini(x[(u + 1):n]))
  s <- sd(g)
  if(s < best_sd){
    best_sd <- s
    best_grouping <- c(d, u)
    best_g <- g
  }
}
best_sd
best_grouping   # c(d, u): the last indices of groups 1 and 2
best_g          # the Gini coefficient of each of the three groups
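Note that this exhaustively checks every pair of cut points, i.e. on the order of n^2/2 contiguous splits, so it finds the best "ordered" grouping directly but will slow down noticeably on very large samples.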
Upvotes: 0
Reputation: 5904
OK here's a method in R that at least automates the brute force. It tries 1,000 different random assignments of the population to three groups and picks the one where the Gini coefficients have the lowest standard deviation. It works well and practically instantly with your toy dataset.
library(ineq)
x <- c(1500, 1100, 9000, 4000, 4000, 3900, 3800, 3600, 3400,
       1000, 900, 800, 700, 700, 400, 300, 300, 300, 200, 100)
Gini(x)
# 0.534
n <- length(x)
best_sd <- 1
for(i in 1:1000){
  # assign each observation to one of three groups at random
  grouping <- sample(1:3, n, replace = TRUE)
  ginis <- tapply(x, grouping, Gini)
  s <- sd(ginis)
  if(s < best_sd){
    best_sd <- s
    best_grouping <- grouping
    best_i <- i
  }
}
best_sd
# 0.000891497
tapply(x, best_grouping, Gini)
# 1 2 3
# 0.5052780 0.5042017 0.5035088
It's not guaranteed to be the best, but it is clearly fairly close. A more elegant solution would find ways of picking and choosing which points to swap around as it gets close (a rough sketch of that idea follows below), but that would probably slow it down computationally, and would certainly take much more developer time!
With a larger dataset of 100,000 observations it still only takes 12 seconds on my laptop, so it scales up ok.
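For what it's worth, here is a minimal sketch of that swap idea (refine and iters are made-up names; it assumes ineq is already loaded and best_grouping exists from the loop above):
# Swap refinement: starting from a grouping, repeatedly move a random
# point to a different group and keep the move whenever it lowers the
# sd of the three Gini coefficients.
refine <- function(x, grouping, iters = 1000){
  best_sd <- sd(tapply(x, grouping, Gini))
  for(i in 1:iters){
    cand <- grouping
    j <- sample(length(x), 1)                    # pick a random point
    cand[j] <- sample(setdiff(1:3, cand[j]), 1)  # move it to another group
    if(length(unique(cand)) < 3) next            # keep all three groups non-empty
    s <- sd(tapply(x, cand, Gini))
    if(s < best_sd){
      grouping <- cand
      best_sd <- s
    }
  }
  grouping
}
refined <- refine(x, best_grouping)
tapply(x, refined, Gini)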
Upvotes: 1