toyo10

Reputation: 131

Find 3 subsamples with (approximately) the same Gini coefficient

Let's say I have a sample of N individuals and a random variable X which represents their annual income in a foreign currency. An example of X could be the following:

15000
11000
9000
4000
4000
3900
3800
3600
3400
1000
900
800
700
700
400
300
300
300
200
100

Now I should "sample" the 20 components of X into 3 "ordered" sub-groups (not necessarily with the same number of components) so that they have (approximately) the same Gini coefficient.

As a reminder, the Gini coefficient can be computed as follows (with the incomes sorted in descending order, as above): calculate the share of each income over the total income (e.g. p1=15000/(15000+11000+...), p2=11000/(15000+11000+...), ..., p20=100/(15000+11000+...)), then the cumulative shares (c1=p1, c2=p1+p2, ..., c20=p1+...+p20=1), then the area between the cumulative curve and the diagonal, A=(c1+...+c20-0.5)/20-0.5, and finally the Gini G=2*A.
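In R, that calculation could look something like this (gini_manual is just an illustrative name; the vector is assumed to be sorted in descending order, as above):

    # Gini via cumulative income shares; assumes x is sorted in descending order
    gini_manual <- function(x) {
      p <- x / sum(x)                          # income share of each individual
      cum <- cumsum(p)                         # cumulative shares, last value is 1
      A <- (sum(cum) - 0.5) / length(x) - 0.5  # area between curve and diagonal
      2 * A                                    # Gini coefficient
    }

    gini_manual(sort(x, decreasing = TRUE))    # example call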

This can easily be done by brute force: divide the sample into 3, calculate the Gini for each of the three samples, and try moving the upper and lower components from/to the middle sample to see whether the differences in Gini improve or worsen. However, this is very time consuming to do manually (in Excel, for example), especially when I have a very big data set.

I suspect there is a more elegant solution. I'm open to both Python and R.

ADDITIONAL DETAILS: The output would be something like this. For X:

        1         2         3 
    15000      3900       400
    11000      3800       300
     9000      3600       300
     4000      3400       300
     4000      1000       200
                900       100
                800
                700
                700

for G, the actual Gini coefficient of the three subgroups

        1         2         3 
      0.4      0.41      0.39 
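For what it's worth, given a grouping vector g with values 1 to 3 (one per component of x), output in roughly that shape could be produced like this; padding the shorter groups with NA is just one way to line the columns up:

    library(ineq)
    groups <- split(x, g)                 # the three subgroups
    n_max <- max(lengths(groups))
    # pad shorter groups with NA so they can sit side by side as columns
    sapply(groups, function(v) c(v, rep(NA, n_max - length(v))))
    sapply(groups, Gini)                  # the Gini coefficient of each subgroup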

Upvotes: 1

Views: 188

Answers (2)

toyo10

Reputation: 131

It's not very polite to answer one's own question, but I think this is worth sharing. This is what I wrote in R, taking inspiration from Peter Ellis's answer below. Any comments/improvement ideas are welcome:

library(ineq)
x <-c(15000, 11000, 9000, 4000, 4000, 3900, 3800, 3600, 3400,
      1000, 900, 800, 700, 700, 400, 300, 300, 300, 200, 100)
n <- length(x)

best_sd <- 1
# Exhaustive search over the two cut points d and u, which split the
# (already sorted) vector into x[1:d], x[(d+1):u] and x[(u+1):n],
# keeping at least 2 components in each subgroup
for(d in 2:(n-4)) for(u in (d+2):(n-2)){
  g <- c(Gini(x[1:d]), Gini(x[(d+1):u]), Gini(x[(u+1):n]))
  s <- sd(g)
  if(s < best_sd){
    best_sd <- s
    best_grouping <- c(d,u)
    best_g <- g
  }
}

best_sd
#[1] 0.005250825
best_grouping
#[1]  9 11
best_g
#[1] 0.3046409 0.3144654 0.3127660
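The subgroups themselves can then be recovered from the two cut points, e.g.:

    d <- best_grouping[1]
    u <- best_grouping[2]
    split(x, rep(1:3, c(d, u - d, n - u)))   # the three contiguous subgroups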

Upvotes: 0

Peter Ellis

Reputation: 5904

OK, here's a method in R that at least automates the brute force. It tries 1,000 different random groupings of the population and picks the one for which the Gini coefficients have the lowest standard deviation. It works well, and practically instantly, with your toy dataset.

library(ineq)

x <-c(1500, 1100, 9000, 4000, 4000, 3900, 3800, 3600, 3400,
      1000, 900, 800, 700, 700, 400, 300, 300, 300, 200, 100)

Gini(x)
# 0.534

n <- length(x)


best_sd <- 1

for(i in 1:1000){
  # assign each observation at random to one of three groups
  grouping <- sample(1:3, n, replace = TRUE)
  # Gini coefficient of each of the three groups
  ginis <- tapply(x, grouping, Gini)
  s <- sd(ginis)
  if(s < best_sd){
    best_sd <- s
    best_grouping <- grouping
    best_i <- i
  }
}

best_sd
# 0.000891497

tapply(x, best_grouping, Gini)
#         1         2         3 
# 0.5052780 0.5042017 0.5035088 

It's not guaranteed to find the best grouping, but it is obviously fairly close. A more elegant solution would find ways of picking and choosing which points to swap around as it gets close, but that would probably slow it down computationally, and would certainly take much more developer time!
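For what it's worth, a minimal sketch of that swap idea (refine is a hypothetical helper, not part of the answer above) could start from best_grouping and greedily accept single-point moves that lower the spread of the three Ginis:

    # Greedy local search starting from an existing grouping: repeatedly
    # propose moving one random observation to another group and keep the
    # move if it lowers the sd of the three Gini coefficients
    refine <- function(x, grouping, iters = 1000) {
      best <- sd(tapply(x, grouping, Gini))
      for (k in seq_len(iters)) {
        cand <- grouping
        i <- sample(seq_along(x), 1)
        cand[i] <- sample(setdiff(1:3, cand[i]), 1)
        # skip proposals that would leave a group with fewer than 2 members
        if (min(table(factor(cand, levels = 1:3))) < 2) next
        s <- sd(tapply(x, cand, Gini))
        if (s < best) {
          best <- s
          grouping <- cand
        }
      }
      grouping
    }

    refined <- refine(x, best_grouping)
    sd(tapply(x, refined, Gini))   # should be no worse than best_sd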

With a larger dataset of 100,000 observations the random search still only takes 12 seconds on my laptop, so it scales up ok.

Upvotes: 1
