Piyush Shah

Reputation: 321

Calculating quintile-based scores in R

I have a data frame with year (2006 to 2010), 4 industry sectors, 150 firm names and the net income of these firms. In total I have 750 observations, one for each firm for each year. I want to score firms on their income within each industry-year, based on quintiles. So, firms with income in the top 20% within each industry-year get a score of 5, the next 20% get a score of 4 and so on. The bottom 20% get a score of 1.

A sample of the data:

Year Industry Firm Income
2006 Chemicals ABC 334.50
2007 Chemicals ABC 388.98
.
.
2006 Pharma XYZ 91.45
.
.

How do I do this in R? I have tried aggregate and tapply along with quantile but am not able to arrive at the logic that should be used for this. Please help.

I tried this just to allocate a score of 1 to the lowest 20%, but it returned an error.

db10$score <- ifelse(db10$income < aggregate(income~Year+industry,db10,quantile,c(0.2)),1,0)

Upvotes: 3

Views: 3188

Answers (1)

R. Schifini

Reputation: 9313

Try this method:

First, I'll create some sample data to test the function on:

y = c(rep(2001,15),rep(2002,15),rep(2003,15))
ind = c("A","B","C","D","E","G","H","I","J","K","L","M","N","O","P")
val = runif(45,10,100)
df = data.frame(y,ind,val)

head(df,20)

      y ind      val
1  2001   A 63.32011
2  2001   B 85.67976
3  2001   C 86.77527
4  2001   D 32.18319
5  2001   E 49.86626
6  2001   G 57.73214
7  2001   H 18.08216
8  2001   I 22.31012
9  2001   J 44.11174
10 2001   K 54.76902
11 2001   L 41.82495
12 2001   M 64.84514
13 2001   N 59.16529
14 2001   O 61.28870
15 2001   P 84.76561
16 2002   A 83.68185
17 2002   B 45.01354
18 2002   C 62.22964
19 2002   D 98.41717
20 2002   E 19.91548

There are 3 years and 15 industries labelled A to P (F is skipped). The data frame is ordered by year and then by industry.

The function below takes a year value y and calculates the quintile category for every df$val whose year df$y equals y:

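quintile = function(y) {
    # values (val) for the requested year only
    x = df$val[df$y == y]
    # six break points: min, 20%, 40%, 60%, 80%, max
    qn = quantile(x, probs = (0:5)/5)
    # bin each value into 1 (bottom 20%) to 5 (top 20%) and return the bins
    as.numeric(cut(x, qn, include.lowest = TRUE))
}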

The only thing left is to apply this function to each unique year value and concatenate the results (this relies on df being ordered by year):

df$qn = unlist(lapply(unique(df$y), quintile))

Result:

> head(df,20)
      y ind      val qn
1  2001   A 63.32011  4
2  2001   B 85.67976  5
3  2001   C 86.77527  5
4  2001   D 32.18319  1
5  2001   E 49.86626  2
6  2001   G 57.73214  3
7  2001   H 18.08216  1
8  2001   I 22.31012  1
9  2001   J 44.11174  2
10 2001   K 54.76902  3
11 2001   L 41.82495  2
12 2001   M 64.84514  4
13 2001   N 59.16529  3
14 2001   O 61.28870  4
15 2001   P 84.76561  5
16 2002   A 83.68185  4
17 2002   B 45.01354  1
18 2002   C 62.22964  3
19 2002   D 98.41717  5
20 2002   E 19.91548  1

Maybe there is a much simpler way to implement this...
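
One possibly simpler route, offered as a sketch rather than a definitive solution, is base R's ave(), which applies a function within each group and returns the results in the original row order, so the data frame does not need to be sorted first. The helper name quintile_score, the seed and the sample data here are my own assumptions, not part of the question:

```r
set.seed(1)  # hypothetical seed, only to make the sample reproducible

# Hypothetical sample: 3 years, 15 industries per year
df <- data.frame(
  y   = rep(2001:2003, each = 15),
  ind = rep(LETTERS[1:15], times = 3),
  val = runif(45, 10, 100)
)

# Quintile category of each element of a numeric vector
# (1 = bottom 20%, 5 = top 20%)
quintile_score <- function(x) {
  qn <- quantile(x, probs = (0:5) / 5)
  as.numeric(cut(x, qn, include.lowest = TRUE))
}

# ave() splits val by year, applies quintile_score to each piece,
# and puts the results back in the original row positions
df$qn <- ave(df$val, df$y, FUN = quintile_score)

head(df, 20)
```

ave() accepts several grouping vectors, so a second one (for example an industry group) could be passed in the same way before FUN.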

Grouping by two columns

If you want to calculate quintiles within groups defined by two columns, y and grp:

y = c(rep(2001,15),rep(2002,15),rep(2003,15))
grp = c("G1","G1","G1","G1","G1","G2","G2","G2","G2","G2","G3","G3","G3","G3","G3")
ind = c("A","B","C","D","E","G","H","I","J","K","L","M","N","O","P")
val = round(runif(45,10,100))
df = data.frame(y,grp,ind,val)

> head(df,20)
      y grp ind val
1  2001  G1   A  40
2  2001  G1   B  33
3  2001  G1   C  65
4  2001  G1   D  99
5  2001  G1   E  18
6  2001  G2   G  36
7  2001  G2   H  15
8  2001  G2   I  17
9  2001  G2   J  42
10 2001  G2   K  67
11 2001  G3   L  60
12 2001  G3   M  34
13 2001  G3   N  61
14 2001  G3   O  76
15 2001  G3   P  15
16 2002  G1   A  18
17 2002  G1   B  15
18 2002  G1   C  44
19 2002  G1   D  79
20 2002  G1   E  22

Then use:

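quintile = function(z) {
    # z[1] is the year and z[2] the group, in the order given to apply below
    x = df$val[df$y == z[1] & df$grp == z[2]]
    qn = quantile(x, probs = (0:5)/5)
    # return the quintile category (1 to 5) of each value in the group
    as.numeric(cut(x, qn, include.lowest = TRUE))
}

# unlist keeps this working when the groups have different sizes
df$qn = as.vector(unlist(apply(unique(df[,c("y","grp")]),1, quintile)))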

Result:

> head(df,20)
      y grp ind val qn
1  2001  G1   A  40  3
2  2001  G1   B  33  2
3  2001  G1   C  65  4
4  2001  G1   D  99  5
5  2001  G1   E  18  1
6  2001  G2   G  36  3
7  2001  G2   H  15  1
8  2001  G2   I  17  2
9  2001  G2   J  42  4
10 2001  G2   K  67  5
11 2001  G3   L  60  3
12 2001  G3   M  34  2
13 2001  G3   N  61  4
14 2001  G3   O  76  5
15 2001  G3   P  15  1
16 2002  G1   A  18  2
17 2002  G1   B  15  1
18 2002  G1   C  44  4
19 2002  G1   D  79  5
20 2002  G1   E  22  3

In this example, y would be the year, grp the industry group, ind the firms and val the income.

Pay attention to the order of c("y","grp") inside the apply call and to the column names inside the quintile function: z[1] and z[2] follow that order. You'll have to replace them with the column names you want.

Be warned that if your groups are small (in this example 5 firms per group) and contain repeated values, the quintile break points may not be unique, and cut will then stop with a "'breaks' are not unique" error.
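
To illustrate the warning above, here is a small sketch with a made-up group of five values, four of them identical. It first shows that cut() fails, then tries a rank-based substitute. Note that this is a different technique from quantile() plus cut(), the name rank_quintile is hypothetical, and the choice of ties.method = "min" (tied firms share the lower score) is an assumption you may want to change:

```r
x <- c(10, 10, 10, 10, 50)   # hypothetical small group with repeated values

# The first four break points all equal 10, so cut() refuses to bin
qn  <- quantile(x, probs = (0:5) / 5)
res <- try(cut(x, qn, include.lowest = TRUE), silent = TRUE)
inherits(res, "try-error")   # TRUE: "'breaks' are not unique"

# Rank-based substitute: map each rank onto 1..5 by its position in the group
rank_quintile <- function(x) {
  ceiling(5 * rank(x, ties.method = "min") / length(x))
}

rank_quintile(x)                       # 1 1 1 1 5
rank_quintile(c(40, 33, 65, 99, 18))   # 3 2 4 5 1
```

With distinct values, as in the second call, this gives the same scores as the quantile-based version for that group. With ties the buckets are no longer equally sized, which may or may not be what you want.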

Using column names from question

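quintile = function(z) {
    # z[1] is the Year and z[2] the Industry, in the order given to apply below
    x = df$Income[df$Year == z[1] & df$Industry == z[2]]
    qn = quantile(x, probs = (0:5)/5)
    # return the quintile score (1 to 5) of each firm in this industry-year
    as.numeric(cut(x, qn, include.lowest = TRUE))
}

# unlist keeps this working when industries have different numbers of firms
df$qn = as.vector(unlist(apply(unique(df[,c("Year","Industry")]),1, quintile)))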

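Before applying this, the data frame df must be ordered by Year and Industry, because the scores are computed one group at a time and then assigned back to the rows in that same order.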
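
A minimal sketch of that ordering step, using invented data with the question's column names (the firm names, the two industries, the seed and the incomes are all hypothetical):

```r
set.seed(42)  # hypothetical seed, only to make the sample reproducible

# Hypothetical data: 2 years, 2 industries, 10 firms per industry
df <- data.frame(
  Year     = rep(2006:2007, each = 20),
  Industry = rep(rep(c("Chemicals", "Pharma"), each = 10), times = 2),
  Firm     = rep(paste0("F", 1:20), times = 2),
  Income   = runif(40, 10, 100)
)
df <- df[sample(nrow(df)), ]   # shuffle rows, as real data may arrive unsorted

# Sort so that the rows of each Year/Industry group sit together
df <- df[order(df$Year, df$Industry), ]

quintile <- function(z) {
  x  <- df$Income[df$Year == z[1] & df$Industry == z[2]]
  qn <- quantile(x, probs = (0:5) / 5)
  as.numeric(cut(x, qn, include.lowest = TRUE))
}

# Groups are visited in the same order as the sorted rows
df$qn <- as.vector(unlist(apply(unique(df[, c("Year", "Industry")]), 1, quintile)))

head(df, 10)
```

Skipping the order() line on shuffled data would still fill the qn column without any error, but the scores would land on the wrong rows, so it is worth checking a few groups by eye.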

Upvotes: 2
