Tou Mou

Reputation: 1274

How to compute frequency of categorical variables based on a condition

Good afternoon,

Assume we have the following dataset from UCI:

ballons=structure(list(YELLOW = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("PURPLE", 
"YELLOW"), class = "factor"), SMALL = structure(c(2L, 2L, 2L, 
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L
), .Label = c("LARGE", "SMALL"), class = "factor"), STRETCH = structure(c(2L, 
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 
1L, 1L), .Label = c("DIP", "STRETCH"), class = "factor"), ADULT = structure(c(1L, 
2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 
1L, 2L), .Label = c("ADULT", "CHILD"), class = "factor"), T = c(TRUE, 
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, 
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)), class = "data.frame", row.names = c(NA, 
-19L))
 # output :
   YELLOW SMALL STRETCH ADULT     T
1  YELLOW SMALL STRETCH ADULT  TRUE
2  YELLOW SMALL STRETCH CHILD FALSE
3  YELLOW SMALL     DIP ADULT FALSE
4  YELLOW SMALL     DIP CHILD FALSE
5  YELLOW LARGE STRETCH ADULT  TRUE
6  YELLOW LARGE STRETCH ADULT  TRUE
7  YELLOW LARGE STRETCH CHILD FALSE
8  YELLOW LARGE     DIP ADULT FALSE
9  YELLOW LARGE     DIP CHILD FALSE
10 PURPLE SMALL STRETCH ADULT  TRUE
11 PURPLE SMALL STRETCH ADULT  TRUE
12 PURPLE SMALL STRETCH CHILD FALSE
13 PURPLE SMALL     DIP ADULT FALSE
14 PURPLE SMALL     DIP CHILD FALSE
15 PURPLE LARGE STRETCH ADULT  TRUE
16 PURPLE LARGE STRETCH ADULT  TRUE
17 PURPLE LARGE STRETCH CHILD FALSE
18 PURPLE LARGE     DIP ADULT FALSE
19 PURPLE LARGE     DIP CHILD FALSE

Assume also that I applied a clustering algorithm and got a result like the following:

clusterss=data.frame(index=1:19,class=c(1,2,3,3,3,2,3,1,2,3,3,2,2,3,2,2,1,1,2))
> clusterss
   index class
1      1     1
2      2     2
3      3     3
4      4     3
5      5     3
6      6     2
7      7     3
8      8     1
9      9     2
10    10     3
11    11     3
12    12     2
13    13     2
14    14     3
15    15     2
16    16     2
17    17     1
18    18     1
19    19     2

Here the index variable is the row number in ballons, and class is the cluster that row was assigned to.

I know that we can compute the frequencies of all categorical variables with:

> sapply(ballons,table)
       y1 y2 y3 y4 y5
PURPLE 10 10  8 11 12
YELLOW  9  9 11  8  7

However, I need to compute this for each cluster independently. That is, for each class I need to select its associated observations and then compute the frequencies on that subset. For example, with class == 1:

# Expected results for the first cluster: class == 1
library(dplyr)  # provides filter()
result1 <- filter(clusterss, class == 1)
sapply(ballons[result1[,1],],table)
       y1 y2 y3 y4 y5
PURPLE  2  3  2  3  3
YELLOW  2  1  2  1  1
# Expected results for the second cluster : class == 2
result2 <- filter(clusterss, class == 2)
sapply(ballons[result2[,1],],table)
       y1 y2 y3 y4 y5
PURPLE  5  5  3  4  5
YELLOW  3  3  5  4  3
# Expected results for the third cluster : class == 3
result3 <- filter(clusterss, class == 3)
sapply(ballons[result3[,1],],table)
       y1 y2 y3 y4 y5
PURPLE  3  2  3  4  4
YELLOW  4  5  4  3  3

I'm looking for an efficient way to obtain these results (maybe with the select function of dplyr). Thank you for your help!

Upvotes: 1

Views: 130

Answers (1)

GKi

Reputation: 39647

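You can pass a second grouping vector, here clusterss$class, to table so that each column is cross-tabulated by cluster. This works because clusterss is in the same row order as ballons. In the flattened sapply result, each consecutive pair of rows corresponds to one cluster (rows 1-2 are class 1, rows 3-4 are class 2, rows 5-6 are class 3), with the two levels of each variable in level order within the pair: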

sapply(ballons,table, clusterss$class)
#lapply(ballons,table, clusterss$class) #Alternative
#     YELLOW SMALL STRETCH ADULT T
#[1,]      2     3       2     3 3
#[2,]      2     1       2     1 1
#[3,]      5     5       3     4 5
#[4,]      3     3       5     4 3
#[5,]      3     2       3     4 4
#[6,]      4     5       4     3 3
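
If the lost row labels in the sapply matrix are a problem, one possible variation is to keep the tables in a list. The sketch below is only an illustration on a small made-up data frame (df and cl are hypothetical stand-ins for ballons and clusterss$class), and it assumes the cluster vector is in the same row order as the data. It shows two base R options: a labelled level-by-cluster table per column, and a split by cluster followed by a per-column table.

```r
# Hypothetical stand-in data, much smaller than the balloons dataset
df <- data.frame(
  colour = factor(c("YELLOW", "YELLOW", "PURPLE", "PURPLE", "PURPLE")),
  size   = factor(c("SMALL", "LARGE", "SMALL", "LARGE", "LARGE"))
)
cl <- c(1, 2, 1, 2, 2)  # assumed cluster label of each row, in row order

# Option 1: one labelled level-by-cluster table per column
tabs <- lapply(df, function(col) table(level = col, cluster = cl))

# Option 2: split the rows by cluster, then count levels within each piece
per_cluster <- lapply(split(df, cl), function(d) lapply(d, table))

tabs$colour               # rows are levels, columns are clusters
per_cluster[["2"]]$size   # counts of the size levels within cluster 2
```

Option 1 is the same cross-tabulation as above, only kept unflattened so the dimnames survive. Option 2 is closer to the per-cluster loop in the question and returns a list indexed by cluster.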

Upvotes: 5
