Reputation: 1274
Good afternoon,
Assume we have the following dataset from UCI:
ballons=structure(list(YELLOW = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("PURPLE",
"YELLOW"), class = "factor"), SMALL = structure(c(2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L
), .Label = c("LARGE", "SMALL"), class = "factor"), STRETCH = structure(c(2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L,
1L, 1L), .Label = c("DIP", "STRETCH"), class = "factor"), ADULT = structure(c(1L,
2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L,
1L, 2L), .Label = c("ADULT", "CHILD"), class = "factor"), T = c(TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)), class = "data.frame", row.names = c(NA,
-19L))
# output :
YELLOW SMALL STRETCH ADULT T
1 YELLOW SMALL STRETCH ADULT TRUE
2 YELLOW SMALL STRETCH CHILD FALSE
3 YELLOW SMALL DIP ADULT FALSE
4 YELLOW SMALL DIP CHILD FALSE
5 YELLOW LARGE STRETCH ADULT TRUE
6 YELLOW LARGE STRETCH ADULT TRUE
7 YELLOW LARGE STRETCH CHILD FALSE
8 YELLOW LARGE DIP ADULT FALSE
9 YELLOW LARGE DIP CHILD FALSE
10 PURPLE SMALL STRETCH ADULT TRUE
11 PURPLE SMALL STRETCH ADULT TRUE
12 PURPLE SMALL STRETCH CHILD FALSE
13 PURPLE SMALL DIP ADULT FALSE
14 PURPLE SMALL DIP CHILD FALSE
15 PURPLE LARGE STRETCH ADULT TRUE
16 PURPLE LARGE STRETCH ADULT TRUE
17 PURPLE LARGE STRETCH CHILD FALSE
18 PURPLE LARGE DIP ADULT FALSE
19 PURPLE LARGE DIP CHILD FALSE
Assume also that I applied a clustering algorithm and got a result like the following:
clusterss=data.frame(index=1:19,class=c(1,2,3,3,3,2,3,1,2,3,3,2,2,3,2,2,1,1,2))
> clusterss
index class
1 1 1
2 2 2
3 3 3
4 4 3
5 5 3
6 6 2
7 7 3
8 8 1
9 9 2
10 10 3
11 11 3
12 12 2
13 13 2
14 14 3
15 15 2
16 16 2
17 17 1
18 18 1
19 19 2
Here the index variable identifies the rows of ballons, and class is the cluster that each ballons row belongs to.
I know that we can compute the frequencies of all categorical variables with:
> sapply(ballons,table)
y1 y2 y3 y4 y5
PURPLE 10 10 8 11 12
YELLOW 9 9 11 8 7
However, I need to compute this for each cluster independently. That is, for each class I need to select its associated observations and then compute the frequencies. For example, with class == 1:
# Expected results for the first cluster : class == 1
library(dplyr)  # provides filter()
result1 <- filter(clusterss, class == 1)
sapply(ballons[result1[,1],],table)
y1 y2 y3 y4 y5
PURPLE 2 3 2 3 3
YELLOW 2 1 2 1 1
# Expected results for the second cluster : class == 2
result2 <- filter(clusterss, class == 2)
sapply(ballons[result2[,1],],table)
y1 y2 y3 y4 y5
PURPLE 5 5 3 4 5
YELLOW 3 3 5 4 3
# Expected results for the third cluster : class == 3
result3 <- filter(clusterss, class == 3)
sapply(ballons[result3[,1],],table)
y1 y2 y3 y4 y5
PURPLE 3 2 3 4 4
YELLOW 4 5 4 3 3
I'm looking for an efficient way to obtain these results (maybe with the select function of dplyr).
Thank you for your help!
Upvotes: 1
Views: 130
Reputation: 39647
You can pass an additional grouping vector, here clusterss$class, to table, which then cross-tabulates each column against the cluster:
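sapply(ballons, table, clusterss$class)
# lapply(ballons, table, clusterss$class)  # alternative: keeps each 2 x 3 table with its labels
# With sapply, each 2 x 3 table is flattened column-wise: rows 1-2 are cluster 1, rows 3-4 cluster 2, rows 5-6 cluster 3.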
# YELLOW SMALL STRETCH ADULT T
#[1,] 2 3 2 3 3
#[2,] 2 1 2 1 1
#[3,] 5 5 3 4 5
#[4,] 3 3 5 4 3
#[5,] 3 2 3 4 4
#[6,] 4 5 4 3 3
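If you want one matrix per cluster, laid out like the expected results in the question, one possible sketch in base R is to split the rows by cluster and tabulate each piece. The data below is a compact rebuild of the question's dput() output, and the names by_variable and by_cluster are just illustrative:

```r
# Compact rebuild of the question's data (same values as the dput() output)
ballons <- data.frame(
  YELLOW  = factor(rep(c("YELLOW", "PURPLE"), c(9, 10))),
  SMALL   = factor(rep(c("SMALL", "LARGE", "SMALL", "LARGE"), c(4, 5, 5, 5))),
  STRETCH = factor(rep(rep(c("STRETCH", "DIP"), 4), c(2, 2, 3, 2, 3, 2, 3, 2))),
  ADULT   = factor(c("ADULT", "CHILD", "ADULT", "CHILD",
                     rep(c("ADULT", "ADULT", "CHILD", "ADULT", "CHILD"), 3))),
  T       = c(TRUE, FALSE, FALSE, FALSE,
              rep(c(TRUE, TRUE, FALSE, FALSE, FALSE), 3))
)
clusterss <- data.frame(index = 1:19,
                        class = c(1,2,3,3,3,2,3,1,2,3,3,2,2,3,2,2,1,1,2))

# One labelled table per variable (rows: levels, columns: clusters)
by_variable <- lapply(ballons, table, clusterss$class)
by_variable$YELLOW

# One matrix per cluster: reorder rows by index, split by cluster, tabulate each piece
by_cluster <- lapply(split(ballons[clusterss$index, ], clusterss$class),
                     function(d) sapply(d, table))
by_cluster[["1"]]
```

As with sapply(ballons, table), the row names of each matrix come from the first column only. Note also that T is logical rather than a factor, so a cluster containing only TRUE or only FALSE would yield a shorter table for T and sapply would return a list instead of a matrix; converting T to a factor first should avoid that.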
Upvotes: 5