R - histogram data in aggreagation

Question

With a data frame like below:

set.seed(100)
df <- data.frame(id = sample(1:5, 6, replace = TRUE),
                 prop1 = rep(c("A", "B"), 3),
                 prop2 = sample(c(TRUE, FALSE), 6, replace = TRUE),
                 prop3=sample(3:6, 6, replace = TRUE))

> df
  id prop1 prop2 prop3
1  2     A FALSE     4
2  2     B  TRUE     4
3  3     A FALSE     6
4  1     B  TRUE     5
5  3     A FALSE     3
6  3     B FALSE     4

I need to do an aggregation by id such that ,for each col prop1 to propN, a histogram data is generated as follows.

For each id,

prop1 need to capture ratio of number of discrete values - "A"s , "B"s for all records with same id which can be accessed via names like prop1[["A"]] & prop1[["B"]]
prop2 need to capture ratio of number of discrete values - "TRUE"s , "FALSE"s for all records with same id which can be accessed via names like prop1[["TRUE"]] & prop1[["FALSE"]]
prop3 need to capture ratio of number of discrete values - "3, 4, 5, 6" for all records with same id which can be accessed via names like prop1[["3"]], prop1[["4"]], prop1[["5"]], prop1[["6"]]

How to get the aggregation for prop1 to propN done in the above format - using base R

Update:Adding output representation.

I'm not certain about the right data type to represent the output and various components in the output. However a spreadsheet view of the output would be as follows. In realty the output desired is in a form such that it can be used as a look-up table for the distribution on an id basis for further computation.

Sotos · Accepted Answer

Here is an idea which uses a custom function defined as follows:

It splits the data frame based on the id and applies the formula (prop.table(table(...))) for finding the ratio. The n acts as an index so as to identify for which column you need the ratio. If n is 2 for example, then fun1 will apply the formula of finding the ratio to column 2 for each element of the list (effectively for each id). Finally, we apply the function via looping through 2:ncol(df) (so in your case 2:4) in order to get the ratio for all columns of interest, for each id.

#convert to factors to make sure you will get 0 frequencies with table as well
df[-1] <- lapply(df[-1], as.factor)

fun1 <- function(df, n){as.data.frame(t(sapply(split(df, df$id), function(i) 
                                                         prop.table(table(i[,n])))))}

data.frame(id = unique(sort(df$id)), 
           do.call(cbind, sapply(2:ncol(df), function(i)fun1(df, i))))

#   id        A         B FALSE. TRUE.        X3        X4  X5       X6
#1  1 0.0000000 1.0000000    0.0   1.0 0.0000000 0.0000000  1 0.0000000
#2  2 0.5000000 0.5000000    0.5   0.5 0.0000000 1.0000000  0 0.0000000
#3  3 0.6666667 0.3333333    1.0   0.0 0.3333333 0.3333333  0 0.3333333

Another way to structure this, would be to create a list and name each element of the list with the column names of your original df. i.e.

l1 <- sapply(2:ncol(df), function(i)fun1(df, i))
names(l1) <- names(df[-1])

#so you can extract each one separately,

l1[['prop1']]
#          A         B
#1 0.0000000 1.0000000
#2 0.5000000 0.5000000
#3 0.6666667 0.3333333

R - histogram data in aggreagation

Answers (2)

Related Questions