user3206440
user3206440

Reputation: 5049

R - histogram data in aggreagation

With a data frame like below:

set.seed(100)
df <- data.frame(id = sample(1:5, 6, replace = TRUE),
                 prop1 = rep(c("A", "B"), 3),
                 prop2 = sample(c(TRUE, FALSE), 6, replace = TRUE),
                 prop3=sample(3:6, 6, replace = TRUE))

> df
  id prop1 prop2 prop3
1  2     A FALSE     4
2  2     B  TRUE     4
3  3     A FALSE     6
4  1     B  TRUE     5
5  3     A FALSE     3
6  3     B FALSE     4

I need to do an aggregation by id such that ,for each col prop1 to propN, a histogram data is generated as follows.

For each id,

How to get the aggregation for prop1 to propN done in the above format - using base R

Update:Adding output representation.

I'm not certain about the right data type to represent the output and various components in the output. However a spreadsheet view of the output would be as follows. In realty the output desired is in a form such that it can be used as a look-up table for the distribution on an id basis for further computation.

spreadsheet view of output

Upvotes: 1

Views: 83

Answers (2)

Sotos
Sotos

Reputation: 51592

Here is an idea which uses a custom function defined as follows:

It splits the data frame based on the id and applies the formula (prop.table(table(...))) for finding the ratio. The n acts as an index so as to identify for which column you need the ratio. If n is 2 for example, then fun1 will apply the formula of finding the ratio to column 2 for each element of the list (effectively for each id). Finally, we apply the function via looping through 2:ncol(df) (so in your case 2:4) in order to get the ratio for all columns of interest, for each id.

#convert to factors to make sure you will get 0 frequencies with table as well
df[-1] <- lapply(df[-1], as.factor)

fun1 <- function(df, n){as.data.frame(t(sapply(split(df, df$id), function(i) 
                                                         prop.table(table(i[,n])))))}

data.frame(id = unique(sort(df$id)), 
           do.call(cbind, sapply(2:ncol(df), function(i)fun1(df, i))))

#   id        A         B FALSE. TRUE.        X3        X4  X5       X6
#1  1 0.0000000 1.0000000    0.0   1.0 0.0000000 0.0000000  1 0.0000000
#2  2 0.5000000 0.5000000    0.5   0.5 0.0000000 1.0000000  0 0.0000000
#3  3 0.6666667 0.3333333    1.0   0.0 0.3333333 0.3333333  0 0.3333333

Another way to structure this, would be to create a list and name each element of the list with the column names of your original df. i.e.

l1 <- sapply(2:ncol(df), function(i)fun1(df, i))
names(l1) <- names(df[-1])

#so you can extract each one separately,

l1[['prop1']]
#          A         B
#1 0.0000000 1.0000000
#2 0.5000000 0.5000000
#3 0.6666667 0.3333333

Upvotes: 2

Captain Tyler
Captain Tyler

Reputation: 594

I think you want this:

library(reshape)
df[-1] <- lapply(df[-1],as.factor)
# second, rearrange vars in a named vector
df <- melt(df,id=c("id"),variable_name = "prop")
df$prop <- as.factor(df$prop)

#third, make the histograms with ggplot2
library(ggplot2)

h <- ggplot(df,aes(x=id)) 
h + geom_bar(stat="count", aes(fill=id)) + facet_grid(~ prop + value) 

Upvotes: 1

Related Questions