Reputation: 5049
With a data frame like below:
set.seed(100)
df <- data.frame(id = sample(1:5, 6, replace = TRUE),
prop1 = rep(c("A", "B"), 3),
prop2 = sample(c(TRUE, FALSE), 6, replace = TRUE),
prop3=sample(3:6, 6, replace = TRUE))
> df
id prop1 prop2 prop3
1 2 A FALSE 4
2 2 B TRUE 4
3 3 A FALSE 6
4 1 B TRUE 5
5 3 A FALSE 3
6 3 B FALSE 4
I need to do an aggregation by id
such that ,for each col prop1
to propN
, a histogram data is generated as follows.
For each id
,
prop1
need to capture ratio of number of discrete values - "A"
s , "B"
s for all records with same id
which can be accessed via names like prop1[["A"]]
& prop1[["B"]]
prop2
need to capture ratio of number of discrete values - "TRUE"
s , "FALSE"
s for all records with same id
which can be accessed via names like prop1[["TRUE"]]
& prop1[["FALSE"]]
prop3
need to capture ratio of number of discrete values - "3, 4, 5, 6"
for all records with same id
which can be accessed via names like prop1[["3"]]
, prop1[["4"]]
, prop1[["5"]]
, prop1[["6"]]
How to get the aggregation for prop1
to propN
done in the above format - using base R
Update:Adding output representation.
I'm not certain about the right data type to represent the output and various components in the output. However a spreadsheet view of the output would be as follows. In realty the output desired is in a form such that it can be used as a look-up table for the distribution on an id
basis for further computation.
Upvotes: 1
Views: 83
Reputation: 51592
Here is an idea which uses a custom function defined as follows:
It splits the data frame based on the id
and applies the formula (prop.table(table(...))
) for finding the ratio. The n
acts as an index so as to identify for which column you need the ratio. If n
is 2 for example, then fun1
will apply the formula of finding the ratio to column 2 for each element of the list (effectively for each id
). Finally, we apply the function via looping through 2:ncol(df)
(so in your case 2:4) in order to get the ratio for all columns of interest, for each id.
#convert to factors to make sure you will get 0 frequencies with table as well
df[-1] <- lapply(df[-1], as.factor)
fun1 <- function(df, n){as.data.frame(t(sapply(split(df, df$id), function(i)
prop.table(table(i[,n])))))}
data.frame(id = unique(sort(df$id)),
do.call(cbind, sapply(2:ncol(df), function(i)fun1(df, i))))
# id A B FALSE. TRUE. X3 X4 X5 X6
#1 1 0.0000000 1.0000000 0.0 1.0 0.0000000 0.0000000 1 0.0000000
#2 2 0.5000000 0.5000000 0.5 0.5 0.0000000 1.0000000 0 0.0000000
#3 3 0.6666667 0.3333333 1.0 0.0 0.3333333 0.3333333 0 0.3333333
Another way to structure this, would be to create a list and name each element of the list with the column names of your original df. i.e.
l1 <- sapply(2:ncol(df), function(i)fun1(df, i))
names(l1) <- names(df[-1])
#so you can extract each one separately,
l1[['prop1']]
# A B
#1 0.0000000 1.0000000
#2 0.5000000 0.5000000
#3 0.6666667 0.3333333
Upvotes: 2
Reputation: 594
I think you want this:
library(reshape)
df[-1] <- lapply(df[-1],as.factor)
# second, rearrange vars in a named vector
df <- melt(df,id=c("id"),variable_name = "prop")
df$prop <- as.factor(df$prop)
#third, make the histograms with ggplot2
library(ggplot2)
h <- ggplot(df,aes(x=id))
h + geom_bar(stat="count", aes(fill=id)) + facet_grid(~ prop + value)
Upvotes: 1