Aggregate histogram data

Question

I have histograms of several properties for a number of unique models of some 'thing'. An experiment yields a sample that contains several of those models, possibly with repeats. I need to find the histogram of each property over the entire sample set of the experiment.

Example:

Consider a data frame df like the one below, with several ids and, for each id, several properties named prop1, prop2 and so on.

set.seed(1)
df <- data.frame(id = sample(1:5, 6, replace = TRUE),
                 prop1 = rep(c("A", "B"), 3),
                 prop2 = sample(c(TRUE, FALSE), 6, replace = TRUE),
                 prop3 = sample(3:6, 6, replace = TRUE))

> df
  id prop1 prop2 prop3
1  2     A FALSE     4
2  2     B  TRUE     4
3  3     A FALSE     6
4  1     B  TRUE     5
5  3     A FALSE     3
6  3     B FALSE     4

For each unique id, a histogram is computed for each property. The results are stored in a list l1, which holds one table per property with one row per id.

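# Create a histogram (relative frequencies) of each property, per id
df[-1] <- lapply(df[-1], as.factor)  # factors keep every level in each table
fun1 <- function(df, n) {
  # One row per id, one column per level of property n
  as.data.frame(t(sapply(split(df, df$id), function(i) prop.table(table(i[, n])))))
}
# lapply always returns a list, whereas sapply may simplify the result
l1 <- lapply(2:ncol(df), function(i) fun1(df, i))
names(l1) <- names(df[-1])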

> l1
$prop1
          A         B
1 0.0000000 1.0000000
2 0.5000000 0.5000000
3 0.6666667 0.3333333

$prop2
  FALSE TRUE
1   0.0  1.0
2   0.5  0.5
3   1.0  0.0

$prop3
          3         4 5         6
1 0.0000000 0.0000000 1 0.0000000
2 0.0000000 1.0000000 0 0.0000000
3 0.3333333 0.3333333 0 0.3333333

Below is a new set of ids from a new experiment, with repetitions. I need to compute the histogram of each property across this set of ids, using the reference data in l1.

Some ids from df may not appear in ids, and ids may contain values that are not present in the original df and l1. For example, 4 is in ids but not in l1. Such ids can be excluded from the histogram computation, but they should be captured in a data frame that lists each excluded id and its count.

ids <- sample(1:4, 7, replace = TRUE)
> ids
 [1] 2 3 1 3 3 2 4
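
The exclusion step described above could be sketched in base R roughly as follows. This is only an illustration, not a definitive solution: ids is hard-coded to the values printed above, and known_ids is a hypothetical stand-in for rownames(l1[[1]]).

```r
# ids as printed in the question; known_ids stands in for rownames(l1[[1]])
# (both are hard-coded here so the sketch runs on its own)
ids <- c(2, 3, 1, 3, 3, 2, 4)
known_ids <- c("1", "2", "3")

# Count how often each id occurs in the new experiment
counts <- table(ids)

# Split the counts into ids with reference data and ids without
is_known <- names(counts) %in% known_ids
used <- counts[is_known]
excluded <- data.frame(id = names(counts)[!is_known],
                       count = as.vector(counts[!is_known]),
                       stringsAsFactors = FALSE)
```

Here used holds the counts that act as weights later on, and excluded has one row per id without reference data.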

Update: Expected output. I'm showing it as a list, but any other data structure that is more appropriate could be used.

> l2
$prop1
      A     B
1 0.500 0.500

$prop2
    FALSE    TRUE
1   0.667  0.333

$prop3
      3     4     5     6
1 0.167 0.500 0.167 0.167

A base R solution is preferred.

Update: Clarifying how the output is computed.

Counts in ids: one 1, two 2s, three 3s and one 4. Since there is no reference data for 4, the useful ids are 1, 2 and 3, with a total count of 6 between them.

For prop1, the aggregated histogram for ids is the count-weighted average of the per-id rows in l1:

A = (1*0.0 + 2*0.5  + 3*0.6667)/6 = 0.5
B = (1*1.0 + 2*0.5  + 3*0.3333)/6 = 0.5
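
The same count-weighted average can be applied to every property at once. The sketch below is one possible base R approach, under the assumption that l1 has the shape printed above (one data frame per property, with ids as row names); l1 and ids are hard-coded from the printed output so the block runs on its own, and the names w and l2 are only illustrative.

```r
# Reference histograms as printed in the question (rows are ids 1 to 3)
l1 <- list(
  prop1 = data.frame(A = c(0, 1/2, 2/3), B = c(1, 1/2, 1/3)),
  prop2 = data.frame(`FALSE` = c(0, 1/2, 1), `TRUE` = c(1, 1/2, 0),
                     check.names = FALSE),
  prop3 = data.frame(`3` = c(0, 0, 1/3), `4` = c(0, 1, 1/3),
                     `5` = c(1, 0, 0), `6` = c(0, 0, 1/3),
                     check.names = FALSE)
)
ids <- c(2, 3, 1, 3, 3, 2, 4)

# Weights: how often each id that has reference data occurs in ids
counts <- table(ids)
w <- counts[names(counts) %in% rownames(l1[[1]])]

# Weighted average of the per-id rows, for every property
l2 <- lapply(l1, function(h) {
  m <- as.matrix(h[names(w), , drop = FALSE])  # rows in the same order as w
  colSums(m * as.vector(w)) / sum(w)           # row i is scaled by w[i]
})
```

Each element of l2 is a named numeric vector, so round(l2$prop3, 3) can be compared directly with the expected output above. Indexing the rows by names(w) keeps weights and rows aligned even if the ids are stored in a different order.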
