Reputation: 5049
I have histograms for a number of properties of different unique models of some 'thing'. When I run an experiment I find several of those models, possibly with repetitions. I need to find the histogram of each property over the entire sample set of the experiment.
Example:
With a data frame df like the one below, there are a bunch of ids, and for each id there are a bunch of properties named prop1, prop2 and so on.
set.seed(1)
df <- data.frame(id = sample(1:5, 6, replace = TRUE),
                 prop1 = rep(c("A", "B"), 3),
                 prop2 = sample(c(TRUE, FALSE), 6, replace = TRUE),
                 prop3 = sample(3:6, 6, replace = TRUE))
> df
id prop1 prop2 prop3
1 2 A FALSE 4
2 2 B TRUE 4
3 3 A FALSE 6
4 1 B TRUE 5
5 3 A FALSE 3
6 3 B FALSE 4
For each unique id a histogram (relative frequencies) is computed for each property, and the result is stored in a list l1, which holds the histogram of each property on a per-id basis.
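# Create histogram (relative frequencies) for each property, per id
df[-1] <- lapply(df[-1], as.factor)
fun1 <- function(df, n) {
  # one row per id, one column per level of property n
  as.data.frame(t(sapply(split(df, df$id), function(i)
    prop.table(table(i[, n])))))
}
# lapply always returns a list; sapply could simplify the result to a matrix
l1 <- lapply(2:ncol(df), function(i) fun1(df, i))
names(l1) <- names(df[-1])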
> l1
$prop1
A B
1 0.0000000 1.0000000
2 0.5000000 0.5000000
3 0.6666667 0.3333333
$prop2
FALSE TRUE
1 0.0 1.0
2 0.5 0.5
3 1.0 0.0
$prop3
3 4 5 6
1 0.0000000 0.0000000 1 0.0000000
2 0.0000000 1.0000000 0 0.0000000
3 0.3333333 0.3333333 0 0.3333333
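For comparison, here is a minimal base R sketch of the same per-id histograms built with a two-way table() and prop.table(margin = 1). The data frame is typed in from the printed df above rather than regenerated with set.seed(), and the name hist_by_id is only an illustrative choice, not something from the original post.

```r
# Data copied by hand from the printed df above (not regenerated from the seed)
df <- data.frame(
  id    = c(2, 2, 3, 1, 3, 3),
  prop1 = c("A", "B", "A", "B", "A", "B"),
  prop2 = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE),
  prop3 = c(4, 4, 6, 5, 3, 4)
)

# For each property: cross-tabulate id x level, then normalise every row to sum to 1
hist_by_id <- lapply(df[-1], function(p) prop.table(table(df$id, p), margin = 1))

hist_by_id$prop1   # rows are ids, columns are the levels A and B
```

Each element is a table (matrix-like) rather than a data frame, so as.data.frame.matrix() can be applied if the data frame shape of l1 is needed.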
Below is a new set of ids, called ids, from a new experiment, with repetitions. I need to compute the histogram of each property across this set of ids, using the reference data from l1.
Some ids from df may be absent from ids, and ids may contain values that are not present in the original df and l1 (for example, 4 is in ids but not in l1). Such ids can be excluded from the histogram computation, but they should be captured in a data frame that lists each excluded id and its count.
ids <- sample(1:4, 7, replace = TRUE)
> ids
[1] 2 3 1 3 3 2 4
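The split into usable and excluded ids described above could be sketched in base R roughly as follows. The vector ids is typed in from the printed output, and ref_ids, is_known, id_counts and excluded are hypothetical names chosen for this illustration; ref_ids stands for the ids that have a row in l1.

```r
ids <- c(2, 3, 1, 3, 3, 2, 4)   # copied from the printed ids above
ref_ids <- c(1, 2, 3)           # assumed: the ids that have reference histograms in l1

is_known <- ids %in% ref_ids

# How often each reference id occurs in the new experiment (zero counts are kept)
id_counts <- table(factor(ids[is_known], levels = ref_ids))

# Ids without reference data, with the number of times each was seen
excluded <- as.data.frame(table(id = ids[!is_known]), responseName = "count")
excluded
```

With the ids shown above, excluded should contain a single row for id 4 with a count of 1.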
Update: Expected output. I'm showing it as a list; any other more appropriate data structure could be used.
> l2
$prop1
A B
1 0.500 0.500
$prop2
FALSE TRUE
1 0.667 0.333
$prop3
3 4 5 6
1 0.167 0.500 0.167 0.167
A base R solution is preferred.
Update: Clarifying how the output is computed.
Counts in ids: one 1, two 2, three 3 and one 4. Since we do not have any data for 4, the useful ids are 1, 2 and 3, with a total count of 6 between them.
For prop1, the aggregated histogram for ids can be computed as follows:
A = (1*0.0 + 2*0.5 + 3*0.6667)/6 = 0.5
B = (1*1.0 + 2*0.5 + 3*0.3333)/6 = 0.5
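The same count-weighted average can be applied to every property at once. Below is a minimal base R sketch under a few assumptions: the per-id histograms are typed in from the printed l1 as plain matrices, ids is typed in from the printed vector, and aggregate_hist is a hypothetical helper name introduced only for this example.

```r
# Per-id histograms copied from the printed l1 (rows = ids, columns = levels)
ids_ref <- c("1", "2", "3")
l1 <- list(
  prop1 = matrix(c(0, 1,  1/2, 1/2,  2/3, 1/3), nrow = 3, byrow = TRUE,
                 dimnames = list(ids_ref, c("A", "B"))),
  prop2 = matrix(c(0, 1,  1/2, 1/2,  1, 0), nrow = 3, byrow = TRUE,
                 dimnames = list(ids_ref, c("FALSE", "TRUE"))),
  prop3 = matrix(c(0, 0, 1, 0,  0, 1, 0, 0,  1/3, 1/3, 0, 1/3), nrow = 3, byrow = TRUE,
                 dimnames = list(ids_ref, c("3", "4", "5", "6")))
)
ids <- c(2, 3, 1, 3, 3, 2, 4)

# Occurrences of each reference id; ids without reference data (here 4) become NA and drop out
w <- table(factor(ids, levels = ids_ref))

# Weighted average of the per-id rows: sum_i(w_i * row_i) / sum_i(w_i)
aggregate_hist <- function(h) {
  h <- as.matrix(h)[names(w), , drop = FALSE]  # also accepts the data frames from fun1
  colSums(h * as.vector(w)) / sum(w)           # recycling scales row i by w[i]
}

l2 <- lapply(l1, aggregate_hist)
lapply(l2, round, 3)
```

Multiplying the matrix by the weight vector relies on column-wise recycling, which works here because the weights have exactly one entry per row.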
Upvotes: 1
Views: 488
Reputation: 6020
I have a solution for you that does involve other packages (dplyr and tidyr). I reshape (melt) the data that you generated in the list, and after that I spread it into a nice wide data.frame. Of course, you could also use the normalized version of the data (df) within function(x).
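library(dplyr)
library(tidyr)
library(tibble)  # rownames_to_column() lives in tibble
res <- do.call(rbind,
  lapply(ids, function(id) do.call(cbind,
    lapply(names(l1), function(x) {
      # long format: one row per (id, level) of property x
      df <- l1[[x]] %>% rownames_to_column("id")
      df <- df[df$id == id, ] %>% gather(key, value, -id)
      if (nrow(df) > 0) {
        # prefix each level with the property name, then go back to wide
        df[, 'key'] <- paste0(x, '.', df[, 'key'])
        df <- df %>% spread(key, value)
      }
      df
    }))
  )
)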
result:
> res
id prop1.A prop1.B id prop2.FALSE prop2.TRUE id prop3.4 prop3.5 prop3.6
1 2 0.6666667 0.3333333 2 0.6666667 0.3333333 2 0.3333333 0.6666667 0
2 3 1.0000000 0.0000000 3 1.0000000 0.0000000 3 0.0000000 0.0000000 1
3 2 0.6666667 0.3333333 2 0.6666667 0.3333333 2 0.3333333 0.6666667 0
4 2 0.6666667 0.3333333 2 0.6666667 0.3333333 2 0.3333333 0.6666667 0
5 2 0.6666667 0.3333333 2 0.6666667 0.3333333 2 0.3333333 0.6666667 0
You can also achieve this without your prop.table function, using only dplyr and tidyr, which is a much neater solution.
propsum <- df %>% gather(key, value, -id) %>% mutate(n = 1) %>%
  complete(nesting(key, value), id, fill = list(n = 0)) %>%
  group_by(id, key, value) %>%
  summarise(n = sum(n)) %>%
  group_by(id, key) %>%
  mutate(p = n / sum(n),
         col = paste0(key, '.', value)) %>%
  ungroup() %>%
  select(id, col, p) %>%
  spread(col, p)
propsum[match(ids, propsum$id), ]
result:
# A tibble: 10 × 8
id prop1.A prop1.B prop2.FALSE prop2.TRUE prop3.4 prop3.5 prop3.6
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 0.6666667 0.3333333 0.6666667 0.3333333 0.3333333 0.6666667 0
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 3 1.0000000 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 1
6 NA NA NA NA NA NA NA NA
7 2 0.6666667 0.3333333 0.6666667 0.3333333 0.3333333 0.6666667 0
8 2 0.6666667 0.3333333 0.6666667 0.3333333 0.3333333 0.6666667 0
9 NA NA NA NA NA NA NA NA
10 2 0.6666667 0.3333333 0.6666667 0.3333333 0.3333333 0.6666667 0
Since you added the expected result, I am not sure how that result was generated, so I give you two options.
Option 1: join the source data to the given ids, so that the source rows are repeated once per occurrence of each id, and compute the proportions from those rows.
#option 1
data.frame(id = ids) %>% inner_join(df, by='id') %>%
gather(key, value, -id) %>%
group_by(key, value) %>%
mutate(n = 1) %>%
complete(nesting(key,value),id, fill=list(n = 0)) %>%
summarise(n = sum(n)) %>%
group_by(key) %>%
mutate(p = n/sum(n))
which results in:
key value n p
<chr> <chr> <dbl> <dbl>
1 prop1 A 9 0.69230769
2 prop1 B 4 0.30769231
3 prop2 FALSE 9 0.69230769
4 prop2 TRUE 4 0.30769231
5 prop3 4 4 0.30769231
6 prop3 5 8 0.61538462
7 prop3 6 1 0.07692308
Or option 2: compute the per-id proportions first, join them to the given ids, and take the mean proportion.
#option 2
df %>% gather(key,value,-id) %>% mutate(n = 1) %>%
complete(nesting(key,value),id, fill=list(n = 0)) %>%
group_by(id, key, value) %>%
summarise(n = sum(n)) %>%
group_by(id, key) %>%
mutate(p = n/sum(n)) %>%
inner_join(data.frame(id = ids), by='id') %>%
group_by(key, value) %>%
summarise(p = mean(p))
which results in:
Source: local data frame [7 x 3]
Groups: key [?]
key value p
<chr> <chr> <dbl>
1 prop1 A 0.7333333
2 prop1 B 0.2666667
3 prop2 FALSE 0.7333333
4 prop2 TRUE 0.2666667
5 prop3 4 0.2666667
6 prop3 5 0.5333333
7 prop3 6 0.2000000
Upvotes: 2