Reputation: 5049

apply distribution to new sample set

I have a data-frame dfu that holds for each id (id belongs to one team, team has many ids) the percentage samples where a bunch of properties prop1, prop2 and so on are observed based on some past studies - this is used as sort of reference table for future studies. Now there is data from new experiment which gives a new set of ids. I need to find the percentage samples where prop1, prop2 and so on are observed on per team basis by using the reference data in dfu. This could be done by counting the number of occurrences per id in dfi and then take a weighted average grouped by team.- not all ids in dfu may be present and one or more ids not present in dfu may be present in dfi. The ids not present in dfu may be excluded from the weighted average as no presence per property values are available for them.

dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"), prop1=c(0.8,0.9,0.6,0.5,0.8,0.9), prop2=c(0.2,0.3,.3,.2,.2,.3))
> dfu
  id team prop1 prop2
   1    A   0.8   0.2
   2    B   0.9   0.3
   3    C   0.6   0.3
   4    A   0.5   0.2
   5    A   0.8   0.2
   6    C   0.9   0.3
> 
> dfi <- data.frame(id=c(2  , 3  , 2  , 1  , 4  , 3  , 7))
> dfi
  id
   2
   3
   2
   1
   4
   3
   7

The output format would be like below. For example the value for prop1 for group A would be (0.8*1 + 0.5*1)/2 = 0.65.

team  prop1   prop2
  A            
  B    
  C

prefer base R approach, other approaches welcome. The number of columns could be many.

Upvotes: 0

Answers (2)

user3206440

Reputation: 5049

taking cue from @DJJ's answer.

dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"),
          prop1=c(0.8,0.9,0.6,0.5,0.8,0.9), 
          prop2=c(0.2,0.3,.3,.2,.2,.3))
dfi <- data.frame(id=c(2  , 3  , 2  , 1  , 4  , 3  , 7))

Merge by id

> dfx <- merge(dfi, dfu, by="id")
> dfx
  id team prop1 prop2
1  1    A   0.8   0.2
2  2    B   0.9   0.3
3  2    B   0.9   0.3
4  3    C   0.6   0.3
5  3    C   0.6   0.3
6  4    A   0.5   0.2

Aggregate prop1 and prop2 by team with mean

> aggregate(cbind(prop1, prop2) ~ team, dfx, mean)
  team prop1 prop2
1    A  0.65   0.2
2    B  0.90   0.3
3    C  0.60   0.3

Upvotes: 1

DJJ

Reputation: 2539

I don't know exactly how to do it with base R.

With data.table it's should be pretty easy. Let convert your data.frames into data.table.

library(data.table)

dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"), prop1=c(0.8,0.9,0.6,0.5,0.8,0.9), prop2=c(0.2,0.3,.3,.2,.2,.3))

dfi <- data.frame(id=c(2  , 3  , 2  , 1  , 4  , 3  , 7))
dfi <- data.table(dfi)
dfu <- data.table(dfu)

Then merge them like

dfu[dfi,on="id"]

## > dfu[dfi,on="id"]                  
##    id team prop1 prop2
## 1:  2    B   0.9   0.3
## 2:  3    C   0.6   0.3
## 3:  2    B   0.9   0.3
## 4:  1    A   0.8   0.2
## 5:  4    A   0.5   0.2
## 6:  3    C   0.6   0.3
## 7:  7   NA    NA    NA

Then we just have to perform the mean by group. In fact we can to it one liner like

    dfu[dfi,on="id"][,mean(prop1),team]

## > dfu[dfi,on="id"][,mean(prop1),team] 
##    team   V1
## 1:    B 0.90
## 2:    C 0.60
## 3:    A 0.65
## 4:   NA   NA

You can achieve the same thing in base R by merging the data.frame and using the function aggregate I guess.

Upvotes: 1

apply distribution to new sample set

Answers (2)

Related Questions