Reputation: 714
I have a dataset like so:
testdata <- read.table(header=T, text='
patids labels dbins vprobs Response
16186 SUP0 0.0 100 1
16186 SUP0 0.2 99 1
16186 SUP0 0.4 95 1
16186 SUP0 0.6 99 1
16186 SUP0 0.8 50 1
16186 SUP0 1.0 0 1
18185 SUP0 0.0 100 0
18185 SUP0 0.2 100 0
18185 SUP0 0.4 5 0
18185 SUP0 0.6 2 0
18185 SUP0 0.8 0 0
54234 INF0 0.0 100 1
54234 INF0 0.2 95 1
54234 INF0 0.4 90 1
54234 INF0 0.6 30 1
54234 INF0 0.8 0 1
18185 INF0 0.0 100 0
18185 INF0 0.2 20 0
18185 INF0 0.4 10 0
18185 INF0 0.6 5 0
18185 INF0 0.8 3 0
18185 INF0 1.0 0 0
16186 INF0 0.0 100 1
16186 INF0 0.2 100 1
16186 INF0 0.4 70 1
16186 INF0 0.6 60 1
16186 INF0 0.8 50 1
16186 INF0 1.0 0 1
54234 SUP1 0.0 100 1
54234 SUP1 0.2 95 1
54234 SUP1 0.4 90 1
54234 SUP1 0.6 30 1
54234 SUP1 0.8 0 1
18185 SUP1 0.0 100 0
18185 SUP1 0.2 50 0
18185 SUP1 0.4 0 0
16186 SUP1 0.0 100 1
16186 SUP1 0.2 100 1
16186 SUP1 0.4 40 1
16186 SUP1 0.6 10 1
16186 SUP1 0.8 22 1
16186 SUP1 1.0 0 1 ')
Now, for each "labels", i.e SUP0, SUP1 e.t.c, I want to obtain the mean of the variable dbins ( mean taken over all unique "patids" variables. The problem I am facing is that the "dbins" are not all of same length for each "patids". Is there some way to fill with NAs or 0's before taking this means ? My expected output has to be like this:
for SUP0
labels dbins dbins.16186 dbins.18185
SUP0 0.0 0.0
SUP0 0.2 0.2
SUP0 0.4 0.4
SUP0 0.6 0.6
SUP0 0.8 0.8
SUP0 1.0 NA
and for INF0
labels dbins.54234 dbins.18185 dbins.16186
INF0 0.0 0.0 0.0 0.0
INF0 0.2 0.2 0.0 0.2
INF0 0.4 0.4 0.0 0.4
INF0 0.6 0.6 0.0 0.6
INF0 0.8 0.8 0.8 0.8
INFO NA 1.0 1.0 1.0
...so that I can take mean over columns.
I have been trying with ddply and simillar functions but I can't get this particular output format. Can someone please help ?
Thanks in advance
Upvotes: 4
Views: 319
Reputation: 5898
for the means directly:
> require(data.table)
> testdata <- as.data.table(testdata)
> testdata[, mean(dbins), by=c("patids","labels")]
patids labels V1
1: 16186 SUP0 0.5
2: 18185 SUP0 0.4
3: 54234 INF0 0.4
4: 18185 INF0 0.5
5: 16186 INF0 0.5
6: 54234 SUP1 0.4
7: 18185 SUP1 0.2
8: 16186 SUP1 0.5
>
Upvotes: 1
Reputation: 14370
I want to obtain the mean of the variable dbins ( mean taken over all unique "patids" variables
Using data.table
R) library(data.table)
R) testdata=as.data.table(testdata)
R) testdata
patids labels dbins vprobs Response
1: 16186 SUP0 0.0 100 1
2: 16186 SUP0 0.2 99 1
3: 16186 SUP0 0.4 95 1
---
40: 16186 SUP1 0.6 10 1
41: 16186 SUP1 0.8 22 1
42: 16186 SUP1 1.0 0 1
patids labels dbins vprobs Response
R) testdata[,list(dbins=mean(dbins)),by="patids"]
patids dbins
1: 16186 0.5
2: 18185 0.4
3: 54234 0.4
Upvotes: 1
Reputation: 15405
The answer you want could be one of two things.
The exact output you've suggested.
The means of each of the categories (for which the output you've provided is just a method of getting there)
I'm going to use plyr and reshape2, but no doubt @mnel will be around soon to give a data.table solution.
The problem here is that you have several groups with multiple elements. So first, we need to group the elements (using @Maiasaura's solution here).
library(plyr)
testgroups <- ddply(testdata, .(labels, patids), function(x) { x$group <- 1:nrow(x); x })
Then we can reshape them properly:
library(reshape2)
testreshape <- dcast(testgroups[,c("labels", "patids", "dbins", "group")], labels+group~patids, value.var="dbins")
labels group 16186 18185 54234
1 INF0 1 0.0 0.0 0.0
2 INF0 2 0.2 0.2 0.2
3 INF0 3 0.4 0.4 0.4
4 INF0 4 0.6 0.6 0.6
5 INF0 5 0.8 0.8 0.8
6 INF0 6 1.0 1.0 NA
7 SUP0 1 0.0 0.0 NA
8 SUP0 2 0.2 0.2 NA
9 SUP0 3 0.4 0.4 NA
10 SUP0 4 0.6 0.6 NA
11 SUP0 5 0.8 0.8 NA
12 SUP0 6 1.0 NA NA
13 SUP1 1 0.0 0.0 0.0
14 SUP1 2 0.2 0.2 0.2
15 SUP1 3 0.4 0.4 0.4
16 SUP1 4 0.6 NA 0.6
17 SUP1 5 0.8 NA 0.8
18 SUP1 6 1.0 NA NA
From here you can use something like testreshape[testreshape$labels=="INF0",]
This is much simpler:
library(plyr)
testmeans <- ddply(testdata, .(labels, patids), summarise, mean=mean(dbins, na.rm=TRUE))
labels patids mean
1 INF0 16186 0.5
2 INF0 18185 0.5
3 INF0 54234 0.4
4 SUP0 16186 0.5
5 SUP0 18185 0.4
6 SUP1 16186 0.5
7 SUP1 18185 0.2
8 SUP1 54234 0.4
Upvotes: 3
Reputation: 42669
Simply take the mean
of each category directly with tapply
, which is made for ragged arrays:
tapply(testdata$dbins, interaction(testdata$labels, testdata$patid, drop=TRUE), FUN=mean)
## INF0.16186 SUP0.16186 SUP1.16186 INF0.18185 SUP0.18185 SUP1.18185 INF0.54234 SUP1.54234
## 0.5 0.5 0.5 0.5 0.4 0.2 0.4 0.4
Upvotes: 2