Reputation: 714

extracting complex subset from dataframe in R

I have a dataset like so:

testdata <- read.table(header=T, text='
patids labels dbins vprobs Response 
16186 SUP0         0.0         100        1 
16186 SUP0         0.2         99        1 
16186 SUP0         0.4         95        1 
16186 SUP0         0.6         99        1 
16186 SUP0         0.8         50        1 
16186 SUP0         1.0         0        1 
18185 SUP0         0.0         100        0 
18185 SUP0         0.2         100        0 
18185 SUP0         0.4         5        0 
18185 SUP0         0.6         2        0 
18185 SUP0         0.8         0        0 
54234 INF0         0.0         100        1 
54234 INF0         0.2         95        1 
54234 INF0         0.4         90        1 
54234 INF0         0.6         30        1 
54234 INF0         0.8         0        1 
18185 INF0         0.0         100        0 
18185 INF0         0.2         20        0 
18185 INF0         0.4         10        0 
18185 INF0         0.6         5        0 
18185 INF0         0.8         3        0 
18185 INF0         1.0         0        0 
16186 INF0         0.0         100        1 
16186 INF0         0.2         100        1 
16186 INF0         0.4         70        1 
16186 INF0         0.6         60        1 
16186 INF0         0.8         50        1 
16186 INF0         1.0         0        1 
54234 SUP1         0.0         100        1 
54234 SUP1         0.2         95        1 
54234 SUP1         0.4         90        1 
54234 SUP1         0.6         30        1 
54234 SUP1         0.8         0        1 
18185 SUP1         0.0         100        0 
18185 SUP1         0.2         50        0 
18185 SUP1         0.4         0        0
16186 SUP1         0.0         100        1 
16186 SUP1         0.2         100        1 
16186 SUP1         0.4         40        1 
16186 SUP1         0.6         10        1 
16186 SUP1         0.8         22        1 
16186 SUP1         1.0         0        1 ')

Now, for each "labels", i.e SUP0, SUP1 e.t.c, I want to obtain the mean of the variable dbins ( mean taken over all unique "patids" variables. The problem I am facing is that the "dbins" are not all of same length for each "patids". Is there some way to fill with NAs or 0's before taking this means ? My expected output has to be like this:

for SUP0

labels dbins dbins.16186 dbins.18185
SUP0         0.0         0.0 
SUP0         0.2         0.2          
SUP0         0.4         0.4          
SUP0         0.6         0.6          
SUP0         0.8         0.8          
SUP0         1.0         NA

and for INF0

labels      dbins.54234 dbins.18185 dbins.16186
INF0         0.0         0.0        0.0    0.0    
INF0         0.2         0.2        0.0    0.2   
INF0         0.4         0.4        0.0    0.4   
INF0         0.6         0.6        0.0    0.6   
INF0         0.8         0.8        0.8    0.8   
INFO         NA          1.0        1.0    1.0

...so that I can take mean over columns.

I have been trying with ddply and simillar functions but I can't get this particular output format. Can someone please help ?

Thanks in advance

Upvotes: 4

Answers (4)

Michael

Reputation: 5898

for the means directly:

> require(data.table)
> testdata <- as.data.table(testdata)
> testdata[, mean(dbins), by=c("patids","labels")]
   patids labels  V1
1:  16186   SUP0 0.5
2:  18185   SUP0 0.4
3:  54234   INF0 0.4
4:  18185   INF0 0.5
5:  16186   INF0 0.5
6:  54234   SUP1 0.4
7:  18185   SUP1 0.2
8:  16186   SUP1 0.5
>

Upvotes: 1

statquant

Reputation: 14370

I want to obtain the mean of the variable dbins ( mean taken over all unique "patids" variables

Using data.table

R) library(data.table)
R) testdata=as.data.table(testdata)
R) testdata
    patids labels dbins vprobs Response
 1:  16186   SUP0   0.0    100        1
 2:  16186   SUP0   0.2     99        1
 3:  16186   SUP0   0.4     95        1
---
40:  16186   SUP1   0.6     10        1
41:  16186   SUP1   0.8     22        1
42:  16186   SUP1   1.0      0        1
    patids labels dbins vprobs Response

R) testdata[,list(dbins=mean(dbins)),by="patids"]
   patids dbins
1:  16186   0.5
2:  18185   0.4
3:  54234   0.4

Upvotes: 1

sebastian-c

Reputation: 15405

The answer you want could be one of two things.

The exact output you've suggested.
The means of each of the categories (for which the output you've provided is just a method of getting there)

I'm going to use plyr and reshape2, but no doubt @mnel will be around soon to give a data.table solution.

1. The output you've suggested

The problem here is that you have several groups with multiple elements. So first, we need to group the elements (using @Maiasaura's solution here).

library(plyr)
testgroups <- ddply(testdata, .(labels, patids), function(x) { x$group <- 1:nrow(x); x })

Then we can reshape them properly:

library(reshape2)
testreshape <- dcast(testgroups[,c("labels", "patids", "dbins", "group")], labels+group~patids, value.var="dbins")

   labels group 16186 18185 54234
1    INF0     1   0.0   0.0   0.0
2    INF0     2   0.2   0.2   0.2
3    INF0     3   0.4   0.4   0.4
4    INF0     4   0.6   0.6   0.6
5    INF0     5   0.8   0.8   0.8
6    INF0     6   1.0   1.0    NA
7    SUP0     1   0.0   0.0    NA
8    SUP0     2   0.2   0.2    NA
9    SUP0     3   0.4   0.4    NA
10   SUP0     4   0.6   0.6    NA
11   SUP0     5   0.8   0.8    NA
12   SUP0     6   1.0    NA    NA
13   SUP1     1   0.0   0.0   0.0
14   SUP1     2   0.2   0.2   0.2
15   SUP1     3   0.4   0.4   0.4
16   SUP1     4   0.6    NA   0.6
17   SUP1     5   0.8    NA   0.8
18   SUP1     6   1.0    NA    NA

From here you can use something like testreshape[testreshape$labels=="INF0",]

2. The means of each of the categories

This is much simpler:

library(plyr)
testmeans <- ddply(testdata, .(labels, patids), summarise, mean=mean(dbins, na.rm=TRUE))

  labels patids mean
1   INF0  16186  0.5
2   INF0  18185  0.5
3   INF0  54234  0.4
4   SUP0  16186  0.5
5   SUP0  18185  0.4
6   SUP1  16186  0.5
7   SUP1  18185  0.2
8   SUP1  54234  0.4

Upvotes: 3

Matthew Lundberg

Reputation: 42669

Simply take the mean of each category directly with tapply, which is made for ragged arrays:

tapply(testdata$dbins, interaction(testdata$labels, testdata$patid, drop=TRUE), FUN=mean)
## INF0.16186 SUP0.16186 SUP1.16186 INF0.18185 SUP0.18185 SUP1.18185 INF0.54234 SUP1.54234 
##        0.5        0.5        0.5        0.5        0.4        0.2        0.4        0.4

Upvotes: 2

extracting complex subset from dataframe in R

Answers (4)

1. The output you've suggested

2. The means of each of the categories

Related Questions