Reputation: 23
I'm new and learning R. I'm trying to ask a question that I don't know the words for. Suppose I have a data frame such that:
df<-data.frame(ID=c("A","A","A","B","B","B","C","C","C"),
Week=c(1,2,3,1,2,3,1,2,3),
Variable=c(30,25,27,42,44,45,30,50,19))
ID Week Variable
1 A 1 30
2 A 2 25
3 A 3 27
4 B 1 42
5 B 2 44
6 B 3 45
7 C 1 30
8 C 2 50
9 C 3 19
How can I find what is the average Variable at Week 2 for all ID that had Variable = 30 at Week 1?
For example, I would like the output in this example to = 37.5
Upvotes: 0
Views: 65
Reputation: 11649
Step 1: Obtain the IDs which had Variable=30 in Week1
res<- subset( df,Variable==30 & Week==1, ID )
The output is:
> res
ID
1 A
7 C
Step 2: Get all their variables at week 2:
dt<-subset(df,ID %in% as.vector(unlist(res)) & Week==2 ,select=c(ID,Variable))
The output is:
ID Variable
2 A 25
8 C 50
Step 3: Get the mean:
mean(dt$Variable)
The final output is:
37.5
In step 2 we have ID %in% as.vector(unlist(res))
. So, what does it mean?
The %in% part simply is an operator which returns true if it finds an ID inside the right handside vector. For example, run the below sample:
a<- 1:10
b<-c(0,4,6,8,16)
b %in% a
and the result is:
FALSE TRUE TRUE TRUE FALSE
So, the %in% operator returns a Boolean value for each element of b
. The result will be True if, that element exist in a
, otherwise it returns False. As you see 0 and 16 have False.
But, the point is, a
should be vector, meanwhile res
is a data.frame
so, i need to first unlist
it, and then consider it as a vector
(as.vector
).
In conclusion, ID %in% as.vector(unlist(res))
checks if each ID exist in res or not.
Upvotes: 0
Reputation: 10671
This might be easier to read/see.
library(tidyverse)
df %>%
spread(Week, Variable) %>%
filter(`1` == 30) %>%
with(mean(`2`))
[1] 37.5
I think tidyverse
code is easier to understand because you can read it left to right like you would any non-code text. And the piping %>%
makes seeing the order of operations easier, no more parentheses to parse.
Upvotes: 1
Reputation: 3221
First we need ID
's which have entry for variable=30 AND week=1
and then from that ID
's extract ID
's with Week=2
and do avg(Variable)
Base R Solution:
mean(df[df$ID %in% (df[df$Week==1 & df$Variable==30,1]) & df$Week==2,3])
Output:
[1] 37.5
OR (another approach)
Using sqldf
:
library(sqldf)
sqldf("select avg(Variable) from df where ID IN (select ID from df where variable=30 AND week=1) AND Week=2")
Output:
avg(Variable)
1 37.5
Upvotes: 0