Reputation: 965
I'm trying to pull one subject out of the data set and then merge it back with the others, so that its values at each time point can be compared with everyone else's.
This is what the data looks like:
subject <- rep(1:5, each = 20)
seconds <- rep(1:20, times = 20)
variable <- rnorm(n = subject, mean = 20, sd = 10)
d <- data.frame(subject, seconds, variable)
Then, I am removing subject four from the data and trying to merge them back to compare them to each of the other subjects:
four <- subset(d, subject == 4)
d2 <- subset(d, subject != 4)
I've tried this but the problem is that it repeats each of the seconds 4 times for each merge:
merge(d2, four, by = "seconds")
Is there a way to get an exact merge of each individual relative to subject 4?
Upvotes: 2
Views: 42
Reputation: 11514
The problem in your code comes from the fact that only subject 4 has values that satisfy seconds == 4. See:
subject <- rep(1:5, each = 20)
seconds <- rep(1:20, each = 20)
d <- data.frame(subject, seconds)
with(d, table(subject, seconds))
       seconds
subject  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
      1 20  0  0  0  0 20  0  0  0  0 20  0  0  0  0 20  0  0  0  0
      2  0 20  0  0  0  0 20  0  0  0  0 20  0  0  0  0 20  0  0  0
      3  0  0 20  0  0  0  0 20  0  0  0  0 20  0  0  0  0 20  0  0
      4  0  0  0 20  0  0  0  0 20  0  0  0  0 20  0  0  0  0 20  0
      5  0  0  0  0 20  0  0  0  0 20  0  0  0  0 20  0  0  0  0 20
Since you are merging on seconds, and the seconds values that occur in four (4, 9, 14 and 19 in the table above) never occur in d2, the output of merge is correct, i.e. you would expect an empty table.
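A quick way to confirm this is to compare the key values on both sides before merging. The sketch below rebuilds the data frame from the snippet above; the helper names keys_four, keys_rest and shared are only illustrative.

```r
# Rebuild the data frame from the snippet above
# (subject has 100 values and is recycled to match the 400 values of seconds)
subject <- rep(1:5, each = 20)
seconds <- rep(1:20, each = 20)
d <- data.frame(subject, seconds)

four <- subset(d, subject == 4)
d2 <- subset(d, subject != 4)

# Values of the merge key that occur on each side
keys_four <- sort(unique(four$seconds))
keys_rest <- sort(unique(d2$seconds))

# merge() can only produce rows for keys present on both sides
shared <- intersect(keys_four, keys_rest)
length(shared)

# Number of rows in the merged result
merged <- merge(d2, four, by = "seconds")
nrow(merged)
```

If shared is empty, an inner merge on that key has nothing to match, so the empty result is expected rather than a bug in merge.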
If you change the ordering, the problem will not occur.
subject <- rep(1:20, each = 5)
seconds <- rep(1:20, each = 20)
d <- data.frame(subject, seconds)
four <- subset(d, subject == 4)
d2 <- subset(d, subject != 4)
newdf <- merge(d2, four, by = "seconds")
head(newdf)
  seconds subject.x subject.y
1       1         1         4
2       1         1         4
3       1         1         4
4       1         1         4
5       1         1         4
6       1         1         4
Here you can see that subjects now appear in both subject.x and subject.y, i.e. the columns coming from the left and the right data frame passed to merge.
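The .x and .y suffixes are merge's defaults for non-key columns that share a name. As a small sketch, the suffixes argument of merge can give them more descriptive names; the labels ".other" and ".four" are just an illustrative choice.

```r
# Same data as in the reordered example above
subject <- rep(1:20, each = 5)
seconds <- rep(1:20, each = 20)
d <- data.frame(subject, seconds)

four <- subset(d, subject == 4)
d2 <- subset(d, subject != 4)

# suffixes replaces the default ".x" / ".y" on clashing non-key columns
newdf <- merge(d2, four, by = "seconds", suffixes = c(".other", ".four"))
names(newdf)
```

This changes only the column names; the rows of the merged result are the same as before.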
A comment: what you are after sounds more like reshaping your data, but you still need to figure out what to do with your duplicates. To give you an idea:
library(reshape2)
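# levels are given explicitly so that TRUE (subject 4) is labelled "four"
d$ind <- factor(d$subject == 4, levels = c(TRUE, FALSE), labels = c("four", "not four"))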
out <- dcast(d, seconds ~ ind, fun.aggregate = function(x) x[1], value.var = "variable")
head(out)
  seconds      four  not four
1       1 20.836195 16.539739
2       2 15.923540 11.534704
3       3  1.250495 12.992153
4       4 25.127817 31.510210
5       5  8.990819  8.030607
6       6 21.783900 38.300430
This will take the first value whenever there is a duplicate.
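If the data were meant to contain exactly one row per subject and second, the duplicates disappear and a plain merge is enough. The sketch below rests on assumptions that are not in the original post: seconds is built with times = 5, rnorm is given length(subject) values, a seed is set for reproducibility, and the names variable_four, others, cmp and diff are made up for illustration.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

subject <- rep(1:5, each = 20)
seconds <- rep(1:20, times = 5)   # assumption: 20 seconds per subject
variable <- rnorm(n = length(subject), mean = 20, sd = 10)
d <- data.frame(subject, seconds, variable)

# Subject 4 as a lookup table: one value per second
four <- subset(d, subject == 4, select = c(seconds, variable))
names(four)[names(four) == "variable"] <- "variable_four"

# Everyone else keeps their subject id
others <- subset(d, subject != 4)

# One row per (subject, seconds), with subject 4's value alongside
cmp <- merge(others, four, by = "seconds")
cmp$diff <- cmp$variable - cmp$variable_four
cmp <- cmp[order(cmp$subject, cmp$seconds), ]
head(cmp)
```

Because four then has a single row per value of seconds, each row of others matches exactly one row and nothing is repeated.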
Upvotes: 2