Reputation: 474
This is a problem I've come across with some surveyed follow-up data and have painfully been doing this manually. There must be an elegant way to do this in R, but I just haven't found a solution anywhere to this problem.
Specifically I have a data.frame
where the same individuals have been asked to provide a baseline value and then follow-up value of a variable of interest. If I want to perform a paired analysis downstream, I can only use those individuals for whom I have complete data on both
a) baseline
b) follow-up
Here's some sample data
a <- rep(c("Peter", "Pan", "Mickey", "Mouse"), times = 1, each = 2)
b <- rep(c("Baseline", "Follow-up"), times = 4)
c <- c(6, 1, 6, 3, 7, 3, NA, 1)
df <- data.frame(a, b, c)
df
a b c
1 Peter Baseline 6
2 Peter Follow-up 1
3 Pan Baseline 6
4 Pan Follow-up 3
5 Mickey Baseline 7
6 Mickey Follow-up 3
7 Mouse Baseline NA
8 Mouse Follow-up 1
As we can see Mouse has missing baseline information and as such cannot be included in the paired analysis.
df2 <- df[complete.cases(df),]
# OR
df2 <- na.omit(df) # both produce the same result
df2
a b c
1 Peter Baseline 6
2 Peter Follow-up 1
3 Pan Baseline 6
4 Pan Follow-up 3
5 Mickey Baseline 7
6 Mickey Follow-up 3
8 Mouse Follow-up 1
are not helpful in this situation, because they retain the follow-up value of Mouse, whereas I only want to perform further analysis on individuals of whom I have complete data, as otherwise I may be including incomplete data thus falsifying my downstream analysis.
If anyone would have a great idea on how to a) identify the individuals with missing data of the variable of interest and b) how to subsequently remove ALL data of that individual, that would be awesome.
Thanks.
UPDATE FOLLOWING THE EXCELLENT INITIAL ANSWERS.
The real data I'm dealing with is not as "symmetrical" as the example data and I have the following issue.
a <- rep(c("Peter", "Pan", "Mickey", "Mouse"), times = 1, each = 4)
b <- rep(c("Baseline", "Follow-up"), times = 4, each = 2)
c <- rep(c(6, 6, 1, 1, 7, 7, 3, 2, NA, 9, 1, 1, 7, 7, NA, 2))
d <- rep(c("Arm", "Leg"), times = 8, each = 1)
df <- data.frame(a, b, c, d)
so as you can see the individuals are surveyed for different locations (here arbitrarily arms and legs) for the same variable c.
if I now apply the solutions provided:
df %>% group_by(a) %>% filter(all(!is.na(c)))
Source: local data frame [8 x 4]
Groups: a
a b c d
1 Peter Baseline 6 Arm
2 Peter Baseline 6 Leg
3 Peter Follow-up 1 Arm
4 Peter Follow-up 1 Leg
5 Pan Baseline 7 Arm
6 Pan Baseline 7 Leg
7 Pan Follow-up 3 Arm
8 Pan Follow-up 2 Leg
I lose all the individuals for whom I may have had incomplete information with regards to one location, e.g. Mouse has complete data for df$d == "Leg"
and thus I could use this downstream.
If I apply Paulo's solution - I get an incomplete dataframe, because both measurements at baseline for Mouse for example were retained.
df %>%
filter(complete.cases(.))%>%
group_by(a)%>%
mutate(n=length(c))%>%
filter(n>1)
Source: local data frame [14 x 5]
Groups: a
a b c d n
1 Peter Baseline 6 Arm 4
2 Peter Baseline 6 Leg 4
3 Peter Follow-up 1 Arm 4
4 Peter Follow-up 1 Leg 4
5 Pan Baseline 7 Arm 4
6 Pan Baseline 7 Leg 4
7 Pan Follow-up 3 Arm 4
8 Pan Follow-up 2 Leg 4
9 Mickey Baseline 9 Leg 3
10 Mickey Follow-up 1 Arm 3
11 Mickey Follow-up 1 Leg 3
12 Mouse Baseline 7 Arm 3
13 Mouse Baseline 7 Leg 3
14 Mouse Follow-up 2 Leg 3
Note, my real data has a situation where "Peter" has 4 baseline and follow-up measurements, "Pan" has 6 baseline and follow-up measurements etc.
Sorry to be a hassle and as written in the comments, replies have been accepted of course. But if you have an idea how to solve this issue, that would be awesome! Thanks.
Upvotes: 1
Views: 269
Reputation: 5856
EDITED
For your new problem, something like this
library(dplyr)
df%>%
filter(complete.cases(.))%>%
group_by(a, d)%>%
mutate(n=length(c))%>%
filter(n>1)%>%select(-n)
will give you this
Source: local data frame [12 x 4]
Groups: a, d
a b c d
1 Peter Baseline 6 Arm
2 Peter Baseline 6 Leg
3 Peter Follow-up 1 Arm
4 Peter Follow-up 1 Leg
5 Pan Baseline 7 Arm
6 Pan Baseline 7 Leg
7 Pan Follow-up 3 Arm
8 Pan Follow-up 2 Leg
9 Mickey Baseline 9 Leg
10 Mickey Follow-up 1 Leg
11 Mouse Baseline 7 Leg
12 Mouse Follow-up 2 Leg
Upvotes: 3
Reputation: 24945
You could try a dplyr
approach:
library(dplyr)
df %>% group_by(a) %>%
filter(!any(is.na(c)))
This will kick out any groups (of your a) that have any NAs in the c column.
For the new data, unless I've read it wrong you just want to group by (a, d)
now:
df %>% group_by(a, d) %>%
filter(!any(is.na(c)))
Upvotes: 3