OFish
OFish

Reputation: 474

How do you identify missing pairs within one data.frame in R?

This is a problem I've come across with some surveyed follow-up data and have painfully been doing this manually. There must be an elegant way to do this in R, but I just haven't found a solution anywhere to this problem. Specifically I have a data.frame where the same individuals have been asked to provide a baseline value and then follow-up value of a variable of interest. If I want to perform a paired analysis downstream, I can only use those individuals for whom I have complete data on both a) baseline b) follow-up

Here's some sample data

a <- rep(c("Peter", "Pan", "Mickey", "Mouse"), times = 1, each = 2)
b <- rep(c("Baseline", "Follow-up"), times = 4)
c <- c(6, 1, 6, 3, 7, 3, NA, 1)
df <- data.frame(a, b, c)
df
       a         b  c
1  Peter  Baseline  6
2  Peter Follow-up  1
3    Pan  Baseline  6
4    Pan Follow-up  3
5 Mickey  Baseline  7
6 Mickey Follow-up  3
7  Mouse  Baseline NA
8  Mouse Follow-up  1

As we can see Mouse has missing baseline information and as such cannot be included in the paired analysis.

df2 <- df[complete.cases(df),]
# OR 
df2 <- na.omit(df) # both produce the same result
df2
       a         b c
1  Peter  Baseline 6
2  Peter Follow-up 1
3    Pan  Baseline 6
4    Pan Follow-up 3
5 Mickey  Baseline 7
6 Mickey Follow-up 3
8  Mouse Follow-up 1

are not helpful in this situation, because they retain the follow-up value of Mouse, whereas I only want to perform further analysis on individuals of whom I have complete data, as otherwise I may be including incomplete data thus falsifying my downstream analysis.

If anyone would have a great idea on how to a) identify the individuals with missing data of the variable of interest and b) how to subsequently remove ALL data of that individual, that would be awesome.

Thanks.

UPDATE FOLLOWING THE EXCELLENT INITIAL ANSWERS.

The real data I'm dealing with is not as "symmetrical" as the example data and I have the following issue.

a <- rep(c("Peter", "Pan", "Mickey", "Mouse"), times = 1, each = 4)
b <- rep(c("Baseline", "Follow-up"), times = 4, each = 2)
c <- rep(c(6, 6, 1, 1, 7, 7, 3, 2, NA, 9, 1, 1, 7, 7, NA, 2))
d <- rep(c("Arm", "Leg"), times = 8, each = 1)
df <- data.frame(a, b, c, d)

so as you can see the individuals are surveyed for different locations (here arbitrarily arms and legs) for the same variable c.

if I now apply the solutions provided:

df %>% group_by(a) %>% filter(all(!is.na(c)))
    Source: local data frame [8 x 4]
Groups: a

      a         b c   d
1 Peter  Baseline 6 Arm
2 Peter  Baseline 6 Leg
3 Peter Follow-up 1 Arm
4 Peter Follow-up 1 Leg
5   Pan  Baseline 7 Arm
6   Pan  Baseline 7 Leg
7   Pan Follow-up 3 Arm
8   Pan Follow-up 2 Leg

I lose all the individuals for whom I may have had incomplete information with regards to one location, e.g. Mouse has complete data for df$d == "Leg" and thus I could use this downstream.

If I apply Paulo's solution - I get an incomplete dataframe, because both measurements at baseline for Mouse for example were retained.

df %>% 
filter(complete.cases(.))%>%
group_by(a)%>%
mutate(n=length(c))%>% 
filter(n>1)
Source: local data frame [14 x 5]
Groups: a

        a         b c   d n
1   Peter  Baseline 6 Arm 4
2   Peter  Baseline 6 Leg 4
3   Peter Follow-up 1 Arm 4
4   Peter Follow-up 1 Leg 4
5     Pan  Baseline 7 Arm 4
6     Pan  Baseline 7 Leg 4
7     Pan Follow-up 3 Arm 4
8     Pan Follow-up 2 Leg 4
9  Mickey  Baseline 9 Leg 3
10 Mickey Follow-up 1 Arm 3
11 Mickey Follow-up 1 Leg 3
12  Mouse  Baseline 7 Arm 3
13  Mouse  Baseline 7 Leg 3
14  Mouse Follow-up 2 Leg 3

Note, my real data has a situation where "Peter" has 4 baseline and follow-up measurements, "Pan" has 6 baseline and follow-up measurements etc.

Sorry to be a hassle and as written in the comments, replies have been accepted of course. But if you have an idea how to solve this issue, that would be awesome! Thanks.

Upvotes: 1

Views: 269

Answers (2)

Paulo E. Cardoso
Paulo E. Cardoso

Reputation: 5856

EDITED

For your new problem, something like this

library(dplyr)
df%>%
  filter(complete.cases(.))%>%
  group_by(a, d)%>%
  mutate(n=length(c))%>%
  filter(n>1)%>%select(-n)

will give you this

Source: local data frame [12 x 4]
Groups: a, d

        a         b c   d
1   Peter  Baseline 6 Arm
2   Peter  Baseline 6 Leg
3   Peter Follow-up 1 Arm
4   Peter Follow-up 1 Leg
5     Pan  Baseline 7 Arm
6     Pan  Baseline 7 Leg
7     Pan Follow-up 3 Arm
8     Pan Follow-up 2 Leg
9  Mickey  Baseline 9 Leg
10 Mickey Follow-up 1 Leg
11  Mouse  Baseline 7 Leg
12  Mouse Follow-up 2 Leg

Upvotes: 3

jeremycg
jeremycg

Reputation: 24945

You could try a dplyr approach:

library(dplyr)
df %>% group_by(a) %>%
       filter(!any(is.na(c)))

This will kick out any groups (of your a) that have any NAs in the c column.

For the new data, unless I've read it wrong you just want to group by (a, d) now:

df %>% group_by(a, d) %>%
       filter(!any(is.na(c)))

Upvotes: 3

Related Questions