How can I identify whether missing values across variables come from the same ID variable?

Question

I have a large dataset of questionnaire data. In a subset of items (say var1 to var50), each item has 25 NAs. It is likely that these 25 NAs come from the same participants across all items, but I need to verify that this is actually true.

I managed to do this in quite a tedious way and I am looking for a more elegant solution to the problem.

Here is a working example of my solution in R:

ID <- 1:10
var1 <- c(1,2,3,2,1,NA,1,3,2,NA)
var2 <- c(2,1,3,1,2,NA,3,2,1,NA)

df <- data.frame(ID,var1,var2)
# IDs that are missing on both var1 and var2
df[is.na(df$var1) & is.na(df$var2), ]$ID

As you can see, I would need to write out every individual variable name, which becomes very tedious with 50 or more questionnaire items.

Edo · Accepted Answer

You can count how many NAs each row has (excluding the ID column) like this:

n_na <- rowSums(is.na(df[,-1]))

Then you can see which IDs have all NAs and which have only some.

# all NAs
df[n_na == (ncol(df)-1), "ID"]
#> [1]  6 10

# at least one NA
df[n_na > 0, "ID"]
#> [1]  6 10

# some but not all
df[n_na > 0 & n_na < (ncol(df)-1), "ID"]
#> integer(0)
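
As a quick summary of the same idea, here is a minimal sketch on the toy data from the question. Tabulating n_na shows how many rows have each missing count; if the missing values all come from the same participants, only 0 and the total number of items should appear. The names na_counts and all_or_nothing are just illustrative.

```r
# Toy data from the question
df <- data.frame(
  ID   = 1:10,
  var1 = c(1, 2, 3, 2, 1, NA, 1, 3, 2, NA),
  var2 = c(2, 1, 3, 1, 2, NA, 3, 2, 1, NA)
)

# Number of missing items per row (ID column excluded)
n_na <- rowSums(is.na(df[, -1]))

# Distribution of missing counts across rows
na_counts <- table(n_na)
na_counts

# TRUE if every row is either complete or missing on all items
all_or_nothing <- all(n_na %in% c(0, ncol(df) - 1))
all_or_nothing
```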

This scales well to many variables, since no column names need to be typed out.

Where df is:

ID <- 1:10
var1 <- c(1,2,3,2,1,NA,1,3,2,NA)
var2 <- c(2,1,3,1,2,NA,3,2,1,NA)

df <- data.frame(ID,var1,var2)
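
If the data frame contains other columns besides the ID and the items, dropping only the first column is not enough. The sketch below is one possible variation, not part of the answer above: it selects the item columns by name and checks directly that every item is missing for exactly the same IDs. The "var<number>" naming pattern is an assumption taken from the question, and items, missing_ids and same_ids are illustrative names.

```r
# Toy data from the question
df <- data.frame(
  ID   = 1:10,
  var1 = c(1, 2, 3, 2, 1, NA, 1, 3, 2, NA),
  var2 = c(2, 1, 3, 1, 2, NA, 3, 2, 1, NA)
)

# Select the item columns by name pattern instead of by position
# (assumes items are named var1, var2, ...; adjust the pattern as needed)
items <- grep("^var[0-9]+$", names(df), value = TRUE)

# For each item, the IDs that are missing on it
missing_ids <- lapply(df[items], function(x) df$ID[is.na(x)])

# TRUE if every item is missing for exactly the same IDs as the first item
same_ids <- all(vapply(missing_ids, identical, logical(1), missing_ids[[1]]))
same_ids

# The shared set of IDs (only meaningful when same_ids is TRUE)
missing_ids[[1]]
```

With real data, a FALSE result means at least one item has a different set of missing participants, and missing_ids can then be inspected item by item.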
