Reputation: 5
I have a huge dataset of questionnaire data. Looking at a subset of items I can see that for each of the items (let's say var1:var50) there are 25 NAs. Whilst it is likely that these 25 NAs are each coming from the same participants across items, I need to actually verify that this is true.
I managed to do this in quite a tedious way and I am looking for a more elegant solution to the problem.
Here is a working example of my solution in R:
ID <- 1:10
var1 <- c(1,2,3,2,1,NA,1,3,2,NA)
var2 <- c(2,1,3,1,2,NA,3,2,1,NA)
df <- data.frame(ID,var1,var2)
df[is.na(df$var1) & is.na(df$var2), ]$ID
As you can see, I would need to write out every individual variable name, which becomes very tedious with 50 or more questionnaire items.
Upvotes: 0
Views: 141
Reputation: 177
There is a package called DataExplorer that can come in handy for large datasets. Once you have installed DataExplorer, just load the library:
library(DataExplorer)
Once it is loaded, you can simply call plot_missing. Below is the code:
plot_missing(dataframe)
Once you run it, you can see the count as well as a visualization of the missing values.
The best part is that you can use it on any data frame. If you need Python code for this library, I have that as well; I am working on creating a package in Python.
You can also use this package for other EDA tasks.
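A per-variable count on its own does not show whether the missing values sit in the same rows. Below is a minimal base R sketch of that check, reusing the toy df from the question. The names items, na_mat and same_pattern are only illustrative, and it assumes every column except ID is a questionnaire item.

```r
# Toy data from the question
ID <- 1:10
var1 <- c(1, 2, 3, 2, 1, NA, 1, 3, 2, NA)
var2 <- c(2, 1, 3, 1, 2, NA, 3, 2, 1, NA)
df <- data.frame(ID, var1, var2)

# Assumption: every column except ID is a questionnaire item
items <- setdiff(names(df), "ID")

# Logical matrix: TRUE where a value is missing
na_mat <- is.na(df[items])

# Number of NAs per variable
colSums(na_mat)

# For each item, check that its NA pattern matches the first item's pattern
# (the vector na_mat[, 1] is recycled down each column of the matrix)
same_pattern <- colSums(na_mat != na_mat[, 1]) == 0

# TRUE only if all items are missing for exactly the same rows
all(same_pattern)
```

If all(same_pattern) is FALSE, names(same_pattern)[!same_pattern] lists the items whose missing rows differ from the first item.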
Upvotes: 0
Reputation: 7858
You can calculate how many NAs each row has like this:
n_na <- rowSums(is.na(df[,-1]))
Then you can see which IDs have NAs in every variable and which have NAs in only some of them.
# all NAs
df[n_na == (ncol(df)-1), "ID"]
#> 6 10
# some NAs
df[n_na > 0, "ID"]
#> 6 10
# some but not all
df[n_na > 0 & n_na < (ncol(df)-1), "ID"]
#> integer(0)
It's pretty scalable if you have many variables to handle.
Where df is:
ID <- 1:10
var1 <- c(1,2,3,2,1,NA,1,3,2,NA)
var2 <- c(2,1,3,1,2,NA,3,2,1,NA)
df <- data.frame(ID,var1,var2)
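If the data frame also holds columns other than the items of interest, the same idea can be limited to a subset chosen by name. This is only a sketch: the pattern "^var[0-9]+$" assumes the items are named var1, var2, and so on, and items and ids_all_na are illustrative names.

```r
# Toy data from the question
ID <- 1:10
var1 <- c(1, 2, 3, 2, 1, NA, 1, 3, 2, NA)
var2 <- c(2, 1, 3, 1, 2, NA, 3, 2, 1, NA)
df <- data.frame(ID, var1, var2)

# Assumption: the questionnaire items are named var1, var2, ...
items <- grep("^var[0-9]+$", names(df), value = TRUE)

# Number of missing items per participant, counted over the selected columns only
n_na <- rowSums(is.na(df[items]))

# TRUE when every participant is either complete or missing on all selected items,
# i.e. the NAs come from the same participants across items
all(n_na %in% c(0, length(items)))

# IDs of the participants who are missing on every selected item
ids_all_na <- df$ID[n_na == length(items)]
ids_all_na
```

Selecting by name rather than by position (df[,-1]) keeps the check correct if further non-item columns are added to the data frame later.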
Upvotes: 1