Reputation: 45

Selecting columns based on missing values in each row

I would like to know (for each row) which columns any NA in my data falls under. The goal is to create a new column/variable that lists the names of the columns the data shows NA for, for that particular row, preferably using dplyr.

Using this mock data,

data = tibble(var_1 = c(NA, 4, 5, 6, 7), var_2 = c(4, 5, 6, 7, 8), var_3 = c(NA, NA, NA, 3, 5))

I'd like to create the missing_col column:

  var_1 var_2 var_3       missing_col
1    NA     4    NA  "var_1", "var_3"             
2     4     5    NA           "var_3"
3     5     6    NA           "var_3"
4     6     7     3                NA
5     7     8     5                NA

My approach thus far has been to use the rowwise() function in conjunction with mutate and a nested select_if() and a function. However, none of the functions that I have tried so far have allowed me to only consider each row individually (as opposed to the entire column). Below I have included the general structure of my approach.

data %>% 
  rowwise() %>%
  mutate(missing_col = select_if(function(x) ... )) %>%
  names()

Any guidance toward the appropriate function would be appreciated.

Upvotes: 2

Answers (4)

Leo Brueggeman

Reputation: 2521

This isn't a tidy way, but an apply seems to be simple enough:

data = tibble(var_1 = c(NA, 4, 5, 6, 7), var_2 = c(4, 5, 6, 7, 8), var_3 = c(NA, NA, NA, 3, 5))

data$missing = apply(data,1,function(x) names(x)[is.na(x)])

which outputs:

data
# A tibble: 5 x 4
  var_1 var_2 var_3 missing  
  <dbl> <dbl> <dbl> <list>   
1 NA     4.00 NA    <chr [2]>
2  4.00  5.00 NA    <chr [1]>
3  5.00  6.00 NA    <chr [1]>
4  6.00  7.00  3.00 <chr [0]>
5  7.00  8.00  5.00 <chr [0]>

So each element of the missing column is a list containing the character vector

data$missing[[1]]
[1] "var_1" "var_3"
data$missing[[1]][1]
[1] "var_1"

Upvotes: 0

Tom

Reputation: 602

A possible solution is to use apply directly in mutate instead of rowwise. Perhaps an equivalent approach is possible with rowwise, but my experience with that function is limited. The second mutate is only necessary if you need NA for lines without NA.

data %>% 
  mutate(missing_col = apply(., 1, function(x) names(.)[is.na(x)] %>% paste(collapse = ", "))) %>% 
  mutate(missing_col = if_else(missing_col == "", NA_character_, missing_col))

Upvotes: 0

Jilber Urbina

Reputation: 61214

> data %>% 
+   mutate(missing_col = apply(., 1, function(x) which(is.na(x)))  %>% 
+            map_chr(., function(x) if_else(length(x)==0, 
+                                           "NA", 
+                                           paste(names(x), collapse=", "))))
# A tibble: 5 x 4
  var_1 var_2 var_3 missing_col 
  <dbl> <dbl> <dbl> <chr>       
1    NA     4    NA var_1, var_3
2     4     5    NA var_3       
3     5     6    NA var_3       
4     6     7     3 NA          
5     7     8     5 NA

Upvotes: 1

Cris

Reputation: 796

You can try this:

#Unlist the results from apply
missing_col=unlist(x=apply(X=data, MARGIN=1, FUN=function(x){

  #Get the names of the rows which have NA
  NamesNA=names(which(is.na(x)))

  #If there's no NA then replace the 'character(0)' for NA.
  if(length(NamesNA)!=0){

    #Concatenate names if there are more than one
    paste0(NamesNA, collapse=",")

  }else{

    #Replace 'character(0)'
    NA  

  }
})
)

#Add column with desired output
data$missing_col=missing_col

This give the desired output:

# A tibble: 5 x 4
  var_1 var_2 var_3 missing_col
  <dbl> <dbl> <dbl> <chr>      
1    NA     4    NA var_1,var_3
2     4     5    NA var_3      
3     5     6    NA var_3      
4     6     7     3 NA         
5     7     8     5 NA

Upvotes: 1

Selecting columns based on missing values in each row

Answers (4)

Related Questions