user11406557

Reputation: 1

I am trying to identify patterns of missing values in rows of a dataset

I am trying to find patterns in missing values in rows.

For example, if I have this data set:

        a   b    c    d
        1   0.1  NA   NA
        2   NA   3    4
        5   NA   6    NA

I expect the output to be:

      n  a   b  c   d  m
      1  0   0  1   1  2
      1  0   1  0   0  1
      1  0   1  0   1  2

where column n is the number of rows that share a pattern, column m is the number of missing values in that pattern, and a 1 in columns a to d marks a missing value. For example, the first row of the output means that 1 row is missing 2 values, in variables c and d; the second row means that 1 row is missing 1 value, in variable b; and so on.

I have tried using the subtable() function from the extracat package (archived version), but I can't find the locations of the missing values in each variable. I can only find frequencies.

# Count the missing values in each row
rowmiss <- rowSums(is.na(dat1))
r1 <- matrix(rowmiss, nrow = nrow(dat1))
# Tabulate the per-row counts with subtable() from extracat
subtable(rowmiss, 1)

I expect the output to be as shown above. So far I only get the frequency of missing values per row, but I want the patterns and positions of the missing values.

Upvotes: 0

Views: 470

Answers (2)

Marius

Reputation: 60180

An alternative way of doing this with tidyverse:

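library(tidyverse)

df %>%
    # 1 where a value is missing, 0 otherwise
    mutate_all(~ is.na(.) %>% as.numeric()) %>%
    # m: number of missing values in the row
    mutate(m = rowSums(.)) %>%
    # n: number of rows that share each pattern
    group_by_all() %>%
    count()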

Output (you may also want to ungroup() if doing anything further with the df):

# A tibble: 3 x 6
# Groups:   a, b, c, d, m [3]
      a     b     c     d     m     n
  <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1     0     0     1     1     2     1
2     0     1     0     0     1     1
3     0     1     0     1     2     1

mice::md.pattern() also does basically what you want, but it returns a matrix with some of the useful info in the rownames, so it would require a bit of processing to turn into a data frame.
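A rough sketch of that processing, assuming the mice package is installed and that md.pattern() returns its usual layout: pattern counts in the rownames, a final unnamed column with the number of missing values per pattern, and a final summary row. Note that md.pattern() codes 1 as observed and 0 as missing, so the indicators are flipped to match the question. The names df, pat, ind and result are only for illustration:

```r
library(mice)

# Example data from the question
df <- data.frame(a = c(1, 2, 5),
                 b = c(0.1, NA, NA),
                 c = c(NA, 3, 6),
                 d = c(NA, 4, NA))

pat <- md.pattern(df, plot = FALSE)

# Drop the last row, which holds per-variable totals rather than a pattern
pat <- pat[-nrow(pat), , drop = FALSE]

# Rownames hold the number of rows that share each pattern
n <- as.integer(rownames(pat))

# The last (unnamed) column holds the number of missing values per pattern
m <- unname(pat[, ncol(pat)])

# Flip the coding so that 1 = missing, and restore the original column order
ind <- 1 - pat[, names(df), drop = FALSE]
rownames(ind) <- NULL

result <- data.frame(n = n, ind, m = m)
result
```

The row order follows whatever md.pattern() returns, so it may differ from the tidyverse output above.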

Upvotes: 1

Jon Spring

Reputation: 66915

Here's a tidyverse approach. The n column seems redundant; should it be doing something else?

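library(tidyverse)
df %>%
  rowid_to_column() %>%
  # reshape to long format: one row per (rowid, column) pair
  gather(col, val, -rowid) %>%
  # 1 where a value is missing, 0 otherwise
  mutate(val = is.na(val) * 1) %>%
  # m: number of missing values in each original row
  group_by(rowid) %>% mutate(m = sum(val)) %>% ungroup() %>%
  # back to wide format, one row per original row
  spread(col, val) %>%
  mutate(n = 1) %>%
  select(n, a:d, m)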

# A tibble: 3 x 6
      n     a     b     c     d     m
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     0     0     1     1     2
2     1     0     1     0     0     1
3     1     0     1     0     1     2
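On the n question: going by the description in the question, n is the number of rows that share a pattern, so it only exceeds 1 when a pattern repeats. A minimal base R sketch of that reading, assuming a hypothetical fourth row (7, NA, 8, 9) added to the example so that one pattern occurs twice. The extra row and the names miss and patterns are illustrative, not from the original post:

```r
# Example data from the question plus one hypothetical extra row
# that repeats the "only b is missing" pattern
df <- data.frame(a = c(1, 2, 5, 7),
                 b = c(0.1, NA, NA, NA),
                 c = c(NA, 3, 6, 8),
                 d = c(NA, 4, NA, 9))

# 1 where a value is missing, 0 otherwise
miss <- as.data.frame(is.na(df) * 1)

# m: number of missing values in each row
miss$m <- rowSums(miss)

# Each row counts once; summing within identical patterns gives n
miss$n <- 1
patterns <- aggregate(n ~ ., data = miss, FUN = sum)

# Put the columns in the order used in the question
patterns <- patterns[, c("n", names(df), "m")]
patterns
```

With this input, the pattern where only b is missing ends up with n = 2, and the other two patterns keep n = 1.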

Upvotes: 1
