landrower
landrower

Reputation: 505

How to identify subset of rows containing only 'MISSING' and zeroes ('0') in R

I created a dataset by joining multiple tables to a primary table containing department_id and year. The final data frame had a lot of missing values which I then imputed with 'MISSING' for categorical and with '0'(zero) for continuous variables.

I now want to remove the subset of rows that are populated with either 'MISSING' or '0' (i.e. have no other values), how can I do this in R?

Thanks

Upvotes: 0

Views: 578

Answers (2)

user10191355
user10191355

Reputation:

I would strongly suggest leaving your NAs the way they are if you can. R has built-in generic functions for dealing with NAs across classes that can make your life much easier. If your missings are indicated by different values for each data type then you'll need to add a comparison for each type of missing, which isn't very efficient.

It's also worth mentioning that the options below are generalizable, i.e. they will work on data frames with any number of columns, so you don't need to add a comparison for each new column.

First, generate some data to test with:

df <- data.frame(num = c(1, 0, 3, 4, 0, 5),
                 cat = c("a", "b", "c", "d", "MISSING", "MISSING")
                 )

#### OUTPUT ####
  num     cat
1   1       a
2   0       b # <- keep
3   3       c
4   4       d
5   0 MISSING # <- drop
6   5 MISSING # <- keep

You can filter using base R or dplyr (among other options):

# Base R option
df[rowSums(df == "MISSING" | df == 0) < ncol(df),]

# Tidyverse option using dplyr
library(dplyr)
filter_all(df, any_vars(!(. == "MISSING" | . == 0)))

The output for both options will look like this:

  num     cat
1   1       a
2   0       b # <- kept
3   3       c
4   4       d
5   5 MISSING # <- kept

Just for the sake of argument, here's how you can simplify things by leaving NAs as they are. First some new data:

df_na <- data.frame(num = c(1, NA, 3, 4, NA, 5),
                    cat = c("a", "b", "c", "d", NA, NA)
                    )

#### OUTPUT ####
  num  cat
1   1    a
2  NA    b # <- keep
3   3    c
4   4    d
5  NA <NA> # <- drop
6   5 <NA> # <- keep

Now we can use the same strategies as above, but we only need to use is.na() rather than adding a comparison for each type of missing value:

# Using base R
df_na[rowSums(is.na(df_na)) < ncol(df_na),]

# Using dplyr
library(dplyr)
filter_all(df_na, any_vars(!is.na(.)))

#### OUTPUT ####
  num  cat
1   1    a
2  NA    b # <- kept
3   3    c
4   4    d
6   5 <NA> # <- kept

Upvotes: 1

jdonland
jdonland

Reputation: 319

You are right that Ott's solution doesn't do what they say it does. Here's his solution implemented correctly, in base R and in dplyr. Note that you will have to duplicate each != 0 clause for each of your columns.

# create some dummy data
data <- data.frame(
numeric = c(1, 2, 3, 0, 0, 0, 4, 5, 6),
categorical = c("MISSING", "A", "B", "MISSING", "C", "MISSING", "D", "MISSING", "E")
)

# base R solution
data[data$numeric != 0 | data$categorical != "MISSING", ]

# dplyr solution
filter(data, numeric != 0 | categorical != "MISSING")

Upvotes: 0

Related Questions