Reputation: 505
I created a dataset by joining multiple tables to a primary table containing department_id and year. The final data frame had a lot of missing values which I then imputed with 'MISSING' for categorical and with '0'(zero) for continuous variables.
I now want to remove the subset of rows that are populated with either 'MISSING' or '0' (i.e. have no other values), how can I do this in R?
Thanks
Upvotes: 0
Views: 578
Reputation:
I would strongly suggest leaving your NA
s the way they are if you can. R has built-in generic functions for dealing with NA
s across classes that can make your life much easier. If your missings are indicated by different values for each data type then you'll need to add a comparison for each type of missing, which isn't very efficient.
It's also worth mentioning that the options below are generalizable, i.e. they will work on data frames with any number of columns, so you don't need to add a comparison for each new column.
First, generate some data to test with:
df <- data.frame(num = c(1, 0, 3, 4, 0, 5),
cat = c("a", "b", "c", "d", "MISSING", "MISSING")
)
#### OUTPUT ####
num cat
1 1 a
2 0 b # <- keep
3 3 c
4 4 d
5 0 MISSING # <- drop
6 5 MISSING # <- keep
You can filter using base R or dplyr
(among other options):
# Base R option
df[rowSums(df == "MISSING" | df == 0) < ncol(df),]
# Tidyverse option using dplyr
library(dplyr)
filter_all(df, any_vars(!(. == "MISSING" | . == 0)))
The output for both options will look like this:
num cat
1 1 a
2 0 b # <- kept
3 3 c
4 4 d
5 5 MISSING # <- kept
Just for the sake of argument, here's how you can simplify things by leaving NA
s as they are. First some new data:
df_na <- data.frame(num = c(1, NA, 3, 4, NA, 5),
cat = c("a", "b", "c", "d", NA, NA)
)
#### OUTPUT ####
num cat
1 1 a
2 NA b # <- keep
3 3 c
4 4 d
5 NA <NA> # <- drop
6 5 <NA> # <- keep
Now we can use the same strategies as above, but we only need to use is.na()
rather than adding a comparison for each type of missing value:
# Using base R
df_na[rowSums(is.na(df_na)) < ncol(df_na),]
# Using dplyr
library(dplyr)
filter_all(df_na, any_vars(!is.na(.)))
#### OUTPUT ####
num cat
1 1 a
2 NA b # <- kept
3 3 c
4 4 d
6 5 <NA> # <- kept
Upvotes: 1
Reputation: 319
You are right that Ott's solution doesn't do what they say it does. Here's his solution implemented correctly, in base R and in dplyr
. Note that you will have to duplicate each != 0
clause for each of your columns.
# create some dummy data
data <- data.frame(
numeric = c(1, 2, 3, 0, 0, 0, 4, 5, 6),
categorical = c("MISSING", "A", "B", "MISSING", "C", "MISSING", "D", "MISSING", "E")
)
# base R solution
data[data$numeric != 0 | data$categorical != "MISSING", ]
# dplyr solution
filter(data, numeric != 0 | categorical != "MISSING")
Upvotes: 0