Reputation: 661
I have the following data frame:
a <- c(3, 2, 1)
a_1 <- c(NA, "", NA)
b <- c(3, 4, 1)
b_1 <- c(3, NA, 4)
c <- c("", "", "")
c_1 <- c(5, 8, 9)
d <- c(6, 9, 10)
d_1 <- c("", "", "")
e <- c(NA, NA, NA)
e_1 <- c(NA, NA, NA)
df <- data.frame(a, a_1, b, b_1, c, c_1,
d, d_1, e, e_1)
I want to remove columns that contain "_1" with only empty cells and NAs. However, I've only been able to find the code that removes ALL columns in the data frame that contain empty cells and NAs.
empty_columns <- colSums(is.na(df) |
df == "") == nrow(df)
df[, !empty_columns]
df <- df[, colSums(is.na(df)) < nrow(df)]
a b b_1 c_1 d
1 3 3 3 5 6
2 2 4 NA 8 9
3 1 1 4 9 10
But I want my resulting data frame to look like this:
df2 <- data.frame(a, b, b_1, c, c_1, d, e)
a b b_1 c c_1 d e
1 3 3 3 5 6 NA
2 2 4 NA 8 9 NA
3 1 1 4 9 10 NA
Upvotes: 4
Views: 159
Reputation: 16842
A couple more riffs on the empty_columns
vector if you're like me and prefer iterating over columns. The tricky thing in the OP is having access to both the names and values of each column; you can do this by working across names and retrieving their columns, or with purrr::imap_lgl
which is a shortcut for that.
empty1 <- sapply(names(df), function(n) all(is.na(df[n]) | df[n] == "") & grepl("_1", n))
empty2 <- purrr::imap_lgl(df, function(col, id) all(is.na(col) | col == "") & grepl("_1", id))
You could also use regex in place of combining is.na
and == ""
.
empty3 <- purrr::imap_lgl(df, function(col, id) !any(grepl(".", col)) & grepl("_1", id))
All 3 get the same set of columns to drop.
df[!empty1]
#> a b b_1 c c_1 d e
#> 1 3 3 3 5 6 NA
#> 2 2 4 NA 8 9 NA
#> 3 1 1 4 9 10 NA
Upvotes: 0
Reputation: 887118
Here is one option with tidyverse
, select
the columns that ends_with
("_1") and (&
) those have all NA
values (after converting the blanks (""
) to NA
with na_if
, then remove those columns with -
library(dplyr)
df %>%
select(-(ends_with("_1") & where(~ all(is.na(na_if(as.character(.x), ""))))))
-output
a b b_1 c c_1 d e
1 3 3 3 5 6 NA
2 2 4 NA 8 9 NA
3 1 1 4 9 10 NA
Upvotes: 3
Reputation: 19088
In addition to the condition test for ""
and NA
, using grepl
to filter colnames
.
df[, !(sapply(df, function(x)
(any(x == "", na.rm=T) & any(is.na(x))) |
all(x == "", na.rm=T) & !all(is.na(x))) & grepl("_1$", colnames(df)))]
a b b_1 c c_1 d e e_1
1 3 3 3 5 6 NA NA
2 2 4 NA 8 9 NA NA
3 1 1 4 9 10 NA NA
This also works if there is a column with empty cells together with alphanumeric values.
Upvotes: 1
Reputation: 72828
Do some cleaning and type.convert
columns first.
df[1:10] <- type.convert(df[1:10], as.is=TRUE)
Then it's just
df[!(colSums(is.na(df)) == nrow(df) & grepl('_1', names(df)))]
# a b b_1 c c_1 d e
# 1 3 3 3 NA 5 6 NA
# 2 2 4 NA NA 8 9 NA
# 3 1 1 4 NA 9 10 NA
Upvotes: 1
Reputation: 18425
You could redefine empty_columns
to check whether the column name contains "_1"
...
empty_columns <- colSums(is.na(df) | df == "") == nrow(df) & grepl("_1", names(df))
Upvotes: 4