Reputation: 555

Deleting rows in R based on values over multiple columns

I'm trying to remove rows from a dataframe.

I need to remove rows that have only "NONE" or white space across the entire range of columns I provide. The rows that need to be removed must satisfy these conditions 1. a combination of only "NONE" and white space or all "NONE" or all white space.

Because there are cases where having "NONE" or white space in some of the columns is okay I can't just filter out rows when reading in the csv with something like

dataframe$col1 =="NONE" | str_length(dataframe$col1)==0

I know this would normally be a simple problem where I could run a for loop that turns all "NONE" values and whitespace in a dataframe to NA and use complete.cases across whichever columns I need (doc). However, I'm specifically being asked to use a method that does not change the values. Any suggestions?

Edit: I don't have the data, but here is a made up example of a dataframe that would be similar to what I have to work with

In this dataframe the only row that must be removed is row 3 (or row 4 if you include the header).

The final dataset will have many more columns than this made up example

Upvotes: 1

Answers (4)

Gwang-Jin Kim

Reputation: 10128

is.none <- function(x) tolower(x) == "none"
is.whitespace <- function(x) grepl("^\\s+$", x)
is.empty <- function(x) length(x) == 0 || x == "" || is.na(x) || is.nan(x)
is.none.whitespace.empty <- function(x) is.none(x) || is.whitespace(x) || is.empty(x)

is.none.whitespace.empty <- Vectorize(is.none.whitespace.empty)

remove.empty.rows <- function(df, cols) {
  df[!sapply(1:nrow(df), 
             function(i) all(is.none.whitespace.empty(df[i, cols]))), ]
}

Now you can test it:

# in your case:
remove.empty.rows(df, 1) # remove if first column content is "empty"

# but you can determine which columns should be examined for being all
# "empty".
# let's say, you want to evaluate only first, third and fifth column:
remove.empty.rows(df, c(1, 3, 5))

Upvotes: 1

dmca

Reputation: 695

You can use dplyr::filter_all() to accomplish this:

library(dplyr)

df <- data.frame(column.1 = c('a', 'b', 'NONE', 'b', 'b'),
                 column.2 = c('a', 'b', '', 'b', 'b'),
                 column.3 = rep('', 5),
                 column.4 = rep('', 5),
                 column.5 = rep('', 5))

df %>%
  filter_all(any_vars(. != 'NONE' & . != ''))

Upvotes: 3

pooja p

Reputation: 144

Since the rows with NONE and whitespace need to be removed, it would be:

dataframe <- filter(dataframe, col1 != "NONE" & str_length(col1) != 0)

Upvotes: 2

joeflo

Reputation: 19

I would recommend using the filter() command from the dplyr package (part of the tidyverse library). It would look something like this:

dataframe_new <- filter(dataframe, col1 == "" | str_length(col1) == 0)

Upvotes: 1

Deleting rows in R based on values over multiple columns

Answers (4)

Related Questions