DPatrick
DPatrick

Reputation: 431

R: how to filter out rows that end with a specific list characters?

I have a data frame that looks like this:

+-----------------+--------+
| Geography       | Values |
+-----------------+--------+
| Atlanta, GA     | 78     |
+-----------------+--------+
| New York, NY    | 30     |
+-----------------+--------+
| Denver, CO      | 20     |
+-----------------+--------+
| Omaha, NE       | 178    |
+-----------------+--------+
| Los Angeles, CA | 58     |
+-----------------+--------+
| Providence, RI  | 100    |
+-----------------+--------+
| Little Rock, AR | 20     |
+-----------------+--------+
| Miami, FL       | 50     |
+-----------------+--------+
| ...             |        |
+-----------------+--------+

I would look to perform an operation in tidyverse/dplyr format so that I can filter out any rows that is from the state of GA & CA. Notice that there is always a ", " (a comma, followed by a space) before the state abbreviation.

The resulting dataframe should look like:

+-----------------+--------+
| Geography       | Values |
+-----------------+--------+
| New York, NY    | 30     |
+-----------------+--------+
| Denver, CO      | 20     |
+-----------------+--------+
| Omaha, NE       | 178    |
+-----------------+--------+
| Providence, RI  | 100    |
+-----------------+--------+
| Little Rock, AR | 20     |
+-----------------+--------+
| Miami, FL       | 50     |
+-----------------+--------+
| ...             |        |
+-----------------+--------+

The real data is much larger than this simple example. It is consists of hundreds of cities with multiple cities in a state, so I can not simply do something like:

data %>%
filter (Geography == "Atlanta, GA" | Geography == "Los Angeles, CA")

Should I create a new "State" column that takes out the last 2 letters of the "Geography" column, and filter on that "State" column, or can I do something regex related such as:

exclude_list = c("GA, CA")

data %>%
filter (Geography != end_with(exclude_list))

What is an elegant way to do this? Thanks so much for your help!

Upvotes: 0

Views: 3099

Answers (3)

Ronak Shah
Ronak Shah

Reputation: 388862

You can construct exclude_list as :

exclude_list = c("GA", "CA")

Then use subset as :

subset(data, !grepl(sprintf('(%s)$', 
                    paste0(exclude_list, collapse = '|')), Geography))

Or if you need dplyr answer do :

library(dplyr)
data %>%
  filter(!grepl(sprintf('(%s)$', 
                paste0(exclude_list, collapse = '|')), Geography))

where

sprintf('(%s)$', paste0(exclude_list, collapse = '|')) #returns
#[1] "(GA|CA)$"

If exclude_list is too big the regex answer might fail in such case suggestion by @thelatemail would be helpful where we keep only the state name and match them with %in% :

data[!sub("^.+,\\s+", "", data$Geography) %in% exclude_list,]

Upvotes: 1

ThomasIsCoding
ThomasIsCoding

Reputation: 101189

A data.table option with grepl

> setDT(df)[!grepl(",\\s+(CA|GA)$", Geography)]
         Geography Values
1:    New York, NY     30
2:      Denver, CO     20
3:       Omaha, NE    178
4:  Providence, RI    100
5: Little Rock, AR     20
6:       Miami, FL     50

or subset if you are with base R

> subset(df, !grepl(",\\s+(CA|GA)$", Geography))
        Geography Values
2    New York, NY     30
3      Denver, CO     20
4       Omaha, NE    178
6  Providence, RI    100
7 Little Rock, AR     20
8       Miami, FL     50

Data

> dput(df)
structure(list(Geography = c("Atlanta, GA", "New York, NY", "Denver, CO", 
"Omaha, NE", "Los Angeles, CA", "Providence, RI", "Little Rock, AR",
"Miami, FL"), Values = c(78L, 30L, 20L, 178L, 58L, 100L, 20L,
50L)), class = c("data.table", "data.frame"), row.names = c(NA,
-8L))

Upvotes: 0

Simon.S.A.
Simon.S.A.

Reputation: 6931

I would recommend doing this with regex. The $ in a regular expression indicates the end of the line. So grepl(" CA$", Geography) will return true if the geography ends with a space and the letters CA.

Hence I would do something like:

data %>%
  filter(!grepl(" CA$", Geography),
         !grepl(" GA$", Geography))

Upvotes: 0

Related Questions