Reputation: 431
I have a data frame that looks like this:
+-----------------+--------+
| Geography | Values |
+-----------------+--------+
| Atlanta, GA | 78 |
+-----------------+--------+
| New York, NY | 30 |
+-----------------+--------+
| Denver, CO | 20 |
+-----------------+--------+
| Omaha, NE | 178 |
+-----------------+--------+
| Los Angeles, CA | 58 |
+-----------------+--------+
| Providence, RI | 100 |
+-----------------+--------+
| Little Rock, AR | 20 |
+-----------------+--------+
| Miami, FL | 50 |
+-----------------+--------+
| ... | |
+-----------------+--------+
I would look to perform an operation in tidyverse/dplyr format so that I can filter out any rows that is from the state of GA & CA. Notice that there is always a ", " (a comma, followed by a space) before the state abbreviation.
The resulting dataframe should look like:
+-----------------+--------+
| Geography | Values |
+-----------------+--------+
| New York, NY | 30 |
+-----------------+--------+
| Denver, CO | 20 |
+-----------------+--------+
| Omaha, NE | 178 |
+-----------------+--------+
| Providence, RI | 100 |
+-----------------+--------+
| Little Rock, AR | 20 |
+-----------------+--------+
| Miami, FL | 50 |
+-----------------+--------+
| ... | |
+-----------------+--------+
The real data is much larger than this simple example. It is consists of hundreds of cities with multiple cities in a state, so I can not simply do something like:
data %>%
filter (Geography == "Atlanta, GA" | Geography == "Los Angeles, CA")
Should I create a new "State" column that takes out the last 2 letters of the "Geography" column, and filter on that "State" column, or can I do something regex related such as:
exclude_list = c("GA, CA")
data %>%
filter (Geography != end_with(exclude_list))
What is an elegant way to do this? Thanks so much for your help!
Upvotes: 0
Views: 3099
Reputation: 388862
You can construct exclude_list
as :
exclude_list = c("GA", "CA")
Then use subset
as :
subset(data, !grepl(sprintf('(%s)$',
paste0(exclude_list, collapse = '|')), Geography))
Or if you need dplyr
answer do :
library(dplyr)
data %>%
filter(!grepl(sprintf('(%s)$',
paste0(exclude_list, collapse = '|')), Geography))
where
sprintf('(%s)$', paste0(exclude_list, collapse = '|')) #returns
#[1] "(GA|CA)$"
If exclude_list
is too big the regex answer might fail in such case suggestion by @thelatemail would be helpful where we keep only the state name and match them with %in%
:
data[!sub("^.+,\\s+", "", data$Geography) %in% exclude_list,]
Upvotes: 1
Reputation: 101189
A data.table
option with grepl
> setDT(df)[!grepl(",\\s+(CA|GA)$", Geography)]
Geography Values
1: New York, NY 30
2: Denver, CO 20
3: Omaha, NE 178
4: Providence, RI 100
5: Little Rock, AR 20
6: Miami, FL 50
or subset
if you are with base R
> subset(df, !grepl(",\\s+(CA|GA)$", Geography))
Geography Values
2 New York, NY 30
3 Denver, CO 20
4 Omaha, NE 178
6 Providence, RI 100
7 Little Rock, AR 20
8 Miami, FL 50
Data
> dput(df)
structure(list(Geography = c("Atlanta, GA", "New York, NY", "Denver, CO",
"Omaha, NE", "Los Angeles, CA", "Providence, RI", "Little Rock, AR",
"Miami, FL"), Values = c(78L, 30L, 20L, 178L, 58L, 100L, 20L,
50L)), class = c("data.table", "data.frame"), row.names = c(NA,
-8L))
Upvotes: 0
Reputation: 6931
I would recommend doing this with regex. The $
in a regular expression indicates the end of the line. So grepl(" CA$", Geography)
will return true if the geography ends with a space and the letters CA.
Hence I would do something like:
data %>%
filter(!grepl(" CA$", Geography),
!grepl(" GA$", Geography))
Upvotes: 0