Adam_S

Reputation: 770

remove/replace specific words or phrases from character strings - R

I looked around both here and elsewhere and found many similar questions, but none that exactly answers mine. I need to clean up naming conventions: specifically, replace or remove certain words and phrases in a single column/variable, not the entire dataset. I am migrating from SPSS to R. An example of the SPSS code that does this is below, but I am not sure how to do it in R.

EG:

"Acadia Parish" --> "Acadia" (removes Parish and space before Parish)

"Fifth District" --> "Fifth" (removes District and space before District)

SPSS syntax:

COMPUTE county=REPLACE(county,' Parish','').

There are only a few instances of this issue in the column, which has 32,000 cases. What needs replacing or removing varies, and the affected values can repeat (there are dozens of instances of a phrase containing 'Parish'), so it is much faster to code exactly what needs to be removed or replaced. It is not as simple or clean as a single regular expression that removes all spaces, all characters after a specific word or character, all special characters, etc. The match must also be able to include leading spaces.

I have looked at replace(), gsub() and other similar functions in R, but they all involve creating vectors, or at least seem to. What I'd like is syntax that looks for characters I specify (which can include leading or trailing spaces) and replaces them with something I specify (which can be nothing at all). If it does not find those characters, the case should be left unchanged.

Yes, I will end up repeating the same syntax many times, and it's probably easier to create a vector, but if possible I'd like the syntax I described, as there are other similar operations I need to do as well.

Thank you for looking.

Upvotes: 0

Views: 7489

Answers (3)

Adam_S

Reputation: 770

dataframename$varname <- gsub(" Parish","", dataframename$varname)
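
For anyone wanting to try this on toy data, here is a minimal sketch. The data frame `df` and column `county` are made-up names for illustration, not from the original dataset. It also passes fixed = TRUE, which is an optional addition on my part: it makes gsub() treat the search text as a literal string rather than a regular expression, which seems closest to how REPLACE behaves in the SPSS line. Values that do not contain the phrase come back unchanged.

```r
# Hypothetical example data; `df` and `county` are illustrative names.
df <- data.frame(county = c("Acadia Parish", "Fifth District", "Orleans"),
                 stringsAsFactors = FALSE)

# fixed = TRUE matches " Parish" literally, so the leading space is
# part of what gets removed.
df$county <- gsub(" Parish", "", df$county, fixed = TRUE)

# One line per phrase, mirroring the repeated SPSS COMPUTE statements.
df$county <- gsub(" District", "", df$county, fixed = TRUE)

print(df$county)
# [1] "Acadia"  "Fifth"   "Orleans"
```

Only the one column is reassigned, so the rest of the data frame is untouched.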

Upvotes: 1

Petr Javorik

Reputation: 1863

> x <- c("Acadia Parish", "Fifth District")
> x2 <- gsub("^(\\w*).*$", "\\1", x)
> x2
[1] "Acadia" "Fifth"

Legend:

  • ^ Start of the string.
  • () Capture group.
  • \w* Zero or more word characters (letters, digits, underscore).
  • .* Zero or more occurrences of any character.
  • $ End of the string.
  • \1 Backreference to the first capture group, used in the replacement.
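
One caveat worth checking against your own data: this pattern keeps only the first word, so a multi-word name is cut short. The sketch below uses a made-up third value to show that, alongside an alternative that strips only a trailing " Parish" or " District". The alternative pattern is my own assumption about what is wanted, not part of the pattern above.

```r
x <- c("Acadia Parish", "Fifth District", "East Baton Rouge Parish")

# The pattern from above: keeps only the first run of word characters.
first_word <- gsub("^(\\w*).*$", "\\1", x)
print(first_word)
# [1] "Acadia" "Fifth"  "East"

# Alternative sketch: drop only a trailing " Parish" or " District",
# including the space in front of it.
suffix_removed <- sub(" (Parish|District)$", "", x)
print(suffix_removed)
# [1] "Acadia"           "Fifth"            "East Baton Rouge"
```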

Upvotes: 3

Chrisss

Reputation: 3241

Maybe I'm missing something, but I don't see why you can't simply use alternation (|) in your regular expression and then trim the leftover white space.

string <- c("Arcadia Parish", "Fifth District")

bad_words <- c("Parish", "District") # Write all the words you want removed here!
bad_regex <- paste(bad_words, collapse = "|")

trimws( sub(bad_regex, "", string) )

# [1] "Arcadia" "Fifth" 
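
If some phrases need a replacement other than the empty string, one possible variation is a named vector of literal find/replace pairs applied in a loop. This is only a sketch: `replace_phrases`, the "Saint " rule and the sample values are hypothetical, made up for illustration. It uses gsub() with fixed = TRUE so leading or trailing spaces in the search text are matched literally and every occurrence is replaced (sub() would only replace the first one per string).

```r
# Hypothetical helper: apply literal find/replace pairs in order.
# names(replacements) are the phrases to find, values are what to put in.
replace_phrases <- function(x, replacements) {
  for (pattern in names(replacements)) {
    x <- gsub(pattern, replacements[[pattern]], x, fixed = TRUE)
  }
  x
}

# Made-up rules; spaces inside the quotes are part of the match.
replacements <- c(" Parish" = "", " District" = "", "Saint " = "St. ")

county <- c("Acadia Parish", "Fifth District", "Saint Landry Parish", "Orleans")
cleaned <- replace_phrases(county, replacements)
print(cleaned)
# [1] "Acadia"     "Fifth"      "St. Landry" "Orleans"
```

Strings that contain none of the phrases, like "Orleans" here, pass through unchanged.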

Upvotes: 1
