Gabriella

Reputation: 451

Changing spelling for multiple words at a time in R/replacing many words at once

I have a dataset (survey) and a column of birth_country, where people have written their country of birth. An example of it:

    1 america
    2 usa
    3 american
    4 us of a
    5 united states
    6 england
    7 english
    8 great britain
    9 uk 
    10 united kingdom 

How I would like it to look:

    1 america
    2 america
    3 america
    4 america
    5 america
    6 uk
    7 uk
    8 uk
    9 uk
    10 uk

I have tried using str_replace to list the different spellings and replace them with 'america', but when I look at my dataset, nothing has changed, e.g.:

survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain",  "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")

survey$birth_country <- str_replace(survey$birth_country, ' "united state"|"united statea"|"united states of america"', "america")

Thank you in advance.

Upvotes: 0

Views: 568

Answers (3)

R me matey

Reputation: 685

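The problem is in how you specified your regular expression: the double quotes (and the leading space) inside the pattern string are treated as literal characters, so the pattern matches nothing in your data. Try this (updated based on @Gabriella's comment, and using another tidyverse approach, similar to @MarBlo's):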

library(tidyverse)    
survey <- survey %>%
    mutate(birth_country = if_else(
                str_detect(birth_country, 
                           "(united state)|(united statea)|(united states of america)"), #If your regular expression matches any in birth_country
                "america", #Change it to "america"
                birth_country #Otherwise, keep as is.
                ) #end of if_else
           ) #end of mutate

Other people are suggesting that you come up with a more complex regular expression, which you can certainly do as well. Consecutive "or" (i.e. "|") alternatives in your regular expression do work, though.
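As a small illustration of the point about the pattern, here is a sketch in base R (grepl and sub are used as stand-ins for str_detect and str_replace, which is my assumption to keep the example free of packages; the three-element vector x is made up for the demo). It shows that the quoted pattern from the question matches nothing, and that replacing only the matched part leaves stray characters behind, which is why the if_else approach swaps the whole value instead.

```r
# Made-up sample values for the demo
x <- c("usa", "united states", "england")

# Pattern as written in the question: the inner double quotes and the
# leading space are literal characters, so nothing in x matches
bad <- ' "united state"|"united statea"|"united states of america"'
bad_hits <- grepl(bad, x)
bad_hits
# [1] FALSE FALSE FALSE

# Same alternatives without the inner quotes
good <- "united state|united statea|united states of america"
good_hits <- grepl(good, x)
good_hits
# [1] FALSE  TRUE FALSE

# Replacing only the matched text leaves the trailing "s" of "united states"
replaced <- sub(good, "america", x)
replaced
# [1] "usa"      "americas" "england"
```

Detecting a match and then assigning the whole cell, as in the mutate call above, avoids the leftover "s".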

Upvotes: 1

rawr

Reputation: 20811

Come up with patterns that match only one country each, then loop over what you are already doing (you can swap the replacement step below for your favorite function).

survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain",  "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")

## use a _named_ list of regular expressions
## the name will be the replacement string
l <- list(
  america = 'amer|us|states',
  uk = 'eng|brit|king|uk',
  'another country' = 'ano|an co',
  chaz = 'chaz|chop'
)

f <- function(x, list) {
  for (ii in seq_along(list)) {
    x[grepl(list[[ii]], x, ignore.case = TRUE)] <- names(list)[ii]
  }
  x
}

## test it
f(survey$birth_country, l)
# [1] "america" "america" "america" "america" "america" "uk"      "uk"      "uk"      "uk"      "uk"     

within(survey, {
  clean <- f(birth_country, l)
})
#     birth_country   clean
# 1         america america
# 2             usa america
# 3        american america
# 4         us of a america
# 5   united states america
# 6         england      uk
# 7         english      uk
# 8   great britain      uk
# 9              uk      uk
# 10 united kingdom      uk

Note that 1) if you don't give a pattern that matches, nothing will change, and 2) if you give a pattern that matches both countries (e.g., "united"), the first entry in the list will be used (unless the replacement string itself is matched by a later pattern).
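The ordering caveat can be checked with a short sketch. It reuses the same helper f; the pattern lists overlap, greedy and bounded, and the "russia" entry, are hypothetical and only there to trigger each case. The word-boundary variant at the end is one possible way to tighten a short pattern, not the only one.

```r
# Same helper as above: each pattern in turn overwrites the values it matches
f <- function(x, list) {
  for (ii in seq_along(list)) {
    x[grepl(list[[ii]], x, ignore.case = TRUE)] <- names(list)[ii]
  }
  x
}

# "united" appears in both patterns: the first entry (america) wins
overlap <- list(america = "amer|united", uk = "eng|brit|united")
first_wins <- f(c("united states", "united kingdom", "england"), overlap)
first_wins
# [1] "america" "america" "uk"

# The replacement "russia" contains "us", so the later america pattern
# overwrites it again
greedy <- list(russia = "rus", america = "amer|us")
overwritten <- f(c("russia", "usa"), greedy)
overwritten
# [1] "america" "america"

# Word boundaries (\\b) stop the short pattern matching inside other words
bounded <- list(russia = "rus", america = "amer|\\bus\\b|\\busa\\b")
fixed <- f(c("russia", "usa", "us of a"), bounded)
fixed
# [1] "russia"  "america" "america"
```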

Upvotes: 2

MarBlo

Reputation: 4524

If you are able to use tidyverse's mutate, you can do:

library(tidyverse)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain",  "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")

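# Exact spellings to treat as America
americas <- c("america", "usa", "american", "us of a", "united states")
# Listed for reference only; the ifelse() below does not use it
englands <- c("england", "english", "great britain")
survey %>% 
  # Every value not in `americas` becomes 'UK', including unexpected spellings
  mutate(birth_country = ifelse(birth_country %in% americas, 'america', 'UK'))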
#>    birth_country
#> 1        america
#> 2        america
#> 3        america
#> 4        america
#> 5        america
#> 6             UK
#> 7             UK
#> 8             UK
#> 9             UK
#> 10            UK
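Because the ifelse above sends every value outside americas to 'UK', a survey with more countries may need a version that leaves unlisted values alone. One possible sketch, in base R rather than tidyverse (my choice, to keep it package-free), uses a named lookup vector; the lookup and matched names and the extra "france" row are assumptions added for illustration.

```r
# Hypothetical extra row ("france") to show what happens to unlisted values
survey <- data.frame(
  birth_country = c("america", "usa", "us of a", "england", "great britain",
                    "united kingdom", "france"),
  stringsAsFactors = FALSE
)

# Named vector: names are spellings seen in the data, values the clean label
lookup <- c(
  "america" = "america", "usa" = "america", "american" = "america",
  "us of a" = "america", "united states" = "america",
  "england" = "uk", "english" = "uk", "great britain" = "uk",
  "uk" = "uk", "united kingdom" = "uk"
)

# Indexing by name returns NA for spellings that are not in the lookup
matched <- unname(lookup[survey$birth_country])

# Keep the original value wherever there was no match
survey$clean <- ifelse(is.na(matched), survey$birth_country, matched)
survey$clean
# [1] "america" "america" "america" "uk"      "uk"      "uk"      "france"
```

The same lookup could be applied inside mutate; new spellings only require another entry in the vector.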

Upvotes: 1
