Louise

Reputation: 93

R: Replace Abbreviations with Words

I have been trying to solve this problem all day without success.

I am trying to replace the following abbreviations with the desired words in my dataset:

-Abbreviations: USA, H2O, Type 3, T3, bp

The input data is for example

The desired output is

I have tried the following code but without success:

   data= read.csv(C:"xxxxxxx, header= TRUE")
   lowercase= tolower(data$MESSAGE)
   dict=list("\\busa\\b"= "united states of america", "\\bh2o\\b"= 
   "water", "\\btype 3\\b|\\bt3\\"= "type 3 disease", "\\bbp\\b"= 
   "blood pressure")
   for(i in 1:length(dict1)){
   lowercasea= gsub(paste0("\\b", names(dict)[i], "\\b"), 
   dict[[i]], lowercase)}

I know that I am definitely doing something wrong. Could anyone guide me on this? Thank you in advance.

Upvotes: 1

Views: 720

Answers (1)

Wiktor Stribiżew

Reputation: 627100

If you need to replace only whole words (e.g. bp in Some bp. but not in bpcatalogue), build a regular expression from the abbreviations and wrap it in word boundaries. Since some of the abbreviations consist of several words, also sort them by length in descending order; otherwise a shorter alternative (e.g. type) may trigger a replacement before a longer one that starts with it (e.g. type three).
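To illustrate why the order of alternatives matters, here is a minimal base R sketch. The alternative Type is hypothetical (it is not in the question's list), and perl = TRUE is an assumption made so that the engine tries the alternatives from left to right:

```r
x <- "Type 3"

# Shorter alternative listed first: the match stops at "Type"
short_first <- regmatches(x, regexpr("\\b(Type|Type 3)\\b", x, perl = TRUE))

# Longer alternative listed first: the whole "Type 3" is matched
long_first <- regmatches(x, regexpr("\\b(Type 3|Type)\\b", x, perl = TRUE))

print(short_first)
print(long_first)
```

Sorting by descending length is a simple way to make sure the longer alternative is always tried first.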

Example code:

abbreviations <- c("USA", "H2O", "Type 3", "T3", "bp")
desired_words <- c("United States of America", "Water", "Type 3 Disease", "Type 3 Disease", "blood pressure")
# Lookup table: each abbreviation next to its replacement
df <- data.frame(abbreviations, desired_words, stringsAsFactors = FALSE)
x <- 'Abbreviations: USA, H2O, Type 3, T3, bp'
# Longest abbreviations first, so they come first in the alternation
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]

library(stringr)
str_replace_all(x, 
    paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b"), 
    # match() is vectorised, so this works whether z holds one match or several
    function(z) df$desired_words[match(z, df$abbreviations)]
) 

The paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b") code creates a regex like \b(Type 3|USA|H2O|T3|bp)\b. It matches Type 3, USA, etc. as whole words only, since \b is a word boundary. If a match is found, stringr::str_replace_all replaces it with the corresponding entry of desired_words.
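For comparison, here is a base R sketch of the loop approach from the question. As posted, that loop refers to dict1 rather than dict, and it always reads from lowercase while writing to lowercasea, so at most the last replacement would survive. The version below is only an illustration under some assumptions: the messages vector is made up and stands in for data$MESSAGE, the text is lower-cased first as in the question, and the patterns carry their own word boundaries.

```r
# Hypothetical sample input standing in for data$MESSAGE
messages <- c("The USA study measured bp and H2O intake.",
              "Type 3 and T3 patients; bpcatalogue is left alone.")

# Named vector: regex pattern -> replacement (patterns include \\b already)
dict <- c("\\busa\\b"         = "united states of america",
          "\\bh2o\\b"         = "water",
          "\\b(type 3|t3)\\b" = "type 3 disease",
          "\\bbp\\b"          = "blood pressure")

result <- tolower(messages)
for (i in seq_along(dict)) {
  # Feed `result` back in so that earlier replacements are kept
  result <- gsub(names(dict)[i], dict[[i]], result, perl = TRUE)
}
print(result)
```

Keeping type 3 and t3 in one pattern means both are replaced in a single pass, so the replacement text type 3 disease is not scanned again for type 3.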

See the R demo online.

Upvotes: 1
