vagabond

Reputation: 3594

Replace specific characters in a variable in data frame in R

I want to replace every comma, hyphen, parenthesis and space (that is: , - ) ( and " ") with . in the variable DMA.NAME of the example data frame. I referred to three posts and tried their approaches, but all of them failed:

Replacing column values in data frame, not included in list

R replace all particular values in a data frame

Replace characters from a column of a data frame R

Approach 1

> shouldbecomeperiod <- c$DMA.NAME %in% c("-", ",", " ", "(", ")")
c$DMA.NAME[shouldbecomeperiod] <- "."

Approach 2

> removetext <- c("-", ",", " ", "(", ")")
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME)
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME, fixed = TRUE)

Warning message:
In gsub(removetext, ".", c$DMA.NAME) :
  argument 'pattern' has length > 1 and only the first element will be used

Approach 3

> c[c == c(" ", ",", "(", ")", "-")] <- "."

Sample data frame

> df
DMA.CODE                  DATE                   DMA.NAME       count
111         22 8/14/2014 12:00:00 AM               Columbus, OH     1
112         23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
79          18 7/30/2014 12:00:00 AM        Boston (Manchester)     1
99          22 8/20/2014 12:00:00 AM               Columbus, OH     1
112.1       23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
208         27 7/31/2014 12:00:00 AM       Minneapolis-St. Paul     1

I know what the problem is: gsub expects a single pattern, so only the first element of my vector is used. The other two approaches compare each entire value against the listed characters instead of searching within each value for those characters.
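
To make the difference concrete, here is a minimal sketch on a made-up vector (x is hypothetical, not the DMA.NAME column). It contrasts whole-value matching with %in% against searching inside each value with grepl:

```r
# Hypothetical sample values; the last one is exactly a single comma
x <- c("Columbus, OH", "Boston (Manchester)", ",")

# %in% tests whether each whole element equals one of the listed strings
whole_match <- x %in% c("-", ",", " ", "(", ")")

# grepl with a bracket expression tests whether each element contains
# any one of the listed characters
contains_any <- grepl("[-, ()]", x)

print(whole_match)
print(contains_any)
```

Only the element that is exactly "," is flagged by %in%, while grepl flags every element that contains one of the characters.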

Upvotes: 4

Views: 4140

Answers (2)

bartektartanus

Reputation: 16080

If your data frame is big, you might want to look at this fast function from the stringi package. It replaces every character of a given class with another string. Here the character class is L (letters), written inside {}; the capital P before {} selects the complement of that set, so the pattern matches every non-letter character (digits included). merge = TRUE means that consecutive matches are merged into a single replacement.

library(stringi)
stri_replace_all_charclass(df$DMA.NAME, "\\P{L}", ".", merge = TRUE)
## [1] "Columbus.OH"                "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."         "Columbus.OH"               
## [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"   
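
For comparison, roughly the same replacement can be sketched in base R with a Perl-style regex. This assumes R's PCRE library was built with Unicode property support, and the vector x below is a made-up sample rather than the full data frame:

```r
# Hypothetical sample of DMA names
x <- c("Columbus, OH", "Boston (Manchester)", "Minneapolis-St. Paul")

# \\P{L} matches any non-letter character; + collapses a run of them
# into a single replacement, similar in effect to merge = TRUE
res <- gsub("\\P{L}+", ".", x, perl = TRUE)
print(res)
```

This avoids the extra dependency, at the cost of the speed advantage that is the reason for suggesting stringi.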

And some benchmarks:

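x <- sample(df$DMA.NAME, 1000, replace = TRUE)

# base R: collapse runs of punctuation/whitespace into a single "."
gsubFun <- function(x) {
  gsub("[[:punct:][:space:]]+", "\\.", x)
}

# stringi: collapse runs of non-letter characters into a single "."
striFun <- function(x) {
  stri_replace_all_charclass(x, "\\P{L}", ".", merge = TRUE)
}

library(microbenchmark)
microbenchmark(gsubFun(x), striFun(x))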
Unit: microseconds
       expr      min        lq   median        uq       max neval
 gsubFun(x) 3472.276 3511.0015 3538.097 3573.5835 11039.984   100
 striFun(x)  877.259  893.3945  907.769  929.8065  3189.017   100

Upvotes: 3

nrussell

Reputation: 18612

You can use the POSIX character classes [:punct:] and [:space:] inside a bracket expression ([...]), like this:

df <- data.frame(
  DMA.NAME = c(
    "Columbus, OH",
    "Orlando-Daytona Bch-Melbrn",
    "Boston (Manchester)",
    "Columbus, OH",
    "Orlando-Daytona Bch-Melbrn",
    "Minneapolis-St. Paul"),
  stringsAsFactors = FALSE)

gsub("[[:punct:][:space:]]+", "\\.", df$DMA.NAME)
[1] "Columbus.OH"                "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."         "Columbus.OH"               
[5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
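
Note that [:punct:] covers all punctuation, including characters such as . that were not on the list in the question. If only the five listed characters should be replaced, a sketch along these lines may be closer to what was asked. The vector x is a made-up sample, and whether runs should be collapsed is an assumption you would need to decide on:

```r
# Hypothetical sample of DMA names
x <- c("Columbus, OH", "Boston (Manchester)", "Minneapolis-St. Paul")

# Replace each of - , space ( ) individually; the hyphen comes first in
# the bracket expression so it is taken literally
exact <- gsub("[-, ()]", ".", x)

# Same character set, but a run of matches becomes a single "."
merged <- gsub("[-, ()]+", ".", x)

# chartr does a one-to-one character translation without any regex
translated <- chartr("-, ()", ".....", x)

print(exact)
print(merged)
print(translated)
```

To store the result, assign it back to the column, for example df$DMA.NAME <- gsub("[-, ()]", ".", df$DMA.NAME).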

Upvotes: 4
