IVIM

Reputation: 2367

Replacing N-th character in a string in R

In Canada, postal codes have the form "B7J 6B1". To clean postal codes, I need to fix typos so that "81J 8BL" becomes "B7J 6B1". In other words, I need a function that replaces the N-th character of a string .str with "B" if it is "A" and otherwise does nothing, similar to str_replace() but at the level of a single character:

str_replaceCharacter <- function (.str, N, a, b) { ... }

Ideally, I need a very fast function, so that I can run it on millions of records like this:

dt[, ZIP := str_replaceCharacter (ZIP, 1, "8", "B")] [
   , ZIP := str_replaceCharacter (ZIP, 3, "8", "B")] [
   , ZIP := str_replaceCharacter (ZIP, 5, "8", "B")] [
   , ZIP := str_replaceCharacter (ZIP, 2, "L", "1")] [
   , ZIP := str_replaceCharacter (ZIP, 4, "L", "1")] [
   , ZIP := str_replaceCharacter (ZIP, 6, "L", "1")] [ and so on - 30 more lines like this]

Upvotes: 1

Views: 168

Answers (3)

MacOS

Reputation: 1159

You can use the following

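str_replaceCharacter <- function (my.string, N, a, b) {
  # Split the single input string into its individual characters
  characters <- unlist(strsplit(my.string, ""))
  
  # Replace only if position N exists and holds a (isTRUE guards against NA)
  if (isTRUE(characters[N] == a))
    characters[N] <- b
  
  # Reassemble the characters into one string
  my.string <- paste(characters, collapse = "")
  return(my.string)
}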



my.string <- "81J 8BL"

replaced <- str_replaceCharacter(my.string, 2, "1", "7")
replaced

and then simply use lapply, which is usually pretty fast. If that is still not fast enough, I would recommend mclapply from the parallel package, which is a multicore implementation of lapply.
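A minimal sketch of both options, assuming a small made-up vector `codes` (not from the question) and two worker processes. The function is repeated so the block runs on its own. `vapply` is swapped in for `lapply` here only because it returns a character vector directly.

```r
library(parallel)  # ships with R; provides mclapply()

str_replaceCharacter <- function(my.string, N, a, b) {
  characters <- unlist(strsplit(my.string, ""))
  if (isTRUE(characters[N] == a)) characters[N] <- b
  paste(characters, collapse = "")
}

# Hypothetical sample data, for illustration only
codes <- c("81J 8BL", "B7J 6B1", "87J 6B1")

# Serial: one call per string, collected into a character vector
fixed_serial <- vapply(codes, str_replaceCharacter, character(1),
                       N = 1, a = "8", b = "B", USE.NAMES = FALSE)

# Parallel: forking is not available on Windows, so use a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
fixed_parallel <- unlist(mclapply(codes, str_replaceCharacter,
                                  N = 1, a = "8", b = "B",
                                  mc.cores = n_cores))

fixed_serial
```

Whether the parallel version pays off depends on the data size and the number of cores, since starting worker processes has its own overhead.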

From a performance perspective, this experiment

my.string <- "81J 8BL"

first     <- rep_len(my.string, 10)
second    <- rep_len(my.string, 100)
third     <- rep_len(my.string, 1000)
fourth    <- rep_len(my.string, 10000)
fifth     <- rep_len(my.string, 100000)
sixth     <- rep_len(my.string, 1000000)
seventh   <- rep_len(my.string, 10000000)

leads to

> system.time(lapply(first, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
      0       0       0 
> system.time(lapply(second, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.001   0.000   0.000 
> system.time(lapply(third, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.004   0.000   0.005 
> system.time(lapply(fourth, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.048   0.000   0.048 
> system.time(lapply(fifth, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.484   0.000   0.485 
> system.time(lapply(sixth, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  5.487   0.000   5.487 
> system.time(lapply(seventh, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
 66.065   0.286  66.356 

And the plot for the variable elapsed. [plot: elapsed time]

Gregor Thomas's answer leads to the following on my computer.

> system.time(lapply(first, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.003   0.000   0.002 
> 
> system.time(lapply(second, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.000   0.000   0.001 
> 
> system.time(lapply(third, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.003   0.000   0.004 
> 
> system.time(lapply(fourth, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.037   0.000   0.037 
> 
> system.time(lapply(fifth, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.359   0.000   0.359 
> 
> system.time(lapply(sixth, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  4.990   0.019   5.010 
> 
> system.time(lapply(seventh, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
 49.599   0.167  49.764 

And akrun's answer leads to

> system.time(lapply(first, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.003   0.000   0.027 
> 
> system.time(lapply(second, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.001   0.000   0.001 
> 
> system.time(lapply(third, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.005   0.000   0.006 
> 
> system.time(lapply(fourth, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.056   0.000   0.055 
> 
> system.time(lapply(fifth, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  0.588   0.000   0.588 
> 
> system.time(lapply(sixth, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
  6.067   0.000   6.065 
> 
> system.time(lapply(seventh, str_replaceCharacter, N = 2, a = "1", b = "7"))
   user  system elapsed 
 81.439   0.016  81.449 

Gregor Thomas's answer hence seems to perform best.
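One caveat about the timings above: they call each function once per string through lapply. The substr-based function accepts a whole character vector, so it can also be called once on all strings. Below is a rough sketch of how one might compare the two. The vector `codes` and its length are assumptions for illustration, and no timings are claimed here because they will differ by machine.

```r
# Base R function from Gregor Thomas's answer, repeated so the block runs alone
str_replace_character <- function(string, index, pattern, replacement) {
  needs_replacement <- substr(string, index, index) == pattern
  substr(string[needs_replacement], index, index) <- replacement
  string
}

# Hypothetical test data: 100,000 copies of the same code
codes <- rep_len("81J 8BL", 100000)

# Element by element, one function call per string
t_loop <- system.time(
  by_element <- unlist(lapply(codes, str_replace_character,
                              index = 2, pattern = "1", replacement = "7"))
)

# A single call on the whole vector
t_vec <- system.time(
  whole_vector <- str_replace_character(codes, 2, "1", "7")
)

# Both approaches should give the same strings
identical(by_element, whole_vector)

# Elapsed seconds for each approach
c(lapply = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]])
```

I would expect the single vectorized call to be noticeably faster, because it avoids the per-element function call overhead, but that is worth measuring on the real data.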

Upvotes: 2

akrun

Reputation: 887128

We can use str_sub from stringr

library(stringr)
str_replace_character <- function(string, index, pattern, replacement) {
  needs_replacement <- str_sub(string, index, index) == pattern
  str_sub(string[needs_replacement], index, index) <- replacement
  return(string)
}
str_replace_character(c("B7J 6B1", "81J 8BL", "ABC"), 1, "8", "B")

Upvotes: 1

Gregor Thomas

Reputation: 145775

Don't know just how performant this will be, but here's a base solution that doesn't rely on regex:

str_replace_character = function(string, index, pattern, replacement) {
  needs_replacement = substr(string, index, index) == pattern
  substr(string[needs_replacement], index, index) = replacement
  return(string)
}

str_replace_character(c("B7J 6B1", "81J 8BL", "ABC"), 1, "8", "B")
# [1] "B7J 6B1" "B1J 8BL" "ABC"  
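
For the many position-by-position fixes in the question, one option is to keep the rules in a table and loop over them. This is only a sketch: the names `str_replace_character_safe`, `apply_rules` and `rules` are hypothetical, and the three rules are examples rather than a complete list. It also uses `which()` instead of a logical index because, as far as I can tell, a logical index containing NA (from NA postal codes) can make the subscripted assignment fail.

```r
# Variant of the function above; which() silently drops NA comparisons
str_replace_character_safe <- function(string, index, pattern, replacement) {
  hits <- which(substr(string, index, index) == pattern)
  substr(string[hits], index, index) <- replacement
  string
}

# Hypothetical rule table: position, wrong character, corrected character
rules <- data.frame(
  index       = c(1, 3, 7),
  pattern     = c("8", "8", "L"),
  replacement = c("B", "B", "1"),
  stringsAsFactors = FALSE
)

# Apply every rule in turn to the whole vector
apply_rules <- function(string, rules) {
  for (i in seq_len(nrow(rules))) {
    string <- str_replace_character_safe(
      string, rules$index[i], rules$pattern[i], rules$replacement[i]
    )
  }
  string
}

cleaned <- apply_rules(c("81J 8BL", "B7J 6B1", NA, "ABC"), rules)
cleaned
# [1] "B1J 8B1" "B7J 6B1" NA        "ABC"
```

With data.table this could presumably replace the long chain in the question with a single call such as dt[, ZIP := apply_rules(ZIP, rules)].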

Upvotes: 2
