Reputation: 2367
In Canada, postal codes have the form "B7J 6B1".
For cleaning postal codes, I need to replace all typos so that "81J 8BL" becomes "B7J 6B1".
In other words, I need a function that replaces the N-th character of a string .str
from "A" to "B" (i.e. if the character is A, it is replaced with B; otherwise nothing happens), similar to the str_replace()
function but at the character level:
str_replaceCharacter <- function (.str, N, a, b) { ... }
Ideally, I need a very fast function, so that I can run it on millions of records as
dt[, ZIP := str_replaceCharacter (ZIP, 1, "8", "B")] [
, ZIP := str_replaceCharacter (ZIP, 3, "8", "B")] [
, ZIP := str_replaceCharacter (ZIP, 5, "8", "B")] [
, ZIP := str_replaceCharacter (ZIP, 2, "L", "1")] [
, ZIP := str_replaceCharacter (ZIP, 4, "L", "1")] [
, ZIP := str_replaceCharacter (ZIP, 6, "L", "1")] [ and so on - 30 more lines like this]
Upvotes: 1
Views: 168
Reputation: 1159
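You can use the following:
str_replaceCharacter <- function (my.string, N, a, b) {
  # Split the string into single characters
  characters <- unlist(strsplit(my.string, ""))
  # Replace only if the N-th character exists and equals a;
  # isTRUE() avoids an error when the string is shorter than N
  if (isTRUE(characters[N] == a))
    characters[N] <- b
  # Reassemble the characters into one string
  my.string <- paste(characters, collapse = "")
  return(my.string)
}
my.string <- "81J 8BL"
replaced <- str_replaceCharacter(my.string, 2, "1", "7")
replaced
# [1] "87J 8BL"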
and then simply use the apply function, as this is usually pretty fast. However, if this is still not fast enough, I would recommend using mclapply, which is a multicore implementation of lapply.
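As a rough sketch of what that could look like: the helper is repeated here (with an isTRUE() guard so that strings shorter than N pass through unchanged), then applied with vapply and with parallel::mclapply. The sample vector `codes` and the core count are assumptions made for illustration only.

```r
# Helper from this answer, repeated so the sketch is self-contained.
str_replaceCharacter <- function(my.string, N, a, b) {
  characters <- unlist(strsplit(my.string, ""))
  if (isTRUE(characters[N] == a))
    characters[N] <- b
  paste(characters, collapse = "")
}

codes <- c("81J 8BL", "B7J 6B1", "X")   # hypothetical sample data

# vapply() calls the helper once per element and returns a character vector.
fixed <- vapply(codes, str_replaceCharacter, character(1),
                N = 2, a = "1", b = "7", USE.NAMES = FALSE)

# mclapply() forks worker processes; forking is not available on Windows,
# so fall back to a single core there. The value 2 is an arbitrary choice.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
fixed_parallel <- unlist(parallel::mclapply(codes, str_replaceCharacter,
                                            N = 2, a = "1", b = "7",
                                            mc.cores = n_cores))
```

Both calls still invoke the helper once per string, so the parallel version only pays off when the vector is large enough to outweigh the cost of forking.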
From a performance perspective, this experiment
str_replaceCharacter <- function (my.string, N, a, b) {
characters <- unlist(strsplit(my.string, ""))
if (characters[N] == a)
characters[N] <- b
my.string <- paste(characters, collapse = "")
return(my.string)
}
my.string <- "81J 8BL"
first <- rep_len(my.string, 10)
second <- rep_len(my.string, 100)
third <- rep_len(my.string, 1000)
fourth <- rep_len(my.string, 10000)
fifth <- rep_len(my.string, 100000)
sixth <- rep_len(my.string, 1000000)
seventh <- rep_len(my.string, 10000000)
leads to the following timings:
> system.time(lapply(first, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0 0 0
> system.time(lapply(second, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.001 0.000 0.000
> system.time(lapply(third, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.004 0.000 0.005
> system.time(lapply(fourth, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.048 0.000 0.048
> system.time(lapply(fifth, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.484 0.000 0.485
> system.time(lapply(sixth, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
5.487 0.000 5.487
> system.time(lapply(seventh, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
66.065 0.286 66.356
And the plot for the variable elapsed: [plot not available]
Gregor Thomas's answer leads to the following on my computer.
> system.time(lapply(first, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.003 0.000 0.002
>
> system.time(lapply(second, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.000 0.000 0.001
>
> system.time(lapply(third, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.003 0.000 0.004
>
> system.time(lapply(fourth, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.037 0.000 0.037
>
> system.time(lapply(fifth, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.359 0.000 0.359
>
> system.time(lapply(sixth, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
4.990 0.019 5.010
>
> system.time(lapply(seventh, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
49.599 0.167 49.764
And akrun's answer leads to
> system.time(lapply(first, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.003 0.000 0.027
>
> system.time(lapply(second, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.001 0.000 0.001
>
> system.time(lapply(third, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.005 0.000 0.006
>
> system.time(lapply(fourth, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.056 0.000 0.055
>
> system.time(lapply(fifth, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
0.588 0.000 0.588
>
> system.time(lapply(sixth, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
6.067 0.000 6.065
>
> system.time(lapply(seventh, str_replaceCharacter, N = 2, a = "1", b = "7"))
user system elapsed
81.439 0.016 81.449
Hence, the answer from Gregor Thomas seems to perform best.
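Note that all timings above call the function once per element through lapply. The substr and str_sub versions are vectorized, so they can also be called a single time on the whole vector. The following is only a sketch of how that comparison could be set up; the vector size is an arbitrary assumption and no timings are claimed here.

```r
# Vectorized base R function from Gregor Thomas's answer,
# repeated so the sketch is self-contained.
str_replace_character <- function(string, index, pattern, replacement) {
  needs_replacement <- substr(string, index, index) == pattern
  substr(string[needs_replacement], index, index) <- replacement
  string
}

codes <- rep_len("81J 8BL", 100000)  # sample size chosen arbitrarily

# One call per element, as in the benchmark above.
t_loop <- system.time(
  res_loop <- unlist(lapply(codes, str_replace_character,
                            index = 2, pattern = "1", replacement = "7"))
)

# A single call on the whole vector.
t_vec <- system.time(
  res_vec <- str_replace_character(codes, 2, "1", "7")
)

# Both approaches must produce the same result.
stopifnot(identical(res_loop, res_vec))

# Elapsed seconds for each approach (values depend on the machine).
print(c(loop = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]]))
```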
Upvotes: 2
Reputation: 887128
We can use str_sub from stringr:
library(stringr)
str_replace_character <- function(string, index, pattern, replacement) {
  needs_replacement <- str_sub(string, index, index) == pattern
  str_sub(string[needs_replacement], index, index) <- replacement
  return(string)
}
str_replace_character(c("B7J 6B1", "81J 8BL", "ABC"), 1, "8", "B")
Upvotes: 1
Reputation: 145775
I don't know just how performant this will be, but here's a base R solution that doesn't rely on regex:
str_replace_character = function(string, index, pattern, replacement) {
  # Flag the elements whose character at position index equals pattern
  needs_replacement = substr(string, index, index) == pattern
  # Overwrite that character in the flagged elements only
  substr(string[needs_replacement], index, index) = replacement
  return(string)
}
str_replace_character(c("B7J 6B1", "81J 8BL", "ABC"), 1, "8", "B")
# [1] "B7J 6B1" "B1J 8BL" "ABC"
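To avoid writing thirty-odd chained calls, the replacements could be kept in a table and applied in a loop. This is a minimal sketch, not part of the original answer: `replace_char_at`, `rules`, and `apply_rules` are hypothetical names, the rule rows are copied from the question's examples, and `which()` is used on the assumption that the data may contain NA values, which a logical index with several NAs would reject in the assignment.

```r
# NA-tolerant variant of the function above: which() drops NA comparisons,
# so missing postal codes pass through unchanged.
replace_char_at <- function(string, index, pattern, replacement) {
  hits <- which(substr(string, index, index) == pattern)
  substr(string[hits], index, index) <- replacement
  string
}

# Hypothetical rule table: one row per (position, wrong, right) triple.
rules <- data.frame(
  index       = c(1, 3, 5, 2, 4, 6),
  pattern     = c("8", "8", "8", "L", "L", "L"),
  replacement = c("B", "B", "B", "1", "1", "1"),
  stringsAsFactors = FALSE
)

# Apply every rule in turn; each step is one vectorized pass over the data.
apply_rules <- function(string, rules) {
  for (i in seq_len(nrow(rules))) {
    string <- replace_char_at(string, rules$index[i],
                              rules$pattern[i], rules$replacement[i])
  }
  string
}

zip <- c("8LJ6BL", "B7J6B1", NA)
cleaned <- apply_rules(zip, rules)
cleaned
# [1] "B1J6B1" "B7J6B1" NA
```

With data.table, this would presumably be used as dt[, ZIP := apply_rules(ZIP, rules)], so the whole column is cleaned in a single assignment.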
Upvotes: 2