Reputation: 20409
For a library call I have to provide a separator which must not occur in the text, because otherwise the library call gets confused.
Now I was wondering how I can adapt my code to ensure that the separator I use is guaranteed not to occur in the input text.
I am solving this issue with a while loop: I make a (hardcoded) assumption about the most unlikely string in the input, check whether it is present and, if so, just enlarge the string. This works but feels very hackish, so I was wondering whether there is a more elegant version (e.g. an existing base R function, or a loop-free solution) which does the same for me. Ideally the separator found is also minimal in length.
I could simply hardcode a large enough set of potential separators and look for the first one not occurring in the text, but this may also break at some point if all of these separators happen to occur in my input.
My reasoning is that even if it will probably never happen (well, never say never), I am afraid that at some point in the distant future there will be that one input string which requires thousands of while iterations before an unused string is found.
input_string <- c("a/b", "a#b", "a//b", "a-b", "a,b", "a.b")
orig_sep <- sep <- "/" ## first guess as a separator
while (any(grepl(sep, input_string, fixed = TRUE))) {
  sep <- paste0(sep, orig_sep)
}
print(sep)
# "///"
Upvotes: 2
Views: 222
Reputation: 20409
I ran some benchmarks, and the sad news is that the regex solution only pays off when the input contains many occurrences of the separator. I don't expect long repetitions of the separator, so from that perspective the while solution should be preferable, although it would be the first time in my R life that I actually had to rely on a while construct.
Code
library(microbenchmark)
sep <- "/"
make_input <- function(max_occ, vec_len = 1000) {
  paste0("A", strrep(sep, sample(0:max_occ, vec_len, TRUE)))
}
set.seed(1)
no_occ <- make_input(0)
typ_occ <- make_input(1)
mid_occ <- make_input(10)
high_occ <- make_input(100)
while_fun <- function(in_str) {
  my_sep <- sep
  while (any(grepl(my_sep, in_str, fixed = TRUE))) {
    my_sep <- paste0(my_sep, sep)
  }
  my_sep
}
greg_fun <- function(in_str) {
  strrep(sep,
         max(sapply(gregexpr(paste0(sep, "+"), in_str),
                    purrr::attr_getter("match.length")), 0) + 1)
}
microbenchmark(no_occ_w = while_fun(no_occ),
no_occ_r = greg_fun(no_occ),
typ_occ_w = while_fun(typ_occ),
typ_occ_r = greg_fun(typ_occ),
mid_occ_w = while_fun(mid_occ),
mid_occ_r = greg_fun(mid_occ),
high_occ_w = while_fun(high_occ),
high_occ_r = greg_fun(high_occ))
Results
Unit: microseconds
expr min lq mean median uq max neval cld
no_occ_w 12.3 13.30 15.947 14.60 16.55 51.1 100 a
no_occ_r 1074.8 1184.90 1981.637 1253.45 1546.20 7037.9 100 b
typ_occ_w 33.8 36.00 42.842 38.55 41.45 229.2 100 a
typ_occ_r 1090.4 1192.15 2090.526 1283.80 1547.10 8490.7 100 b
mid_occ_w 277.9 283.35 336.466 288.30 309.45 3452.2 100 a
mid_occ_r 1161.6 1269.50 2204.213 1368.45 1789.20 7664.7 100 b
high_occ_w 3736.4 3852.95 4082.844 3962.30 4097.60 6658.3 100 d
high_occ_r 1685.5 1776.15 2819.703 1868.10 4065.00 7960.9 100 c
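For reference, the regex variant only needs purrr for attr_getter. Here is a minimal base-R-only sketch of the same idea. The name base_fun is hypothetical and was not part of the benchmark above, and the sketch assumes that sep is a single character without any special meaning in a regular expression (as is the case for "/").

```r
# Hypothetical base-R variant of greg_fun (not benchmarked above).
# Assumption: `sep` is a single character with no special regex meaning.
base_fun <- function(in_str, sep = "/") {
  # extract every run of consecutive separators from every element
  runs <- unlist(regmatches(in_str, gregexpr(paste0(sep, "+"), in_str)))
  # longest run found (0 if the separator never occurs), plus one
  strrep(sep, max(nchar(runs), 0) + 1)
}

base_fun(c("A", "A/", "A///"))
base_fun("A")
```

The extra 0 passed to max covers the case where no run is found at all, so the function then falls back to a single separator character.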
Upvotes: 0
Reputation: 39737
If a single unused ASCII character exists, you can find it with table.
tt <- table(factor(strsplit(paste(input_string, collapse = ""), "")[[1]]
, rawToChar(as.raw(32:126), TRUE)))
names(tt)[tt==0]
rawToChar(as.raw(32:126), TRUE) returns all printable ASCII characters, which are used as the factor levels, and table counts how often each one occurs. Any character with a count of 0 can serve as the separator.
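The same single-character check can also be sketched with setdiff instead of table, which skips the counting step. The variable names ascii, used and unused are only illustrative, and input_string is repeated from the question so that the snippet runs on its own.

```r
# input_string as given in the question
input_string <- c("a/b", "a#b", "a//b", "a-b", "a,b", "a.b")

# all printable ASCII characters, one per element
ascii <- rawToChar(as.raw(32:126), multiple = TRUE)
# every distinct character that occurs anywhere in the input
used <- unique(unlist(strsplit(input_string, "", fixed = TRUE)))
# candidates that never occur, in ASCII order
unused <- setdiff(ascii, used)
unused[1]
```

If unused is empty, every printable ASCII character occurs in the input and a longer separator is required.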
If you need a separator of two ASCII characters, you can try the following, which returns all possible delimiters:
x <- rawToChar(as.raw(32:126), TRUE)
x <- c(outer(x, x, paste0))
x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})]
Or, for separators of n ASCII characters:
orig_sep <- x <- rawToChar(as.raw(32:126), TRUE)
sep <- x[0]
repeat {
  sep <- x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})]
  if(length(sep) > 0) break;
  x <- c(outer(x, orig_sep, paste0))
}
sep
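Because the number of candidates grows by a factor of 95 with every extra character, it may be worth capping the search. The following is a minimal sketch of the loop above wrapped in a function; the name find_separator and the max_len argument are hypothetical additions, and the function returns only the first unused candidate of the shortest length.

```r
# Hypothetical wrapper around the repeat loop above, with a length cap.
find_separator <- function(input, max_len = 3L) {
  alphabet <- rawToChar(as.raw(32:126), multiple = TRUE)
  candidates <- alphabet
  for (len in seq_len(max_len)) {
    # TRUE for every candidate that occurs somewhere in the input
    used <- vapply(candidates,
                   function(cand) any(grepl(cand, input, fixed = TRUE)),
                   logical(1))
    if (!all(used)) return(candidates[!used][1])
    # extend every candidate by one more character
    candidates <- c(outer(candidates, alphabet, paste0))
  }
  stop("no unused separator of length <= ", max_len)
}

find_separator(c("a/b", "a#b", "a//b", "a-b", "a,b", "a.b"))
```

For an input that already contains every printable ASCII character, the function moves on to two-character candidates instead of failing.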
To search the 1- and 2-character ASCII candidates with a single sapply loop and take the separator of minimal length:
x <- rawToChar(as.raw(32:126), TRUE)
x <- c(x, outer(x, x, paste0))
x[!sapply(x, function(y) {any(grepl(y, input_string, fixed=TRUE))})][1]
#[1] " "
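If you want to know how many times a character has to be repeated to work as a separator, as you do in the question, you can use gregexpr. The first variant matches /*, which always matches (possibly with length 0). The second matches /+ and adds 0 to the maximum, so that inputs without any / (where match.length is -1) still yield a single /.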
strrep("/", max(sapply(gregexpr("/*", input_string)
, function(x) max(attributes(x)$match.length)))+1)
#[1] "///"
strrep("/", max(c(0, sapply(gregexpr("/+", input_string)
, function(x) max(attributes(x)$match.length))))+1)
#[1] "///"
Upvotes: 2