Reputation: 125
I am looking for a way to delete the characters at certain positions within a string in R. For example, given the string "1,2,1,1,2,1,1,1,1,2,1,1", I want to delete the characters at the 3rd, 4th, 7th and 8th positions. The operation would produce the string "1,1,2,1,1,1,1,2,1,1".
Unfortunately, breaking the string into a list using strsplit is not an option, because the strings I am working with are over 1 million characters long. Considering I have about 2,500 strings, it works out to be quite some time.
Alternatively, finding a way to replace the characters with an empty string "" would achieve the same purpose, I think. Looking into this line of thought, I came across this StackOverflow post:
R: How can I replace let's say the 5th element within a string?
Unfortunately, the suggested solution is hard to generalize efficiently; the following takes about 60 seconds per input string for a list of 2,000 positions to remove:
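# pos must be sorted in increasing order
subchar2 = function(inputstring, pos){
  string = ""
  memory = 0   # last removed position seen so far
  for(num in pos){
    # append the piece between the previous removed position and this one
    string = paste(string, substr(inputstring, (memory+1), (num-1)), sep = "")
    memory = num
  }
  # append whatever follows the last removed position
  string = paste(string, substr(inputstring,(memory+1), nchar(inputstring)),sep = "")
  return(string)
}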
Looking into the problem, I found a snippet of code that seems to replace the characters at certain positions with "-":
subchar <- function(string, pos) {
  for(i in pos) {
    string <- gsub(paste("^(.{", i-1, "}).", sep=""), "\\1-", string)
  }
  return(string)
}
I don't quite understand regular expressions (yet), but I have a strong suspicion that something along these lines will be much better time-wise than the first code solution. Unfortunately, this subchar function seems to break when the values in pos get high:
> test = subchar(data[1], 257)
Error in gsub(paste("^(.{", i - 1, "}).", sep = ""), "\\1-", string) :
invalid regular expression '^(.{256}).', reason 'Invalid contents of {}'
I was also considering reading the string data into a table using SQL, but I was hoping that there would be an elegant string solution. The SQL implementation in R to do this seems rather complicated.
Any ideas? Thanks!
Upvotes: 5
Views: 4602
Reputation: 7475
One quick speed fix is to remove the paste calls from the for loop:
subchar3<-function(inputstring, pos){
  string = ""
  memory = 0
  for(num in pos){
    string = c(string,substr(inputstring, (memory+1), (num-1)))
    memory = num
  }
  string = paste(c(string, substr(inputstring,(memory+1), nchar(inputstring))),collapse = "")
  return(string)
}
# build a 100,000-character test string and 200 sorted positions to remove
data<-paste(sample(letters,100000,replace=TRUE),collapse='')
remove<-sample(1:nchar(data),200)
remove<-sort(remove)
s2<-subchar2(data,remove)
s3<-subchar3(data,remove)
identical(s2,s3)
#[1] TRUE
> library(rbenchmark)
> benchmark(subchar2(data,remove),subchar3(data,remove),replications=10)
                    test replications elapsed relative user.self sys.self user.child sys.child
1 subchar2(data, remove)           10   43.64 40.78505     39.97      1.9         NA        NA
2 subchar3(data, remove)           10    1.07  1.00000      1.01      0.0         NA        NA
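Going one step further, the loop can be dropped altogether, because substring() is vectorized over its start and stop arguments. The following is only a sketch and has not been benchmarked here; subchar_vec is a made-up name, and it assumes pos contains valid 1-based character positions.

```r
# Hypothetical helper: remove the characters at positions `pos` without a loop.
subchar_vec <- function(inputstring, pos) {
  pos <- sort(unique(pos))                  # the pieces must be taken in order
  starts <- c(1, pos + 1)                   # each kept piece starts after a removed position
  ends <- c(pos - 1, nchar(inputstring))    # and ends just before the next one
  # substring() returns "" whenever start > end, so adjacent positions are harmless
  paste(substring(inputstring, starts, ends), collapse = "")
}

subchar_vec("1,2,1,1,2,1,1,1,1,2,1,1", c(3, 4, 7, 8))
# [1] "1,1,2,1,1,1,1,2,1,1"
```

An empty pos returns the input unchanged, since the only piece then runs from 1 to nchar(inputstring).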
Upvotes: 2
Reputation: 89097
strsplit is more than ten times faster if you use fixed = TRUE. By rough extrapolation, it will take a little over 2 minutes to process your 2,500 strings of 1,000,000 comma-separated integers.
N <- 1000000
x <- sample(0:1, N, replace = TRUE)
s <- paste(x, collapse = ",")
# this is a vector of 10 strings
M <- 10
S <- rep(s, M)
system.time(y <- strsplit(S, split = ","))
# user system elapsed
# 6.57 0.00 6.56
system.time(y <- strsplit(S, split = ",", fixed = TRUE))
# user system elapsed
# 0.46 0.03 0.50
This is almost 3 times faster than using scan:
system.time(scan(textConnection(S), sep=",", what="a"))
# Read 10000000 items
# user system elapsed
# 1.21 0.09 1.42
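Once the string is split, removing elements and gluing the rest back together is a single indexing step. Here is a minimal sketch; drop_fields is a made-up name, and note that idx refers to field numbers rather than character positions. With single-digit fields, deleting characters 3, 4, 7 and 8 in the question's example amounts to dropping fields 2 and 4.

```r
# Hypothetical helper: drop the comma-separated fields numbered `idx`.
drop_fields <- function(s, idx) {
  fields <- strsplit(s, split = ",", fixed = TRUE)[[1]]
  # guard: fields[-integer(0)] would return an empty vector, not everything
  if (length(idx) > 0) fields <- fields[-idx]
  paste(fields, collapse = ",")
}

drop_fields("1,2,1,1,2,1,1,1,1,2,1,1", c(2, 4))
# [1] "1,1,2,1,1,1,1,2,1,1"
```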
Upvotes: 3
Reputation: 263481
Read them in using scan(). You can set the separator to be "," and what="a". You can scan one "line" at a time with nlines=1, and if it is a textConnection, the "pipeline" will "remember" where it was as of the last read.
x <- paste(sample(0:1, 1000, replace = TRUE), collapse = ",")
xin <- textConnection(x)
x995 <- scan(xin, sep=",", what="a", nmax=995)
# Read 995 items
x5 <- scan(xin, sep=",", what="a", nmax=995)
# Read 5 items
Here's an illustration with 5 "lines":
> x <- paste( rep( paste(sample(0:1, 50, rep=T), collapse=","), 5), collapse="\n")
> str(x)
chr "1,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,1,1,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,1,1,0,0,0,1,0,0\n1,0,0,0,0,1,0,0,1,1,1,0,1,"| __truncated__
> xin <- textConnection(x)
> x1 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x2 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x3 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x4 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x5 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x6 <- scan(xin, sep=",", what="a", nlines=1)
Read 0 items
> length(x1)
[1] 50
> length(x1[-c(3,4,7,8)])
[1] 46
> paste(x1, collapse=",")
[1] "1,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,1,1,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,1,1,0,0,0,1,0,0"
>
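The session above can be wrapped in a loop that reads one "line" per call and stops when scan() returns nothing. This is only a sketch: the input reuses the example string from the question, and the field indices c(2, 4) are an assumption standing in for whatever elements need to be dropped.

```r
# Three newline-separated "lines", each a comma-separated record
x <- paste(rep("1,2,1,1,2,1,1,1,1,2,1,1", 3), collapse = "\n")
xin <- textConnection(x)

out <- character(0)
repeat {
  # each call continues where the previous one stopped
  fields <- scan(xin, sep = ",", what = "a", nlines = 1, quiet = TRUE)
  if (length(fields) == 0) break          # nothing left to read
  out <- c(out, paste(fields[-c(2, 4)], collapse = ","))
}
close(xin)

out
```

Each element of out is one input line with the chosen fields removed.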
Upvotes: 3