Gordon Freeman

Reputation: 125

Delete characters at positions within a string in R?

I am looking for a way to delete the characters at certain positions within a string in R. For example, given the string "1,2,1,1,2,1,1,1,1,2,1,1", I want to delete the characters at the third, fourth, seventh and eighth positions. The operation would yield the string "1,1,2,1,1,1,1,2,1,1".

Unfortunately, breaking the string into a list using strsplit is not an option, because the strings I am working with are over 1 million characters long. Considering I have about 2,500 strings, it works out to be quite some time.

Alternatively, finding a way to replace the characters with an empty string "" would achieve the same purpose - I think. Looking into this line of thought, I came across this StackOverflow post:

R: How can I replace let's say the 5th element within a string?

Unfortunately, the solution suggested there is hard to generalize efficiently. The following takes about 60 seconds per input string for a list of 2,000 positions to remove:

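subchar2 = function(inputstring, pos){
    # pos must be sorted in increasing order
    string = ""
    memory = 0   # last removed position
    for(num in pos){
        # append the text between the previous removed position and this one
        string = paste(string, substr(inputstring, (memory+1), (num-1)), sep = "")
        memory = num
    }
    # append whatever follows the last removed position
    string = paste(string, substr(inputstring, (memory+1), nchar(inputstring)), sep = "")
    return(string)
}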

Looking into the problem, I found a snippet of code that seems to replace the characters at certain positions with "-":

subchar <- function(string, pos) {
        for(i in pos) {
            string <- gsub(paste("^(.{", i-1, "}).", sep=""), "\\1-", string)
        }
        return(string)
}

I don't quite understand regular expressions (yet), but I have a strong suspicion that something along these lines will be much faster than the first code solution. Unfortunately, this subchar function seems to break when the values in pos get high:

> test = subchar(data[1], 257)
Error in gsub(paste("^(.{", i - 1, "}).", sep = ""), "\\1-", string) :
invalid regular expression '^(.{256}).', reason 'Invalid contents of {}'
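
The error most likely comes from an upper bound that R's default regular-expression engine places on the repetition count inside {}, so this pattern cannot reach far into a long string. A minimal sketch that avoids regular expressions altogether is to convert the string to a vector of integer code points with utf8ToInt, edit that vector by position, and convert it back with intToUtf8. The names replace_at and delete_at are made up for illustration, and the sketch assumes pos holds valid 1-based character positions.

```r
# Hypothetical helpers (not from the original post): edit a string by
# position through its integer code points instead of a regex.
replace_at <- function(string, pos, replacement = "-") {
  codes <- utf8ToInt(string)              # one integer per character
  codes[pos] <- utf8ToInt(replacement)    # replacement must be a single character
  intToUtf8(codes)                        # back to one string
}

delete_at <- function(string, pos) {
  if (length(pos) == 0) return(string)    # x[-integer(0)] would drop everything
  intToUtf8(utf8ToInt(string)[-pos])
}

long <- paste(rep("a", 1000), collapse = "")
substr(replace_at(long, 257), 255, 259)
# [1] "aa-aa"
delete_at("1,2,1,1,2,1,1,1,1,2,1,1", c(3, 4, 7, 8))
# [1] "1,1,2,1,1,1,1,2,1,1"
```

Because the whole edit is a single vector subscript, the positions do not need to be sorted, and no count ever appears inside a regular expression.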

I was also considering reading the string data into a table using SQL, but I was hoping that there would be an elegant string solution. The SQL implementation in R to do this seems rather complicated.

Any ideas? Thanks!

Upvotes: 5

Views: 4602

Answers (3)

shhhhimhuntingrabbits

Reputation: 7475

One quick speed fix is to remove the pastes from the for loop: collect the pieces in a vector and paste them together once at the end.

subchar3<-function(inputstring, pos){
    string = ""
    memory = 0
    for(num in pos){
        # collect the pieces in a vector instead of pasting inside the loop
        string = c(string, substr(inputstring, (memory+1), (num-1)))
        memory = num
    }
    # paste everything together once at the end
    string = paste(c(string, substr(inputstring, (memory+1), nchar(inputstring))), collapse = "")
    return(string)
}
data<-paste(sample(letters,100000,replace=TRUE),collapse='')
remove<-sample(1:nchar(data),200)
remove<-sort(remove)   # positions must be in increasing order
s2<-subchar2(data,remove)
s3<-subchar3(data,remove)
identical(s2,s3)
#[1] TRUE

> library(rbenchmark)
> benchmark(subchar2(data,remove),subchar3(data,remove),replications=10)
                    test replications elapsed relative user.self sys.self
1 subchar2(data, remove)           10   43.64 40.78505     39.97      1.9
2 subchar3(data, remove)           10    1.07  1.00000      1.01      0.0
  user.child sys.child
1         NA        NA
2         NA        NA
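
The same idea can be pushed one step further by dropping the loop entirely, since substring() accepts vectors of start and end positions. The following is only a sketch under that assumption; subchar_vec is a made-up name, and no timings are claimed for it.

```r
# Hypothetical variant (not benchmarked here): compute every kept chunk
# in one vectorized substring() call, then paste once.
subchar_vec <- function(inputstring, pos) {
  pos <- sort(unique(pos))                 # chunks are defined by increasing positions
  starts <- c(1, pos + 1)                  # each chunk starts right after a removed position
  ends <- c(pos - 1, nchar(inputstring))   # and ends right before the next one
  # when start > end (adjacent removed positions) substring() returns ""
  paste(substring(inputstring, starts, ends), collapse = "")
}

subchar_vec("1,2,1,1,2,1,1,1,1,2,1,1", c(3, 4, 7, 8))
# [1] "1,1,2,1,1,1,1,2,1,1"
```

It can be checked against subchar3 with identical() on the same data and remove vectors before comparing speed with benchmark().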

Upvotes: 2

flodel

Reputation: 89097

strsplit is more than ten times faster if you use fixed = TRUE. Extrapolating roughly from the timings below (about 0.5 seconds for 10 strings), it would take a little over 2 minutes to split your 2,500 strings of 1,000,000 comma-separated integers.

N <- 1000000
x <- sample(0:1, N, replace = TRUE)
s <- paste(x, collapse = ",")

# this is a vector of 10 strings
M <- 10
S <- rep(s, M)

system.time(y <- strsplit(S, split = ","))
# user  system elapsed 
# 6.57    0.00    6.56 
system.time(y <- strsplit(S, split = ",", fixed = TRUE))
# user  system elapsed 
# 0.46    0.03    0.50

The fixed = TRUE version is almost 3 times faster than using scan:

system.time(scan(textConnection(S), sep=",", what="a"))
# Read 10000000 items
# user  system elapsed 
# 1.21    0.09    1.42
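
To go from the split to the edited strings, one possible sketch is to drop the unwanted items from each split vector and paste the rest back together. The name drop_items is made up for illustration, and idx refers to item positions rather than character positions: in the question's example, deleting characters 3, 4, 7 and 8 amounts to dropping items 2 and 4.

```r
# Hypothetical helper: split on literal commas, drop items, re-join.
drop_items <- function(strings, idx) {
  parts <- strsplit(strings, split = ",", fixed = TRUE)
  vapply(parts, function(v) {
    if (length(idx) > 0) v <- v[-idx]   # guard: v[-integer(0)] would be empty
    paste(v, collapse = ",")
  }, character(1))
}

drop_items(c("1,2,1,1,2,1,1,1,1,2,1,1", "a,b,c,d,e"), c(2, 4))
# [1] "1,1,2,1,1,1,1,2,1,1" "a,c,e"
```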

Upvotes: 3

IRTFM

Reputation: 263481

Read them in using scan(). You can set the separator with sep="," and request character output with what="a". You can scan one "line" at a time with nlines=1, and if the input is a textConnection, the connection will "remember" where it was after the last read.

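# note: with sep="," (rather than collapse=",") paste() returns 1000 separate
# one-character strings, so the connection holds 1000 one-item lines
x <- paste( sample(0:1, 1000, replace=TRUE), sep=",")
xin <- textConnection(x)

x995 <- scan(xin, sep=",", what="a", nmax=995)
# Read 995 items
# the second call continues where the first stopped and reads the remainder
x5 <- scan(xin, sep=",", what="a", nmax=995)
# Read 5 items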

Here's an illustration with 5 "lines":

> x <- paste( rep( paste(sample(0:1, 50, rep=T), collapse=","),  5),  collapse="\n")
> str(x)
 chr "1,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,1,1,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,1,1,0,0,0,1,0,0\n1,0,0,0,0,1,0,0,1,1,1,0,1,"| __truncated__
> xin <- textConnection(x)
> x1 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x2 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x3 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x4 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x5 <- scan(xin, sep=",", what="a", nlines=1)
Read 50 items
> x6 <- scan(xin, sep=",", what="a", nlines=1)
Read 0 items
> length(x1)
[1] 50
> length(x1[-c(3,4,7,8)])
[1] 46
> paste(x1, collapse=",")
[1] "1,0,0,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,1,1,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,1,1,1,1,0,0,0,1,0,0"
> 
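
Putting the pieces of the session above together, a rough sketch of a loop that reads one line per scan() call, drops the unwanted items and re-joins the rest might look like this. The name drop_items_scan is made up for illustration; idx holds item positions, and the loop assumes that a read returning zero items means the end of the input.

```r
# Hypothetical wrapper around the scan()/textConnection approach.
drop_items_scan <- function(lines, idx) {
  con <- textConnection(lines)            # one element of `lines` per input line
  on.exit(close(con))
  out <- character(0)
  repeat {
    items <- scan(con, sep = ",", what = "a", nlines = 1, quiet = TRUE)
    if (length(items) == 0) break         # nothing left to read
    if (length(idx) > 0) items <- items[-idx]
    out <- c(out, paste(items, collapse = ","))
  }
  out
}

drop_items_scan(c("1,2,3,4", "5,6,7,8"), c(2, 3))
# [1] "1,4" "5,8"
```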

Upvotes: 3
