petermeissner
petermeissner

Reputation: 12890

Improving performance of a loop with succeeding string replacements?


I have (html-)texts and I want to change the ö things to real characters like ä, ü, ö, and so on because otherwise the xml-package does not accept it.

So I wrote a little function which cycles through a replacement table (link1, link2) and does replace special character by special character by sp... the function looks like this (only looonger):

html.charconv <- function(text){
    replacer <- matrix(c(
    "Á",    "&Aacute;",
    "á",    "&aacute;",
    "Â",    "&Acirc;",
    "â",    "&acirc;",
    "´",    "&acute;"
    )
    ,ncol=2,byrow=T)

    for(i in 1:length(replacer[,1])){
        text <- str_replace_all(text,replacer[i,2],replacer[i,1])
    }
    text
}

How might I speed this up? I thought about vectorization but did not come with any helping solution because for each cycle the result of the last cycle is its starting point.

Upvotes: 4

Views: 290

Answers (4)

Romain Francois
Romain Francois

Reputation: 17642

Just for fun, here is a version based on Rcpp.

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
CharacterVector rcpp_conv( 
    CharacterVector text, CharacterVector old , CharacterVector new_){

    int n  = text.size() ;
    int nr = old.size() ;

    std::string buffer, current_old, current_new ;
    size_t pos, current_size ; 
    CharacterVector res(n) ;

    for( int i=0; i<n; i++){
        buffer = text[i] ;
        for( int j=0; j<nr; j++){
             current_old = old[j] ;
             current_size = current_old.size() ;
             current_new = new_[j] ;
             pos = 0 ;   
             pos = buffer.find( current_old ) ;
             while( pos != std::string::npos ){
                 buffer.replace( 
                     pos, current_size, 
                     current_new
                 ) ;
                 pos = buffer.find( current_old ) ;
             }
        }
        res[i] = buffer ;
    }
    return res ;
}

For which I get quite a further performance gain:

> microbenchmark(
+     html.fastconv( sometext,oldchar,newchar),
+     html.fastconvJC(sometext, oldchar, newchar),
+     rcpp_conv( sometext, oldchar, newchar)
+ )
Unit: microseconds
                                         expr    min      lq   median      uq
1   html.fastconv(sometext, oldchar, newchar) 97.588 99.9845 101.4195 103.072
2 html.fastconvJC(sometext, oldchar, newchar) 19.945 23.3060  25.8110  28.134
3       rcpp_conv(sometext, oldchar, newchar)  4.047  5.1555   6.2340   9.275
      max
1 256.061
2  40.647
3  25.763

Here is an implementation based on the Rcpp::String feature, available from Rcpp >= 0.10.2:

class StringConv{
public:
    typedef String result_type ;
    StringConv( CharacterVector old_, CharacterVector new__): 
        nr(old_.size()), old(old_), new_(new__){}

    String operator()(String text) const {
        for( int i=0; i<nr; i++){
            text.replace_all( old[i], new_[i] ) ;
        }     
        return text ;
    }

private:
    int nr ;
    CharacterVector old ;
    CharacterVector new_ ;
} ;

// [[Rcpp::export]]
CharacterVector test_sapply_string( 
   CharacterVector text, CharacterVector old , CharacterVector new_
){
   CharacterVector res = sapply( text, StringConv( old, new_ ) ) ;
   return res ;
}  

Upvotes: 8

John
John

Reputation: 23758

I'm guessing that 36,000 file read and writes is your bottleneck and the way you code in R can't help much with that. Some things just take a while. Your function looks like it will work right, just let it run. There are a few small improvements you could make.

replacer <- matrix(c(
    "Á",    "&Aacute;",
    "á",    "&aacute;",
    "Â",    "&Acirc;",
    "â",    "&acirc;",
    "´",    "&acute;"
    )
    ,ncol=2, byrow=T)

html.fastconvJC <- function(x,old,new){
    n <- length(new)
    s <- x #make a copy cause I'm scared of scoping in R :)
    for (i in 1:n) s <- gsub(old[i], new[i], s, fixed = TRUE)
    s
    }

# borrowing the strings from Joris Meys
benchmark(html.fastconvJC(sometext, replacer[,2], replacer[,1]),
      html.charconv(sometext), columns = c("test", "elapsed", "relative"),
      replications=1000)

                                                     test elapsed relative
2                                 html.charconv(sometext)   0.727    17.31
1 html.fastconvJC(sometext, replacer[, 2], replacer[, 1])   0.042     1.00

And they increased speed more than I expected. Note that a huge part of that speedup is making fixed = TRUE, otherwise Joris Meys answer comes in about the same speed.

If this doesn't get your far in overall speed you know your bottleneck is elsewhere, likely file reads and writes. Unless you have solid state or RAID drives, running this in parallel isn't going to speed anything up and might just slow it down.

Upvotes: 5

Joris Meys
Joris Meys

Reputation: 108583

You can get a significant speedup by constructing your function a bit different, and forget about the text tools. Basically you :

  1. split the character string
  2. match the characters you want and replace them by the new characters
  3. paste everything together again

You can do that with following function :

html.fastconv <- function(x,old,new){
    xs <- strsplit(x,"&|;")
    old <- gsub("&|;","",old)
    xs <- lapply(xs,function(i){
        id <- match(i,old,0L)
        i[id!=0] <- new[id]
        return(i)
    })
    sapply(xs,paste,collapse="")
}

This works as :

> sometext <- c("&Aacute;dd som&aacute; le&Acirc;tter&acirc; acute problems et&acute; cetera",
+  "&Aacute;dd som&aacute; le&Acirc;tter&acirc; acute p ..." ... [TRUNCATED] 

> newchar <- c("Á","á","Â","â","´")

> oldchar <- c("&Aacute;","&aacute;","&Acirc;","&acirc;","&acute;")
> html.fastconv(sometext,oldchar,newchar)
[1] "Ádd somá leÂtterâ acute problems et´ cetera" "Ádd somá leÂtterâ acute problems et´ cetera"

For the record, some benchmarking :

require(rbenchmark)
benchmark(html.fastconv(sometext,oldchar,newchar),html.charconv(sometext),
     columns=c("test","elapsed","relative"),
     replications=1000) 
                                       test elapsed relative
2                   html.charconv(sometext)    0.79    5.643
1 html.fastconv(sometext, oldchar, newchar)    0.14    1.000

Upvotes: 8

agstudy
agstudy

Reputation: 121598

I will try with plyr :

input.data <- llply(input.files, html.charconv, .parallel=TRUE) 

Upvotes: -1

Related Questions