Reputation: 12890
I have (html-)texts and I want to change the ö
things to real characters like ä, ü, ö, and so on because otherwise the xml-package does not accept it.
So I wrote a little function which cycles through a replacement table (link1, link2) and does replace special character by special character by sp... the function looks like this (only looonger):
html.charconv <- function(text){
replacer <- matrix(c(
"Á", "Á",
"á", "á",
"Â", "Â",
"â", "â",
"´", "´"
)
,ncol=2,byrow=T)
for(i in 1:length(replacer[,1])){
text <- str_replace_all(text,replacer[i,2],replacer[i,1])
}
text
}
How might I speed this up? I thought about vectorization but did not come with any helping solution because for each cycle the result of the last cycle is its starting point.
Upvotes: 4
Views: 290
Reputation: 17642
Just for fun, here is a version based on Rcpp
.
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::export]]
CharacterVector rcpp_conv(
CharacterVector text, CharacterVector old , CharacterVector new_){
int n = text.size() ;
int nr = old.size() ;
std::string buffer, current_old, current_new ;
size_t pos, current_size ;
CharacterVector res(n) ;
for( int i=0; i<n; i++){
buffer = text[i] ;
for( int j=0; j<nr; j++){
current_old = old[j] ;
current_size = current_old.size() ;
current_new = new_[j] ;
pos = 0 ;
pos = buffer.find( current_old ) ;
while( pos != std::string::npos ){
buffer.replace(
pos, current_size,
current_new
) ;
pos = buffer.find( current_old ) ;
}
}
res[i] = buffer ;
}
return res ;
}
For which I get quite a further performance gain:
> microbenchmark(
+ html.fastconv( sometext,oldchar,newchar),
+ html.fastconvJC(sometext, oldchar, newchar),
+ rcpp_conv( sometext, oldchar, newchar)
+ )
Unit: microseconds
expr min lq median uq
1 html.fastconv(sometext, oldchar, newchar) 97.588 99.9845 101.4195 103.072
2 html.fastconvJC(sometext, oldchar, newchar) 19.945 23.3060 25.8110 28.134
3 rcpp_conv(sometext, oldchar, newchar) 4.047 5.1555 6.2340 9.275
max
1 256.061
2 40.647
3 25.763
Here is an implementation based on the Rcpp::String
feature, available from Rcpp >= 0.10.2
:
class StringConv{
public:
typedef String result_type ;
StringConv( CharacterVector old_, CharacterVector new__):
nr(old_.size()), old(old_), new_(new__){}
String operator()(String text) const {
for( int i=0; i<nr; i++){
text.replace_all( old[i], new_[i] ) ;
}
return text ;
}
private:
int nr ;
CharacterVector old ;
CharacterVector new_ ;
} ;
// [[Rcpp::export]]
CharacterVector test_sapply_string(
CharacterVector text, CharacterVector old , CharacterVector new_
){
CharacterVector res = sapply( text, StringConv( old, new_ ) ) ;
return res ;
}
Upvotes: 8
Reputation: 23758
I'm guessing that 36,000 file read and writes is your bottleneck and the way you code in R can't help much with that. Some things just take a while. Your function looks like it will work right, just let it run. There are a few small improvements you could make.
replacer <- matrix(c(
"Á", "Á",
"á", "á",
"Â", "Â",
"â", "â",
"´", "´"
)
,ncol=2, byrow=T)
html.fastconvJC <- function(x,old,new){
n <- length(new)
s <- x #make a copy cause I'm scared of scoping in R :)
for (i in 1:n) s <- gsub(old[i], new[i], s, fixed = TRUE)
s
}
# borrowing the strings from Joris Meys
benchmark(html.fastconvJC(sometext, replacer[,2], replacer[,1]),
html.charconv(sometext), columns = c("test", "elapsed", "relative"),
replications=1000)
test elapsed relative
2 html.charconv(sometext) 0.727 17.31
1 html.fastconvJC(sometext, replacer[, 2], replacer[, 1]) 0.042 1.00
And they increased speed more than I expected. Note that a huge part of that speedup is making fixed = TRUE
, otherwise Joris Meys answer comes in about the same speed.
If this doesn't get your far in overall speed you know your bottleneck is elsewhere, likely file reads and writes. Unless you have solid state or RAID drives, running this in parallel isn't going to speed anything up and might just slow it down.
Upvotes: 5
Reputation: 108583
You can get a significant speedup by constructing your function a bit different, and forget about the text tools. Basically you :
You can do that with following function :
html.fastconv <- function(x,old,new){
xs <- strsplit(x,"&|;")
old <- gsub("&|;","",old)
xs <- lapply(xs,function(i){
id <- match(i,old,0L)
i[id!=0] <- new[id]
return(i)
})
sapply(xs,paste,collapse="")
}
This works as :
> sometext <- c("Ádd somá leÂtterâ acute problems et´ cetera",
+ "Ádd somá leÂtterâ acute p ..." ... [TRUNCATED]
> newchar <- c("Á","á","Â","â","´")
> oldchar <- c("Á","á","Â","â","´")
> html.fastconv(sometext,oldchar,newchar)
[1] "Ádd somá leÂtterâ acute problems et´ cetera" "Ádd somá leÂtterâ acute problems et´ cetera"
For the record, some benchmarking :
require(rbenchmark)
benchmark(html.fastconv(sometext,oldchar,newchar),html.charconv(sometext),
columns=c("test","elapsed","relative"),
replications=1000)
test elapsed relative
2 html.charconv(sometext) 0.79 5.643
1 html.fastconv(sometext, oldchar, newchar) 0.14 1.000
Upvotes: 8
Reputation: 121598
I will try with plyr :
input.data <- llply(input.files, html.charconv, .parallel=TRUE)
Upvotes: -1