Reputation: 105
The data format is "Gender-Bornyear-name-country", so the data looks like:
[[1]] M-1900-Chambers-us
[[2]] F-1900-Calin-Sanchez-es
[[3]] M-1900-Aboul-Enein-us
...
...
I tried
strsplit(as.character(data), "\\-")
but because some names contain hyphens themselves, the name splits into one part for some records and two or three parts for others. If I just want to extract the country:
split_data <- strsplit(as.character(data), "\\-")
lapply(split_data, function(x) x[length(x)])
Is this the best way? And what if I want to extract the name?
Upvotes: 2
Views: 74
Reputation: 23101
Here are three options that all extract the country from the sample data provided, followed by benchmark results. On this sample, gsub is the fastest:
unlist(lapply(strsplit(as.character(data), "\\-"), function(x)x[length(x)]))
#[1] "us" "es" "us"
gsub('.*-([^-]+)$', '\\1', data)
#[1] "us" "es" "us"
library(stringr) # str_match_all comes from stringr, so load it first
do.call(rbind, str_match_all(data, '.*-([^-]+)$'))[,2]
#[1] "us" "es" "us"
library(microbenchmark)
# Passed to microbenchmark's check argument: TRUE only if every expression returns the same result
check.identical <- function(values) {
  all(sapply(values[-1], function(x) identical(values[[1]], x)))
}
microbenchmark(unlist(lapply(strsplit(as.character(data), "\\-"), function(x) x[length(x)])),
               gsub('.*-([^-]+)$', '\\1', data),
               do.call(rbind, str_match_all(data, '.*-([^-]+)$'))[,2],
               check=check.identical)
Unit: microseconds
                                                                          expr    min      lq     mean  median      uq     max neval cld
 unlist(lapply(strsplit(as.character(data), "\\-"), function(x) x[length(x)])) 15.396 16.4655 20.09603 18.3895 20.3145  87.670   100   b
                                              gsub(".*-([^-]+)$", "\\1", data) 11.975 13.6850 15.31916 15.3960 16.6790  27.799   100   a
                       do.call(rbind, str_match_all(data, ".*-([^-]+)$"))[, 2] 35.923 37.6340 43.93346 39.7720 41.4830 149.679   100   c
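The question also asks how to get the name out. Here is a minimal sketch of one way to do it, assuming every record has exactly the shape "Gender-Bornyear-name-country", that only the name can contain extra hyphens, and that data can be treated as a plain character vector (the vector below just retypes the three sample records). The column names in the second part are made up for illustration.

```r
# Sample records from the question, retyped as a character vector (assumption).
data <- c("M-1900-Chambers-us", "F-1900-Calin-Sanchez-es", "M-1900-Aboul-Enein-us")

# Anchor on the first two fields and the last field; whatever sits between
# them is the name, including any hyphens it contains.
name <- sub("^[^-]+-[^-]+-(.*)-[^-]+$", "\\1", data)
name
#[1] "Chambers"      "Calin-Sanchez" "Aboul-Enein"

# The same pattern with four capture groups pulls out every field at once.
parts <- regmatches(data, regexec("^([^-]+)-([^-]+)-(.*)-([^-]+)$", data))
fields <- do.call(rbind, parts)[, -1]  # drop column 1, the full match
colnames(fields) <- c("gender", "born", "name", "country")  # illustrative names
```

This uses only base R. It relies on the first two fields and the country never containing a hyphen; a record that breaks that assumption will not match, and sub will return it unchanged.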
Upvotes: 1