Reputation: 2133
Hello stackoverflowers,
I wonder if I could use the %like% operator row-wise in the datatable between two columns of the same datatable.
The following reproducible example will make it more clear.
First prepare the data
library(data.table)
iris <- as.data.table(iris)
iris <- iris[seq.int(from = 1, to = 150,length.out = 5)]
iris[, Species2 := c('set', "set|vers", "setosa", "nothing" , "virginica")]
Hence the dataset looks as follows.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2
1: 5.1 3.5 1.4 0.2 setosa set
2: 4.9 3.6 1.4 0.1 setosa set|vers
3: 6.4 2.9 4.3 1.3 versicolor setosa
4: 6.4 2.7 5.3 1.9 virginica nothing
5: 5.9 3.0 5.1 1.8 virginica virginica
I would like to use something like the following command row-wise.
iris[Species%like%Species2]
but it does not understand that I want it row-wise. Is that possible? The result should be the 1,2,5 rows.
Upvotes: 2
Views: 115
Reputation: 14360
You can't pass a vector to the pattern
argument of %like%
since it calls upon grepl/grep
and these aren't vectorized. You could use mapply
to call %like%
for each row to get what you want:
iris[mapply(function(x,y) x %like% y, Species, Species2) ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2
#1: 5.1 3.5 1.4 0.2 setosa set
#2: 4.9 3.6 1.4 0.1 setosa set|vers
#3: 5.9 3.0 5.1 1.8 virginica virginica
Microbenchmark mainly for my own curiosity, but for anyone else interested:
set.seed(1)
dt <- data.table(Species = replicate(100000, paste0(sample(LETTERS, 6), collapse = "")),
Species2 = replicate(100000, paste0(sample(LETTERS, 3), collapse = "")))
microbenchmark::microbenchmark( mapply = dt[mapply(function(x,y) x %like% y, Species, Species2) ],
by_group1 = dt[, .SD[Species[1L] %like% Species2[1L]], by = .(Species, Species2)],
by_group2 = dt[, .SD[Species %like% Species2], by = 1:nrow(dt)],
str_detect = dt[stri_detect_regex(Species, Species2)],
by_species2 = dt[,.SD[Species %like% Species2], by = Species2],
by_species2I = dt[dt[, .I[Species %like% Species2], by = Species2]$V1],
times = 5)
Unit: milliseconds
expr min lq mean median uq max neval
mapply 669.9691 680.2241 700.3758 685.8262 715.8373 750.0224 5
by_group1 10906.2179 10908.0985 10951.5651 10914.7002 11009.0683 11019.7408 5
by_group2 16738.4390 16826.4793 16907.8428 16902.9490 16970.6143 17100.7324 5
str_detect 430.7768 431.1002 432.2279 431.9284 433.3488 433.9855 5
by_species2 2482.7583 2518.6858 2547.5882 2531.4913 2599.0159 2605.9899 5
by_species2I 110.1486 114.6775 115.9223 117.5270 118.5033 118.7553 5
Only ran it 5 times since the by_group*
operations were so slow. Looks like @eddi's method using .I
is that fastest (assuming I have his intended method correct).
Also, re-ran the benchmark using fewer groups, it seems in this case the by_species2I
is still the fastest, and the other by_group*
are still slowest by a lot (makes sense since the # of groups for by_group2
is always the data size and for by_group1
it's going to be close to the data size).
set.seed(1)
dt <- data.table(Species = replicate(100000, paste0(sample(LETTERS, 3), collapse = "")),
Species2 = replicate(100000, paste0(sample(LETTERS, 2), collapse = "")))
Unit: milliseconds
expr min lq mean median uq max neval
mapply 611.83085 617.60180 639.7778 638.49061 652.80619 678.15932 5
by_group1 10021.48177 10121.00419 10145.6305 10123.01354 10213.37976 10249.27339 5
by_group2 15828.21224 15997.56034 16018.9583 16066.07284 16101.40961 16101.53651 5
str_detect 416.44549 419.83585 420.6042 421.69423 421.85359 423.19194 5
by_species2 106.06793 114.02764 115.5364 117.62331 118.04524 121.91770 5
by_species2I 14.22369 14.72001 15.2137 15.24514 15.38371 16.49597 5
Upvotes: 2
Reputation: 49448
The various answers have the right bits and pieces - do it by Species2
:
iris[, .SD[Species %like% Species2], by = Species2]
And if that's slow - use .I instead.
Upvotes: 2
Reputation: 37879
One way would be to group by row:
iris[, .SD[Species %like% Species2], by = 1:5]
# : Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2
#1: 1 5.1 3.5 1.4 0.2 setosa set
#2: 2 4.9 3.6 1.4 0.1 setosa set|vers
#3: 5 5.9 3.0 5.1 1.8 virginica virginica
Or as per @docendodiscimus 's comment, in case there are duplicate entries, you can do:
iris[, .SD[Species[1L] %like% Species2[1L]], by = .(Species, Species2)]
Upvotes: 2
Reputation: 5263
%like%
is just a wrapper around grepl
, so the pattern (right-hand side) can only be length 1. You should be seeing a warning about this.
The stringi
package lets you vectorize the pattern
argument.
library(stringi)
iris[stri_detect_regex(Species, Species2)]
If you like the operator style instead of the function, you can make your own:
`%vlike%` <- function(x, y) {
stri_detect_regex(x, y)
}
iris[Species %vlike% Species2]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2
# 1: 5.1 3.5 1.4 0.2 setosa set
# 2: 4.9 3.6 1.4 0.1 setosa set|vers
# 3: 5.9 3.0 5.1 1.8 virginica virginica
Upvotes: 2