Reputation: 169
I have a data.table with some values ('value'), lower limits ('min_val') and upper limits ('max_val'):
    value min_val max_val
1: 94.001   94.00   94.02
2: 94.002   94.00   94.03
3: 94.003   94.01   94.04
4: 95.000   94.98   95.02
For each row, I want to count how many of the values from the whole dt fall strictly inside that row's limits, i.e. value > min_val & value < max_val:
    value min_val max_val count
1: 94.001   94.00   94.02     1   # (num of value(s) > 94.00 & < 94.02)
2: 94.002   94.00   94.03     2
3: 94.003   94.01   94.04     2
4: 95.000   94.98   95.02     1
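For reference, the sample data can be rebuilt like this (a sketch reconstructed from the table above):
library(data.table)
dt <- data.table(
  value   = c(94.001, 94.002, 94.003, 95),
  min_val = c(94.00, 94.00, 94.01, 94.98),
  max_val = c(94.02, 94.03, 94.04, 95.02)
)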
I've tried
dt[, count := nrow(dt[value > dt[min_val] & value < dt[max_val]])]
but I'm on the wrong path.
Upvotes: 6
Views: 782
Reputation: 3650
Borrowing from @chinsoon12's answer, we can sort the values and the ranges first, which lets us add some tricks to the C++ code to skip comparisons of out-of-range values:
cppFunction("
IntegerVector f2(NumericVector v, NumericVector minv, NumericVector maxv) {
int n = v.size();
IntegerVector res(n);
int i2 = 0;
for (int r=0; r<n; r++) {
for (int i=i2; i<n; i++) {
if (v[i] > minv[r]) {
if (v[i] < maxv[r]) {
res[r]++;
} else {
break;
}
} else {
i2 = i;
}
}
}
return res;}")
Create fake data:
library(data.table)
n <- 8e7
g <- 100   # interval width; not defined in the original snippet (assumption) -- any divisor of n works
set.seed(1)
x <- sample.int(n, n, replace = TRUE)
a <- seq(0, n - g, by = g)
p1 <- data.table(min_val = a, max_val = a + g)
d <- rbindlist(lapply(1:g, function(x) p1))   # replicate the intervals g times -> n rows
d[, value := x]
Benchmark:
# to use new rcpp function we need to sort values in new vector:
v2 <- sort(d$value)
d[, id := .I] # add id column if we want to sort the data back
setkey(d, min_val) # sort data.table by min_value
system.time(
d[, count2 := d[d, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N]
) # 54 seconds henrik
system.time(
d[, count4 := f2(v2, min_val, max_val)]
) # 1.4 seconds
all.equal(d$count2, d$count4)
# TRUE
Even if we include the time needed to sort the data, this should be much faster. The real timing of course depends on your exact data. I would suggest testing on a sample of your data first, not on all 80e6 rows...
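If the original row order is needed afterwards, the id column created above can be used to restore it (a small sketch, not part of the original answer):
# restore the original row order using the id column created above
setorder(d, id)
d[, id := NULL]   # drop the helper column once it is no longer needed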
Upvotes: 1
Reputation: 42544
The OP has disclosed that his production dataset consists of 81 million rows. Unfortunately, r2evans' benchmark used only the 4-row sample dataset provided with the question and neglected henrik's suggestion. To find the best approach for a large dataset like the OP's, I find it worthwhile to run benchmarks with varying, realistic problem sizes in order to measure run time as well as memory consumption.
Run time and memory consumption may depend on
1. the number of values,
2. the number of intervals,
3. the number of values falling within each interval.
Items 1 and 2 are linked, as the number of values and the number of intervals are both given by the number of rows in the dataset. So there are two size parameters we can vary: the number of rows n and the number of values within each interval m. In order to have reproducible and predictable benchmark data, the benchmark datasets are constructed by
d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]
In case of n <- 4 and m <- 1, d becomes
   value min_val max_val count
   <num>   <num>   <num> <int>
1:     1       0       2    -1
2:     2       1       3    -1
3:     3       2       4    -1
4:     4       3       5    -1
In order to create equal conditions for each benchmark run, the count column is pre-allocated with some dummy data.
The benchmark includes
- henrik's aggregation in a non-equi self join,
- three variants posted by r2evans,
- ThomasIsCoding's approach, and
- Alexandre's approach.
Edit: A 3rd set of benchmark runs compares henrik's approach with chinsoon12's new Rcpp solution (see below).
Unfortunately, TarJae's answer did not work for me.
library(data.table)
library(bench)
library(ggplot2)
options(datatable.print.class = TRUE)
bm1 <- press(
  n = 2^c(2, 10, 13),
  m = 2^c(0, 9, 12),
  {
    d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]
    mark(
      henrik = {
        d[, count := d[d, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N]
      },
      r2evans0 = {
        d[, count := rowSums(outer(seq_len(.N), value, function(i, v) {min_val[i] < v & v < max_val[i];}))]
      },
      r2evans1 = {
        d[, count := mapply(function(mi, ma) sum(mi < value & value < ma), min_val, max_val)]
      },
      r2evans2 = {
        d[, count := rowSums(outer(min_val, d$value, `<`) &
                               outer(max_val, d$value, `>`)),
          by = .(g = seq_len(nrow(d)) %/% 100)]
      },
      Thomas = {
        d[, count := colSums(outer(value, min_val, ">") & outer(value, max_val, "<"))]
      },
      Alexandre = {
        d[, count := lapply(
          # seq.int(1, nrow(d)),
          seq_len(nrow(d)),
          function(i) sum(d[, value] > d[i, min_val] & d[, value] < d[i, max_val])
        )]
      },
      min_iterations = 3
    )
  }
)
autoplot(bm1)
Please note the logarithmic time scale.
The chart exhibits that
- for the largest n, henrik's approach becomes neck and neck with r2evans1 (worth to be investigated further),
- m seems to have no or little effect on run time.
The latter can be verified by changing the facets and plotting the median times for different m in one facet:
ggplot(bm1) +
aes(x = median, y = expression, color = as.factor(m)) +
geom_point() +
facet_wrap(vars(n))
In the next chart below, plotting mem_alloc instead of median times exhibits that
- m has no effect on memory allocation (with one exception),
- for larger n, henrik's approach needs magnitudes less memory than any other approach.
Please note the log scale.
Based on the previous results, the next set of benchmark runs varies only the size parameter n while m is kept constant. Alexandre's approach is omitted as it is too slow.
n is varied from 2^10 (1024) to 2^14 (16384) with m = 1.0. Unfortunately, the run with n = 2^15 was aborted due to lack of memory on my PC.
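The code creating bm2 is not shown here; a sketch along the lines of the bm1 run above (with the parameters described in the text, exact details assumed) could look like this:
bm2 <- press(
  n = 2^(10:14),
  {
    m <- 1.0
    d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]
    mark(
      henrik = {
        d[, count := d[d, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N]
      },
      r2evans0 = {
        d[, count := rowSums(outer(seq_len(.N), value, function(i, v) {min_val[i] < v & v < max_val[i];}))]
      },
      r2evans1 = {
        d[, count := mapply(function(mi, ma) sum(mi < value & value < ma), min_val, max_val)]
      },
      r2evans2 = {
        d[, count := rowSums(outer(min_val, d$value, `<`) &
                               outer(max_val, d$value, `>`)),
          by = .(g = seq_len(nrow(d)) %/% 100)]
      },
      Thomas = {
        d[, count := colSums(outer(value, min_val, ">") & outer(value, max_val, "<"))]
      },
      min_iterations = 3
    )
  }
)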
autoplot(bm2)
henrik's approach has taken the lead in terms of speed for the 2^14 (16384) rows case.
To identify if this indicates a trend, run time vs problem size n is plotted by
ggplot(bm2) +
aes(x = n, y = median, color = expression, group = attr(expression, "description"),
label = ifelse(n == max(n), attr(expression, "description"), "")) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_x_continuous(trans = scales::log2_trans(), expand = expansion(mult = c(0.05, 0.1))) +
ggrepel::geom_label_repel(direction = "y", nudge_x = 1000)
henrik's approach seems to have a high overhead but gains a speed advantage with increasing problem size.
Also with respect to memory allocation, henrik's approach of aggregating in a non-equi self join seems to need substantially less memory than the other approaches. More importantly, memory allocation increases less steeply with problem size, which indicates that this approach can handle much larger problem sizes when available computer memory is a limiting factor.
This set of benchmark runs compares henrik's aggregate in a non-equi self join with chinsoon12's new Rcpp solution. Due to the much smaller memory footprint of both approaches, the problem size can be increased up to 2^18 (262144) rows before hitting the 16 GB memory limit on my Windows PC.
library(Rcpp)
bm4 <- press(
  n = 2^(10:18),
  {
    m <- 1.
    d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]
    mark(
      henrik = {
        d[, count := d[d, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N]
      },
      chinsoon = {
        cppFunction("IntegerVector inrange(NumericVector v, NumericVector minv, NumericVector maxv) {
          int n = v.size();
          IntegerVector res(n);
          for (int r = 0; r < n; r++) {
            for (int i = 0; i < n; i++) {
              if (v[i] > minv[r] && v[i] < maxv[r]) {
                res[r]++;
              }
            }
          }
          return res;
        }")
        d[, count := inrange(value, min_val, max_val)]
      },
      min_iterations = 3
    )
  }
)
The next two charts show the median run time and the memory allocation vs problem size, respectively (please note the log scales):
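The plotting code is not included in the original; charts of this kind can be produced along these lines (a sketch, assuming the bm4 object from above):
# median run time (seconds) vs problem size, log-log
ggplot(bm4) +
  aes(x = n, y = as.numeric(median), color = attr(expression, "description")) +
  geom_point() +
  geom_line() +
  scale_x_continuous(trans = scales::log2_trans()) +
  scale_y_log10() +
  labs(y = "median run time [s]", color = "approach")

# memory allocation (bytes) vs problem size, log-log
ggplot(bm4) +
  aes(x = n, y = as.numeric(mem_alloc), color = attr(expression, "description")) +
  geom_point() +
  geom_line() +
  scale_x_continuous(trans = scales::log2_trans()) +
  scale_y_log10() +
  labs(y = "memory allocated [bytes]", color = "approach")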
Results for n = 2^18 (262144):
setDT(bm4)[n == 2^18, .(expression = unique(attr(expression, "description")),
n, median, mem_alloc)]
   expression      n       median     mem_alloc
       <char>  <num> <bench_time> <bench_bytes>
1:     henrik 262144       17.47s       12.14MB
2:   chinsoon 262144        1.03m        2.06MB
Apparently, chinsoon's approach is faster for problem sizes up to 2^16 (65536) while henrik's approach is faster for larger problem sizes (and seems to have a more linear time behaviour). For problem size n = 2^18, henrik's approach is almost 4 times faster than chinsoon's.
On the other hand, henrik's approach allocates much more memory than chinsoon's. For problem size n = 2^18, henrik's approach allocates about 6 times more memory than chinsoon's. Apparently, this ratio is constant for increasing problem size.
So, there is a tradeoff between speed (henrik's approach) and memory requirement (chinsoon's approach) depending on problem size. Your mileage may vary.
Upvotes: 8
Reputation: 25225
Here is another option using Rcpp given the size of the actual dataset:
library(Rcpp)
cppFunction("IntegerVector betnRcpp(NumericVector v, NumericVector minv, NumericVector maxv) {
  int n = v.size();
  IntegerVector res(n);
  // brute-force double loop: for each interval r, count the values strictly inside it
  for (int r = 0; r < n; r++) {
    for (int i = 0; i < n; i++) {
      if (v[i] > minv[r] && v[i] < maxv[r]) {
        res[r]++;
      }
    }
  }
  return res;
}")
# same df as in ThomasIsCoding's answer
setDT(df)[, count := betnRcpp(value, min_val, max_val)]
Upvotes: 1
Reputation: 160407
Edit: with 23M rows, you'll need a slightly different strategy. The original answer is kept below this.
For much larger datasets, outer (by itself) is not a safe method. Two alternative approaches, both of which will be slower (but that is the tradeoff you must make if you are memory-constrained):
Line-by-line (of min_val and max_val); this is the simpler of the two, while giving the desired results.
dat[, count := mapply(function(mi,ma) sum(mi < value & value < ma), min_val, max_val)]
# value min_val max_val count
# <num> <num> <num> <int>
# 1: 94.01 94.00 94.02 1
# 2: 94.02 94.00 94.03 2
# 3: 94.03 94.01 94.04 2
# 4: 95.00 94.98 95.02 1
Faster (and borrowing from ThomasIsCoding's method), grouped, but a little more complicated. Effectively using outer on n rows at a time.
dat[, count := rowSums(outer(min_val, dat$value, `<`) &
outer(max_val, dat$value, `>`)),
by = .(g = seq_len(nrow(dat)) %/% 100)]
In this example, we're working on 100 rows at a time for the *_val variables, comparing each chunk against the whole value column (referenced as dat$value inside the grouped call so the grouping does not subset it). (You may need to play around to find the best chunk size instead of 100: you'll know when you go too high.)
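One way to experiment with the chunk size is to time the grouped call for a few candidate values (a sketch, not from the original answer; the chunk sizes tried here are arbitrary):
# try a few chunk sizes; bigger chunks run faster but allocate larger outer() matrices
for (chunk in c(100, 1000, 10000)) {
  elapsed <- system.time(
    dat[, count := rowSums(outer(min_val, dat$value, `<`) &
                             outer(max_val, dat$value, `>`)),
        by = .(g = seq_len(nrow(dat)) %/% chunk)]
  )[["elapsed"]]
  cat("chunk size", chunk, ":", elapsed, "seconds\n")
}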
Original solution provided:
dat[, count := rowSums(outer(seq_len(.N), value, function(i, v) min_val[i] < v & v < max_val[i]))]
# value min_val max_val count
# <num> <num> <num> <num>
# 1: 94.01 94.00 94.02 1
# 2: 94.02 94.00 94.03 2
# 3: 94.03 94.01 94.04 2
# 4: 95.00 94.98 95.02 1
Quick walk-through:
outer(.,.,.) does an outer-product, where we look at every row-index (seq_len(.N)) against every value; the reason we need to compare against the row indices is that we need both min_val and max_val, and that isn't easy with a two-argument function;
the function we give to outer is called once, with two vectors:
cbind(i, v)
# i v
# [1,] 1 94.001
# [2,] 2 94.001
# [3,] 3 94.001
# [4,] 4 94.001
# [5,] 1 94.002
# [6,] 2 94.002
# [7,] 3 94.002
# [8,] 4 94.002
# [9,] 1 94.003
# [10,] 2 94.003
# [11,] 3 94.003
# [12,] 4 94.003
# [13,] 1 95.000
# [14,] 2 95.000
# [15,] 3 95.000
# [16,] 4 95.000
This works fine for us, because min_val[i] produces 16 numbers, same with max_val, and v is already the numbers we want to compare.
the outer(.) returns a matrix:
# dat[, outer(seq_len(.N), value, function(i, v) min_val[i] < v & v < max_val[i])]
# # [,1] [,2] [,3] [,4]
# # [1,] TRUE FALSE FALSE FALSE
# # [2,] TRUE TRUE FALSE FALSE
# # [3,] FALSE TRUE TRUE FALSE
# # [4,] FALSE FALSE FALSE TRUE
where in row 1, each TRUE column marks a value that falls within row 1's min_val and max_val constraints. From there, we just need to rowSums the matrix to get our 1, 2, 2, and 1.
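Putting that together (a quick check; mat is just a hypothetical name for the matrix shown above):
mat <- dat[, outer(seq_len(.N), value, function(i, v) min_val[i] < v & v < max_val[i])]
rowSums(mat)
# [1] 1 2 2 1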
Note: based on your comments and expected output, I believe your value should really be 94.01 and not 94.001 (etc).
dat <- setDT(structure(list(value = c(94.01, 94.02, 94.03, 95), min_val = c(94, 94, 94.01, 94.98), max_val = c(94.02, 94.03, 94.04, 95.02)), class = c("data.table", "data.frame"), row.names = c(NA, -4L)))
(Honestly, I was expecting ThomasIsCoding's double-outer to have some penalty, but apparently not. My guess is that my indirection using i is just as barely-noticeable as a double-outer.)
bench::mark(
  r2evans = dat[, count := rowSums(outer(seq_len(.N), value, function(i, v) {min_val[i] < v & v < max_val[i];}))],
  ThomasIsCoding = dat[, count := colSums(outer(value, min_val, ">") & outer(value, max_val, "<"))],
  # AlexandreLeonard's snippet, run against the same dat object
  AlexandreLeonard = dat[,
    count := lapply(seq.int(1, nrow(dat)), function(i) {
      sum(dat[, value] > dat[i, min_val] & dat[, value] < dat[i, max_val])
    })]
)
# # A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 r2evans 319.3us 347.7us 2449. 33KB 4.24 1154 2 471ms <data.ta~ <Rprofme~ <bch:~ <tibbl~
# 2 ThomasIsCoding 301.5us 344.05us 2454. 32.6KB 2.05 1196 1 487ms <data.ta~ <Rprofme~ <bch:~ <tibbl~
# 3 AlexandreLeonard 3.77ms 4.44ms 205. 408.8KB 4.31 95 2 464ms <data.ta~ <Rprofme~ <bch:~ <tibbl~
Upvotes: 4
Reputation: 78917
We could use add_count with between:
library(dplyr)
df %>%
add_count(between(value, min_val, max_val)) %>%
select(1:4)
value min_val max_val count
1: 94.001 94.00 94.02 3
2: 94.002 94.00 94.03 3
3: 94.003 94.01 94.04 0
4: 95.000 94.98 95.02 1
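Note that add_count() here counts rows per group of the between() flag rather than per interval; a row-wise sketch (added for illustration, not part of the original answer) that matches the per-row logic asked for in the question could be:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(count = sum(df$value > min_val & df$value < max_val)) %>%
  ungroup()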
Upvotes: 1
Reputation: 101099
We can try outer like this
> setDT(df)[, count := colSums(outer(value, min_val, ">") & outer(value, max_val, "<"))][]
value min_val max_val count
1: 94.001 94.00 94.02 3
2: 94.002 94.00 94.03 3
3: 94.003 94.01 94.04 0
4: 95.000 94.98 95.02 1
> dput(df)
structure(list(value = c(94.001, 94.002, 94.003, 95), min_val = c(94,
94, 94.01, 94.98), max_val = c(94.02, 94.03, 94.04, 95.02)), class = "data.frame", row.names = c(NA,
-4L))
Upvotes: 5
Reputation: 406
This is not nice but it does the job:
dt[,
count := lapply(seq.int(1, nrow(dt)), function(i) {
sum(dt[, value] > dt[i, min_val] & dt[, value] < dt[i, max_val])
})
]
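Since lapply() returns a list, count becomes a list column here; a vapply() variant (a small sketch, not part of the original answer) keeps it a plain integer column:
dt[,
   count := vapply(seq_len(nrow(dt)), function(i) {
     sum(dt[, value] > dt[i, min_val] & dt[, value] < dt[i, max_val])
   }, integer(1L))
]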
Upvotes: 0