Reputation: 55
How to avoid using the for loop in the following code to speed up the computation (the real data is about 1e6 times larger)
id = rep(1:5, 20)
v = 1:100
df = data.frame(groupid = id, value = v)
df = dplyr::arrange(df, groupid)
bkt = rep(seq(0, 100, length.out = 4), 5)
id = rep(1:5, each = 4)
bktpts = data.frame(groupid = id, value = bkt)
for (i in 1:5) {
df[df$groupid == i, "bin"] = cut(df[df$groupid == i, "value"],
bktpts[bktpts$groupid == i, "value"],
include.lowest = TRUE, labels = F)
}
Upvotes: 3
Views: 64
Reputation: 66819
Another way:
library(data.table)
setDT(df); setDT(bktpts)
bktpts[, b := rowid(groupid) - 1L]
df[, b := bktpts[copy(.SD), on=.(groupid, value), roll = -Inf, x.b]]
# check result
df[, any(b != bin)]
# [1] FALSE
See ?data.table
for how rolling joins work.
Upvotes: 2
Reputation: 6486
I came out with another data.table
answer:
library(data.table) # load package
# set to data.table
setDT(df)
setDT(bktpts)
# Make a join
df[bktpts[, list(.(value)), by = groupid], bks := V1, on = "groupid"]
# define the bins:
df[, bin := cut(value, bks[[1]], include.lowest = TRUE, labels = FALSE), by = groupid]
# remove the unneeded bks column
df[, bks := NULL]
Explaining the code:
bktpts[, list(.(value)), by = groupid]
is a new table that has in a list al the values of value
for each groupid
. If you run it alone, you'll understand where we're going.
bks := V1
assigns to variable bks
in df
whatever exists in V1
, which is the name of the list column in the previous table. Of course on = "groupid"
is the variable on which we make the join.
The code defining the bins needs little explanation, except by the bks[[1]]
bit. It needs to be [[
in order to access the list values and provide a vector, as required by the cut
function.
EDIT TO ADD:
All data.table commands can be chained in a -rather unintelligible- single call:
df[bktpts[, list(.(value)), by = groupid],
bks := V1,
on = "groupid"][,
bin := cut(value,
bks[[1]],
include.lowest = TRUE,
labels = FALSE),
by = groupid][,
bks := NULL]
Upvotes: 1
Reputation: 27732
I'm not sure why yout bktpts
is formatted like it is?
But here is a data.table slution that should be (at least a bit) faster than your for-loop.
library( data.table )
setDT(df)[ setDT(bktpts)[, `:=`( id = seq_len(.N),
value_next = shift( value, type = "lead", fill = 99999999 ) ),
by = .(groupid) ],
bin := i.id,
on = .( groupid, value >= value, value < value_next ) ][]
Upvotes: 2