zlu15
zlu15

Reputation: 55

Bin data within a group using breaks from another DF

How to avoid using the for loop in the following code to speed up the computation (the real data is about 1e6 times larger)

id = rep(1:5, 20)
v = 1:100
df = data.frame(groupid = id, value = v)
df = dplyr::arrange(df, groupid)

bkt = rep(seq(0, 100, length.out = 4), 5)
id = rep(1:5, each = 4)
bktpts = data.frame(groupid = id, value = bkt)

for (i in 1:5) {
  df[df$groupid == i, "bin"] = cut(df[df$groupid == i, "value"],
                                   bktpts[bktpts$groupid == i, "value"],
                                   include.lowest = TRUE, labels = F)
}

Upvotes: 3

Views: 64

Answers (3)

Frank
Frank

Reputation: 66819

Another way:

library(data.table)
setDT(df); setDT(bktpts)

bktpts[, b := rowid(groupid) - 1L]
df[, b := bktpts[copy(.SD), on=.(groupid, value), roll = -Inf, x.b]]

# check result
df[, any(b != bin)]
# [1] FALSE

See ?data.table for how rolling joins work.

Upvotes: 2

PavoDive
PavoDive

Reputation: 6486

I came out with another data.table answer:

library(data.table) # load package

# set to data.table
setDT(df) 
setDT(bktpts)

# Make a join
df[bktpts[, list(.(value)), by = groupid], bks := V1, on = "groupid"]

# define the bins:
df[, bin := cut(value, bks[[1]], include.lowest = TRUE, labels = FALSE), by = groupid]

# remove the unneeded bks column
df[, bks := NULL]

Explaining the code:

bktpts[, list(.(value)), by = groupid] is a new table that has in a list al the values of value for each groupid. If you run it alone, you'll understand where we're going.

bks := V1 assigns to variable bks in df whatever exists in V1, which is the name of the list column in the previous table. Of course on = "groupid" is the variable on which we make the join.

The code defining the bins needs little explanation, except by the bks[[1]] bit. It needs to be [[ in order to access the list values and provide a vector, as required by the cut function.

EDIT TO ADD:

All data.table commands can be chained in a -rather unintelligible- single call:

df[bktpts[, list(.(value)), by = groupid], 
   bks := V1, 
   on = "groupid"][, 
                bin := cut(value, 
                           bks[[1]], 
                           include.lowest = TRUE, 
                           labels = FALSE), 
                by = groupid][, 
                             bks := NULL]

Upvotes: 1

Wimpel
Wimpel

Reputation: 27732

I'm not sure why yout bktpts is formatted like it is?

But here is a data.table slution that should be (at least a bit) faster than your for-loop.

library( data.table )

setDT(df)[ setDT(bktpts)[, `:=`( id = seq_len(.N),
                                 value_next = shift( value, type = "lead", fill = 99999999 ) ),
                         by = .(groupid) ],
           bin := i.id,
           on = .( groupid, value >= value, value < value_next ) ][]

Upvotes: 2

Related Questions