Synthesize an integer key for a data.table

Question

I have a data.table with data grouped by 2 columns key1 and key2. What I would like to achieve is to get a single integer key enumerating over all present pairs of (key1, key2). E.g. to a dataset that looks like this:

    key1  key2  data
    A     1     ...
    A     1     ...
    A     2     ...
    C     1     ...
    B     1     ...
    B     1     ...
    C     1     ...

I would like to have an id column added that would enumerate things:

  id  key1  key2    data
  1     A     1     ...
  1     A     1     ...
  2     A     2     ...
  3     C     1     ...
  4     B     1     ...
  4     B     1     ...
  3     C     1     ...

I don't care the order in which ids are distributed, I only need to be able to subset the table based on single number, not a pair of key columns. How can this be achieved?

MichaelChirico · Accepted Answer

There are a couple of options.

You could do either:

dt[unique(dt,by=c("key1","key2"))[,id:=.I],id:=i.id,on=c("key1","key2")]

Or (as gracefully pointed out by @Khashaa in the comments; in fact, this is probably the way favored by @Arun, given the function was created for cases just like this)

setkey(dt,key1,key2)[,id:=rleid(key1,key2)]

Note that the output from the second will be different from the first, in general--the latter will be sorted by key1 and key2; this isn't necessarily the case for the former. This may or may not be more efficient, depending on your data. If you want to force the first case to be sorted, simply amend to [order(key1,key2),id:=.I]

Synthesize an integer key for a data.table

Answers (1)

Related Questions