ironv
ironv

Reputation: 1058

speeding up a data.table cartesian product with filtering

Given a data.table with three fields member_id, provider_id and srvc_dt. I need to compute the number of distinct members seen by a pair of providers. A member is said to have visited two providers if the visits have occurred within 180 days. This is used to construct an undirected graph using a threshold for #visits and look for connected components.

I am using the method suggested in Cartesian product with filter data.table.

The instance that I need to run has over 3 million records and it is taking over 5 mins to run. Is there a way of re-writing or a new data.table function so that it runs faster?

require(data.table)

nmem <- 5000
data.dt <- data.table(member_id=sample(10000:1000000,nmem,replace=TRUE), provider_id=sample(1000:2000,nmem,replace=TRUE),
    srvc_dt=sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), nmem, replace=TRUE))
setkey(data.dt, member_id)

prov_pair.dt <- data.dt[data.dt, {
        idx = provider_id<i.provider_id & abs(srvc_dt-i.srvc_dt)<180
        list(provider_id1 = provider_id[idx], 
            srvc_dt1 = srvc_dt[idx],
            provider_id2 = i.provider_id[any(idx)],
            srvc_dt2 = i.srvc_dt[any(idx)]
        )
    }, by=.EACHI, allow=TRUE]

prov_pair_agg.dt <- prov_pair.dt[, .(weight=length(unique(member_id))), .(provider_id1,provider_id2)]

Upvotes: 4

Views: 330

Answers (2)

eddi
eddi

Reputation: 49448

First, filter out the data to only include members that have seen more than one provider:

res = data.dt[, if (.N >= 2) .SD, by = member_id]

Then, add the endpoints for the dates, and a copy of the provider column:

res[, `:=`(start.date = srvc_dt - 180,
           end.date   = srvc_dt + 180,
           provider2  = provider_id)]

Finally, use the new non-equi joins, available in the development version:

res[res, on = .(member_id = member_id, provider2 < provider_id,
                srvc_dt < end.date, srvc_dt > start.date)
    , allow = T, nomatch = 0][, .N, by = .(provider1 = provider_id, provider2)]

One note - the names of the columns in the above join are currently unfortunately a little confusing, and hopefully that will get sorted out soon. You can add extra copies of the columns to see what exactly is going on if the above is too unclear.

Upvotes: 0

webb
webb

Reputation: 4340

simple left-join followed by filtering:

prov_pair.dt <- data.dt[data.dt,allow.cartesian=T][provider_id<i.provider_id &
  abs(srvc_dt-i.srvc_dt)<180,]

provider_id<i.provider_id prevents double-counting the same visits to x,y and y,x.

also, now use provider_id and i.provider_id instead of provider_id1 and provider_id2 in computing prov_pair_agg.dt:

prov_pair_agg.dt <- prov_pair.dt[, .(weight=length(unique(member_id))), 
  .(provider_id,i.provider_id)]

on a 16gb memory machine with nmem=1,000,000, this takes 1.487s vs 106.034s by your current method.

Upvotes: 2

Related Questions