Reputation: 1058
Given a data.table
with three fields member_id
, provider_id
and srvc_dt
. I need to compute the number of distinct members seen by a pair of providers. A member is said to have visited two providers if the visits have occurred within 180 days. This is used to construct an undirected graph using a threshold for #visits and look for connected components.
I am using the method suggested in Cartesian product with filter data.table.
The instance that I need to run has over 3 million records and it is taking over 5 mins to run. Is there a way of re-writing or a new data.table function so that it runs faster?
require(data.table)
nmem <- 5000
data.dt <- data.table(member_id=sample(10000:1000000,nmem,replace=TRUE), provider_id=sample(1000:2000,nmem,replace=TRUE),
srvc_dt=sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), nmem, replace=TRUE))
setkey(data.dt, member_id)
prov_pair.dt <- data.dt[data.dt, {
idx = provider_id<i.provider_id & abs(srvc_dt-i.srvc_dt)<180
list(provider_id1 = provider_id[idx],
srvc_dt1 = srvc_dt[idx],
provider_id2 = i.provider_id[any(idx)],
srvc_dt2 = i.srvc_dt[any(idx)]
)
}, by=.EACHI, allow=TRUE]
prov_pair_agg.dt <- prov_pair.dt[, .(weight=length(unique(member_id))), .(provider_id1,provider_id2)]
Upvotes: 4
Views: 330
Reputation: 49448
First, filter out the data to only include members that have seen more than one provider:
res = data.dt[, if (.N >= 2) .SD, by = member_id]
Then, add the endpoints for the dates, and a copy of the provider column:
res[, `:=`(start.date = srvc_dt - 180,
end.date = srvc_dt + 180,
provider2 = provider_id)]
Finally, use the new non-equi joins, available in the development version:
res[res, on = .(member_id = member_id, provider2 < provider_id,
srvc_dt < end.date, srvc_dt > start.date)
, allow = T, nomatch = 0][, .N, by = .(provider1 = provider_id, provider2)]
One note - the names of the columns in the above join are currently unfortunately a little confusing, and hopefully that will get sorted out soon. You can add extra copies of the columns to see what exactly is going on if the above is too unclear.
Upvotes: 0
Reputation: 4340
simple left-join followed by filtering:
prov_pair.dt <- data.dt[data.dt,allow.cartesian=T][provider_id<i.provider_id &
abs(srvc_dt-i.srvc_dt)<180,]
provider_id<i.provider_id
prevents double-counting the same visits to x,y and y,x.
also, now use provider_id
and i.provider_id
instead of provider_id1
and provider_id2
in computing prov_pair_agg.dt
:
prov_pair_agg.dt <- prov_pair.dt[, .(weight=length(unique(member_id))),
.(provider_id,i.provider_id)]
on a 16gb memory machine with nmem=1,000,000, this takes 1.487s vs 106.034s by your current method.
Upvotes: 2