McRacoon
McRacoon

Reputation: 25

Speed up creation and filling of large matrices

I have a dataframe of 5m observations of which a simplified version looks like this:

df <- data.frame(date=as.Date(c("2020-05-05","2020-05-05","2020-05-05")), buyer=c("A","B","C"), seller =c("B","A","D"),amount=c(1,4,2))

The above example reads like this: On date 2020-05-05, agent A buys 1 amount off agent B and so on.

In the dataset there are about 500k unique buyers and sellers on more than 800 different dates.

For each date I would like to create a nxn matrix which represents the daily change in inventory for agents who were trading. This daily calculated matrix should then be stored in a list. So for the above example, the result would be:

╔══════════════╗
║  A  B  C  D  ║
╠══════════════╣
║ A 0 -3  0  0 ║
║ B +3 0  0  0 ║
║ C 0  0  0  2 ║
║ D 0  0 -2  0 ║
╚══════════════╝

Agent A first buys 1 unit from agent B but then sells back 4 and therefore has -3.

My code looks like this:

library("tidyverse")

df <- data.frame(date=as.Date(c("2020-05-05","2020-05-05","2020-05-05")), buyer=as.character(c("A","B","C")), seller =as.character(c("B","A","D")),amount=c(1,4,2))

    daily_matrices <- list() #create empty list to store matrices
    dates <- unique(as.Date(df$date))

for (i in 1: length(dates)) { # loop over every date
  loop_date <- dates[i]
  daily_subset <- df %>% filter(date==loop_date) #filter data for each date
  daily_subset_long <- daily_subset %>%
    gather(key="var", value="agent",buyer,seller) 
  daily_agents <- distinct(daily_subset_long, agent) # find unique agents
  daily_pairs<-combn(daily_agents$agent,2) # find each possible pair
  ndim <- dim(daily_agents)[1]
  daily_matrices[[i]] <- matrix(data=0,nrow=ndim, ncol=ndim) #span matrix
  colnames(daily_matrices[[i]])<-daily_agents$agent #name columns with agents
  rownames(daily_matrices[[i]])<-daily_agents$agent #name rows with agents

  for (j in 1: dim(daily_pairs)[2]) { # for each possible pair call below function 
    trading_partner(daily_pairs[1,j],daily_pairs[2,j])
  }
  print(i) # just to track progress

}

trading_partner <-function(x,y) {
    agent_daily_subset <- daily_subset %>% filter(buyer== x & seller== y | buyer== y & seller== x) # filter trades for each pair

  agent_daily_subset_long <- agent_daily_subset %>%
    gather(key="var", value="agent",buyer,seller) 

  agent_daily_subset_long <- agent_daily_subset_long %>% group_by(agent) %>%
    mutate(delta_inventory = case_when(var =="buyer" ~ amount,
                                    var =="seller" ~ -amount)) # calculates change in inventory for each trade

  subgroup_inventory <- agent_daily_subset_long %>% group_by(agent) %>% summarise(inventory = sum(delta_inventory)) # summarisses change in inventory for each of the two agents in a pair
  if (dim(subgroup_inventory)[1] >0) { #if there has been a trade between the pair paste the inventory change in the list of matrices and find the correct row and column by the name of the agents

  daily_matrices[[i]][as.character(subgroup_inventory[1,1]),as.character(subgroup_inventory[2,1])] <<- as.double(subgroup_inventory[1,2])
  daily_matrices[[i]][as.character(subgroup_inventory[2,1]),as.character(subgroup_inventory[1,1])] <<- as.double(subgroup_inventory[2,2])


    }
  }

This is working as expected but I run into issues as the in the original dataset has roughly 1000 different agents per day and I therefore create huge matrices.

I am aware that using loops in R is not the preferred way in the first place but was not able to come up with another solution. The above code takes approximately 30 minutes for each daily matrix. With 800 days it would take over 2 weeks to finish without hiccup.

Is it possible to speed things up in a best practice way?

Upvotes: 0

Views: 39

Answers (1)

r2evans
r2evans

Reputation: 160597

Try this:

library(dplyr)
library(tidyr)
df %>%
  group_by(date) %>%
  do(bind_rows(., transmute(., date, b = buyer, buyer = seller, seller = b, amount = -amount) %>%
  select(-b))) %>%
  group_by(date, buyer, seller) %>%
  summarize(amount = sum(amount)) %>%
  group_by(date) %>%
  complete(buyer=c(buyer,seller), seller=c(buyer,seller), fill = list(amount = 0)) %>%
  ungroup() %>%
  pivot_wider(names_from = seller, values_from = amount, values_fill=list(amount=0))
# # A tibble: 4 x 6
#   date       buyer     A     B     C     D
#   <date>     <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020-05-05 A         0    -3     0     0
# 2 2020-05-05 B         3     0     0     0
# 3 2020-05-05 C         0     0     0     2
# 4 2020-05-05 D         0     0    -2     0

FYI: the recommended functions in tidyr for reshaping are now pivot_longer and pivot_wider; expand and gather are not yet deprecated, but the pivot_* functions have much more power.


At times, data.table can be faster and/or more memory-efficient. If you want to test this with your larger data.

Note: I am using tidyr::complete, since it does its job well. Since many of these operations are summarizing or expanding, data.table's referential semantics do not gain as much advantage, so I feel the cross-package use is not hurting us as much.

Also, I am breaking out each step using magrittr's %>% operator, familiar to the tidyverse. This is by no means required, but I think it can make for more readable code. You can likely shave a nanosecond or two off your execution if you convert from a magrittr pipe-flow to a data.table-only flow.

library(data.table)
library(tidyr)
library(magrittr)
DT <- as.data.table(df)
copy(DT) %>%
  .[, c("buyer", "seller", "amount") := .(seller, buyer, -amount) ] %>%
  list(., DT) %>%
  rbindlist(.) %>%
  .[, .(amount = sum(amount)), by = .(date, buyer, seller) ] %>%
  .[, tidyr::complete(.SD, buyer, seller, fill = list(amount = 0)), by = .(date) ] %>%
  dcast(date + buyer ~ seller, value.var = "amount")
#          date buyer A  B  C D
# 1: 2020-05-05     A 0 -3  0 0
# 2: 2020-05-05     B 3  0  0 0
# 3: 2020-05-05     C 0  0  0 2
# 4: 2020-05-05     D 0  0 -2 0

Traditional data.table flow without %>%:

tmp <- rbindlist(list(
  DT,
  copy(DT)[, c("buyer", "seller", "amount") := .(seller, buyer, -amount)]
))[ , .(amount = sum(amount)), by = .(date, buyer, seller)
   ][ , tidyr::complete(.SD, buyer, seller, fill = list(amount = 0)), by = .(date) ]
dcast(tmp, date + buyer ~ seller, value.var = "amount")

Upvotes: 2

Related Questions