Reputation: 25
I have a data frame of 5 million observations, a simplified version of which looks like this:
df <- data.frame(date=as.Date(c("2020-05-05","2020-05-05","2020-05-05")), buyer=c("A","B","C"), seller =c("B","A","D"),amount=c(1,4,2))
The example above reads like this: on 2020-05-05, agent A buys 1 unit from agent B, and so on.
In the dataset there are about 500k unique buyers and sellers on more than 800 different dates.
For each date I would like to create an n x n matrix representing the daily change in inventory for the agents who traded that day. Each daily matrix should then be stored in a list. For the example above, the result would be:
╔════════════════════╗
║      A   B   C   D ║
╠════════════════════╣
║ A    0  -3   0   0 ║
║ B   +3   0   0   0 ║
║ C    0   0   0   2 ║
║ D    0   0  -2   0 ║
╚════════════════════╝
Agent A first buys 1 unit from agent B but then sells back 4 and therefore has -3.
My code looks like this:
library("tidyverse")
df <- data.frame(date=as.Date(c("2020-05-05","2020-05-05","2020-05-05")), buyer=as.character(c("A","B","C")), seller =as.character(c("B","A","D")),amount=c(1,4,2))
daily_matrices <- list() #create empty list to store matrices
dates <- unique(as.Date(df$date))
for (i in 1: length(dates)) { # loop over every date
loop_date <- dates[i]
daily_subset <- df %>% filter(date==loop_date) #filter data for each date
daily_subset_long <- daily_subset %>%
gather(key="var", value="agent",buyer,seller)
daily_agents <- distinct(daily_subset_long, agent) # find unique agents
daily_pairs<-combn(daily_agents$agent,2) # find each possible pair
ndim <- dim(daily_agents)[1]
daily_matrices[[i]] <- matrix(data=0,nrow=ndim, ncol=ndim) #span matrix
colnames(daily_matrices[[i]])<-daily_agents$agent #name columns with agents
rownames(daily_matrices[[i]])<-daily_agents$agent #name rows with agents
for (j in 1: dim(daily_pairs)[2]) { # for each possible pair call below function
trading_partner(daily_pairs[1,j],daily_pairs[2,j])
}
print(i) # just to track progress
}
trading_partner <-function(x,y) {
agent_daily_subset <- daily_subset %>% filter(buyer== x & seller== y | buyer== y & seller== x) # filter trades for each pair
agent_daily_subset_long <- agent_daily_subset %>%
gather(key="var", value="agent",buyer,seller)
agent_daily_subset_long <- agent_daily_subset_long %>% group_by(agent) %>%
mutate(delta_inventory = case_when(var =="buyer" ~ amount,
var =="seller" ~ -amount)) # calculates change in inventory for each trade
subgroup_inventory <- agent_daily_subset_long %>% group_by(agent) %>% summarise(inventory = sum(delta_inventory)) # summarisses change in inventory for each of the two agents in a pair
if (dim(subgroup_inventory)[1] >0) { #if there has been a trade between the pair paste the inventory change in the list of matrices and find the correct row and column by the name of the agents
daily_matrices[[i]][as.character(subgroup_inventory[1,1]),as.character(subgroup_inventory[2,1])] <<- as.double(subgroup_inventory[1,2])
daily_matrices[[i]][as.character(subgroup_inventory[2,1]),as.character(subgroup_inventory[1,1])] <<- as.double(subgroup_inventory[2,2])
}
}
This works as expected, but I run into issues because the original dataset has roughly 1,000 different agents per day, so I end up creating huge matrices.
I am aware that loops are not the preferred approach in R in the first place, but I was not able to come up with another solution. The code above takes approximately 30 minutes per daily matrix; with 800 days it would take over 2 weeks to finish without a hiccup.
Is it possible to speed things up in a best practice way?
Upvotes: 0
Views: 39
Reputation: 160597
Try this:
library(dplyr)
library(tidyr)
df %>%
  group_by(date) %>%
  do(bind_rows(., transmute(., date, b = buyer, buyer = seller, seller = b, amount = -amount) %>%
                 select(-b))) %>%
  group_by(date, buyer, seller) %>%
  summarize(amount = sum(amount)) %>%
  group_by(date) %>%
  complete(buyer = c(buyer, seller), seller = c(buyer, seller), fill = list(amount = 0)) %>%
  ungroup() %>%
  pivot_wider(names_from = seller, values_from = amount, values_fill = list(amount = 0))
# # A tibble: 4 x 6
#   date       buyer     A     B     C     D
#   <date>     <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020-05-05 A         0    -3     0     0
# 2 2020-05-05 B         3     0     0     0
# 3 2020-05-05 C         0     0     0     2
# 4 2020-05-05 D         0     0    -2     0
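If you still want one matrix per date stored in a list, as described in the question, the wide result can be split afterwards. A minimal sketch, assuming the output of the pipeline above has been assigned to a variable called wide (a name introduced here, not part of the answer); the same split also works on the data.table result further down:
daily_matrices <- lapply(split(wide, wide$date), function(d) {
  m <- as.matrix(as.data.frame(d)[, d$buyer, drop = FALSE]) # this date's agents as columns
  rownames(m) <- d$buyer                                    # and as row names
  m
})
daily_matrices[["2020-05-05"]] # the 4 x 4 matrix shown in the question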
FYI: the recommended functions in tidyr for reshaping are now pivot_longer and pivot_wider; spread and gather are not yet deprecated, but the pivot_* functions have much more power.
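For example, the gather() call in the question can be written with pivot_longer() like this (a sketch reusing the question's daily_subset and column names):
# same reshape as: gather(daily_subset, key = "var", value = "agent", buyer, seller)
pivot_longer(daily_subset, cols = c(buyer, seller),
             names_to = "var", values_to = "agent")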
At times, data.table can be faster and/or more memory-efficient. If you want to test this with your larger data, here is a data.table version of the same approach.
Note: I am using tidyr::complete, since it does its job well. Since many of these operations are summarizing or expanding, data.table's referential semantics do not gain as much advantage, so I feel the cross-package use is not hurting us as much.
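To show what complete contributes here, a small standalone example on made-up toy data: it adds every missing buyer/seller combination with a zero amount, which is what later makes the cast table square.
library(tidyr)
toy <- data.frame(buyer = c("A", "C"), seller = c("B", "D"), amount = c(-3, 2))
complete(toy, buyer = c(buyer, seller), seller = c(buyer, seller),
         fill = list(amount = 0))
# 16 rows: every buyer/seller combination of A, B, C, D, with amount 0 where no trade occurred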
Also, I am breaking out each step using magrittr's %>% operator, familiar to the tidyverse. This is by no means required, but I think it can make for more readable code. You can likely shave a nanosecond or two off your execution if you convert from a magrittr pipe-flow to a data.table-only flow.
library(data.table)
library(tidyr)
library(magrittr)
DT <- as.data.table(df)
copy(DT) %>%
  .[, c("buyer", "seller", "amount") := .(seller, buyer, -amount)] %>%
  list(., DT) %>%
  rbindlist(.) %>%
  .[, .(amount = sum(amount)), by = .(date, buyer, seller)] %>%
  .[, tidyr::complete(.SD, buyer, seller, fill = list(amount = 0)), by = .(date)] %>%
  dcast(date + buyer ~ seller, value.var = "amount")
#          date buyer  A  B  C  D
# 1: 2020-05-05     A  0 -3  0  0
# 2: 2020-05-05     B  3  0  0  0
# 3: 2020-05-05     C  0  0  0  2
# 4: 2020-05-05     D  0  0 -2  0
Traditional data.table flow without %>%:
tmp <- rbindlist(list(
  DT,
  copy(DT)[, c("buyer", "seller", "amount") := .(seller, buyer, -amount)]
))[, .(amount = sum(amount)), by = .(date, buyer, seller)
  ][, tidyr::complete(.SD, buyer, seller, fill = list(amount = 0)), by = .(date)]
dcast(tmp, date + buyer ~ seller, value.var = "amount")
Upvotes: 2