Brian
Brian

Reputation: 45

Create new column of closest values according to another column by groups

I have the following data frame (called mydata_tsample):

cusip_id     trd_exctn_dt   trd_exctn_tm    price   contra_party_type **refPrice**
BUHADU       01.04.2016     01:10:50        101.00  C                 102.10
BUHADU       01.04.2016     02:10:50        101.50  C                 102.10    
BUHADU       01.04.2016     08:10:50        102.10  D                 102.10
BUHADU       01.04.2016     09:10:50        102.10  C                 102.10
BUHADU       02.04.2016     07:12:50        90.50   C                 90.85
BUHADU       02.04.2016     09:10:55        90.85   D                 90.85
BUHADU       02.04.2016     12:11:40        90.90   C                 91.00
BUHADU       02.04.2016     12:12:02        91.00   D                 91.00
XDSEOI       03.04.2016     06:52:51        50.00   D                 50.00 
XDSEOI       03.04.2016     08:40:58        50.20   C                 50.00  
XDSEOI       03.04.2016     15:10:51        51.00   C                 52.00
XDSEOI       03.04.2016     15:14:51        52.00   D                 52.00

I'd like to generate/add a new column (called refPrice) that is calculated using a for loop. For each row in column RefPrice, I'd like to extract the price with following conditions:

  1. same cusip_ID
  2. same trd_exctn_dt
  3. contra_party_type = D
  4. then take the closest price to trd_exctn_tm

I did a code that does exactly this:

for (i in 1:nrow(mydata_tsample)){
      Mtx_aftr_CUSIP=mydata_tsample[mydata_tsample$cusip_id %in% mydata_tsample[i,1],]
      Mtx_aftr_CUSIP_dt=Mtx_aftr_CUSIP[Mtx_aftr_CUSIP$trd_exctn_dt %in% mydata_tsample[i,2],]
      Mtx_aftr_CUSIP_dt_dealer=Mtx_aftr_CUSIP_dt[Mtx_aftr_CUSIP_dt$contra_party_type %in% "D",]
      if(nrow(Mtx_aftr_CUSIP_dt_dealer)==0) {next} else 
      {
        closesttime=which.min(abs(Mtx_aftr_CUSIP_dt_dealer$trd_exctn_tm - mydata_tsample[i,3]))
        mydata_tsample$RefPrice[i]=Mtx_aftr_CUSIP_dt_dealer[closesttime,4]  }
}

The problem I have is speed. I takes me a couple of hours to process 0.5Mio. lines. In total I have 5Mio. lines...

I tried with doParallel, but I did not work out.

library(doParallel)
registerDoParallel(cores=4)
library(foreach)
foreach(i=1:nrow(mydata_tsample)) %dopar% {
  Mtx_aftr_CUSIP=mydata_tsample[mydata_tsample$cusip_id %in% mydata_tsample[i,1],]
  Mtx_aftr_CUSIP_dt=Mtx_aftr_CUSIP[Mtx_aftr_CUSIP$trd_exctn_dt %in% mydata_tsample[i,2],]
  Mtx_aftr_CUSIP_dt_dealer=Mtx_aftr_CUSIP_dt[Mtx_aftr_CUSIP_dt$contra_party_type %in% "D",]
  if(nrow(Mtx_aftr_CUSIP_dt_dealer)==0) {next} else 
  {
    closesttime=which.min(abs(Mtx_aftr_CUSIP_dt_dealer$trd_exctn_tm - mydata_tsample[i,3]))
    mydata_tsample$RefPrice[i]=Mtx_aftr_CUSIP_dt_dealer[closesttime,4]
  }
}

}

Upvotes: 1

Views: 244

Answers (2)

Martin Schmelzer
Martin Schmelzer

Reputation: 23919

Here is a really fast solution using a rolling join with data.table which takes only a couple of ms on 500000 rows:

Data:

dt <- fread("cusip_id  trd_exctn_dt   trd_exctn_tm    price   contra_party_type
                 BUHADU   01.04.2016     01:10:50        101.00  C
                 BUHADU   01.04.2016     02:10:50        101.50  C    
                 BUHADU   01.04.2016     08:10:50        102.10  D
                 BUHADU   01.04.2016     09:10:50        102.10  C
                 BUHADU   02.04.2016     07:12:50        90.50   C
                 BUHADU   02.04.2016     09:10:55        90.85   D
                 BUHADU   02.04.2016     12:11:40        90.90   C
                 BUHADU   02.04.2016     12:12:02        91.00   D
                 XDSEOI   03.04.2016     06:52:51        50.00   D 
                 XDSEOI   03.04.2016     08:40:58        50.20   C  
                 XDSEOI   03.04.2016     15:10:51        51.00   C
                 XDSEOI   03.04.2016     15:14:51        52.00   D
                 XDSEOI   03.04.2016     23:59:00        58.00   D
                 XDSEOI   04.04.2016     01:00:00        52.00   C
                 XDSEOI   04.04.2016     15:14:51        55.00   D")

Code:

library(data.table)
library(lubridate)

# Convert trd_exctn_tm to number of seconds (or create a new column)
dt[, trd_exctn_tm := as.numeric(hms(x = trd_exctn_tm)),]

# set keys
setkey(dt, cusip_id, trd_exctn_dt, trd_exctn_tm)

# keep rollin rollin rollin...
dt[contra_party_type == "D", .(cusip_id, trd_exctn_dt, trd_exctn_tm, RefPrice=price),][dt,, roll = "nearest"]

Output:

    cusip_id trd_exctn_dt trd_exctn_tm RefPrice  price contra_party_type
 1:   BUHADU   01.04.2016         4250   102.10 101.00                 C
 2:   BUHADU   01.04.2016         7850   102.10 101.50                 C
 3:   BUHADU   01.04.2016        29450   102.10 102.10                 D
 4:   BUHADU   01.04.2016        33050   102.10 102.10                 C
 5:   BUHADU   02.04.2016        25970    90.85  90.50                 C
 6:   BUHADU   02.04.2016        33055    90.85  90.85                 D
 7:   BUHADU   02.04.2016        43900    91.00  90.90                 C
 8:   BUHADU   02.04.2016        43922    91.00  91.00                 D
 9:   XDSEOI   03.04.2016        24771    50.00  50.00                 D
10:   XDSEOI   03.04.2016        31258    50.00  50.20                 C
11:   XDSEOI   03.04.2016        54651    52.00  51.00                 C
12:   XDSEOI   03.04.2016        54891    52.00  52.00                 D
13:   XDSEOI   03.04.2016        86340    58.00  58.00                 D
14:   XDSEOI   04.04.2016         3600    55.00  52.00                 C
15:   XDSEOI   04.04.2016        54891    55.00  55.00                 D

Explanations:

The first part of our data.table operation

dt[contra_party_type == "D", .(cusip_id, trd_exctn_dt, trd_exctn_tm, RefPrice=price),]

can be translated to

Take dt, subset rows where contra... == "D", select columns cusip_id, ..., and RefPrice which equals price.

So this data.table looks like

   cusip_id trd_exctn_dt trd_exctn_tm RefPrice
1:   BUHADU   01.04.2016     08:10:50   102.10
2:   BUHADU   02.04.2016     09:10:55    90.85
3:   BUHADU   02.04.2016     12:12:02    91.00
4:   XDSEOI   03.04.2016     06:52:51    50.00
5:   XDSEOI   03.04.2016     15:14:51    52.00
6:   XDSEOI   03.04.2016     23:59:00    58.00
7:   XDSEOI   04.04.2016     15:14:51    55.00

Saving it as dt2 and setting the same keys with setkey(dt, cusip_id, trd_exctn_dt, trd_exctn_tm), we can go over to the second part of our command:

dt2[dt,, roll = "nearest"]

For the purpose of understanding change it to

dt2[dt,,] 

and look at the result. You can see that we joined both tables by our key columns. RefPrice was added to dt. But there are NAs in RefPrice, because these rows were not found in dt2. To get rid of these NAs we use roll = "nearest", meaning take the closest value of RefPrice in dt2 according to trd_exctn_tm and fill these rows.

Upvotes: 3

user3603486
user3603486

Reputation:

Here is a simple partial solution that runs in seconds, and gets the nearest previous price where contra_party_type=="D".

# generate toy data:
library(dplyr)
library(zoo)
n <- 500000
dfr <- dplyr::tibble(
  cusip_id = sample(LETTERS, n, replace = TRUE),
  trd_exctn_dt = as.Date(sample(365, n, replace = TRUE), 
    origin = "2016-01-01"),
  trd_exctn_tm = strftime(as.POSIXlt(sample(60*60*24, n, replace = TRUE),
    origin = "1970-01-01"), "%H:%M:%S"),
  price = round(rnorm(n, 100, 5), 2),
  contra_party_type = sample(LETTERS[1:4], n, replace = TRUE)
)


dfr <- dfr %>% 
      group_by(cusip_id, trd_exctn_dt) %>% 
      arrange(trd_exctn_tm, .by_group = TRUE) %>% 
      mutate(
        refprice = ifelse(contra_party_type == "D", price, NA),
        refprice = zoo::na.locf(refprice, na.rm = FALSE)
      )
dfr

# A tibble: 500,000 x 6
# Groups:   cusip_id, trd_exctn_dt [9,490]
   cusip_id trd_exctn_dt trd_exctn_tm price contra_party_type refprice
   <chr>    <date>       <chr>        <dbl> <chr>                <dbl>
 1 A        2016-01-02   00:25:47      89.6 D                     89.6
 2 A        2016-01-02   01:19:37     101.  B                     89.6
 3 A        2016-01-02   01:22:34     108.  B                     89.6
 4 A        2016-01-02   01:28:14     102.  D                    102. 
 5 A        2016-01-02   01:35:36      95.9 A                    102. 
 6 A        2016-01-02   01:45:01     102.  C                    102. 

To do exactly what you want, I would

  • calculate the time difference to the last previous instance where cpt is D
  • calculate the time difference to the next future instance where cpt is D
  • calculate the prices for each of those instances
  • choose the price based on the closest time difference, using ifelse

Upvotes: 0

Related Questions