Reputation: 79

How to address the value in the next row

I have a data set of equity returns for some thousand companies for the past 30 years. Some of these companies are "DEAD" (usually delisted or bankrupt), and therefore they have returns = 0. I want to assign NAs to the returns of these companies, but only after the time they actually "died". In order to do so, I have tried to use the following code:

if(Returns$r == 0 & stri_detect_fixed(Returns$Company, "DEAD"), na.rm = TRUE){
  Returns$r[Returns$r == 0 & stri_detect_fixed(Returns$Company, "DEAD")] <- NA
}

This works pretty well, but unfortunately there are on occasion return values equal to 0 for DEAD/delisted companies even before they "died", and these values I want to remain at 0.

Therefore, what I need is a command/if condition telling R that I only want to return NAs if the return in the next row is equal to 0 as well. Do you guys have any suggestions? I hope I made my problem clear, though I do know my explanation may be a bit confusing.

Reproducible example

Returns <- structure(list(Date = c("04.09.17", "05.09.17", "06.09.17", "01.09.17", 
"02.09.17", "03.09.17", "04.09.17", "05.09.17", "06.09.17", "04.09.17", 
"05.09.17", "06.09.17"), Company = c("ORKLA", "ORKLA", "ORKLA", 
"VISMA DEAD 04.09.17", "VISMA DEAD 04.09.17", "VISMA DEAD 04.09.17", 
"VISMA DEAD 04.09.17", "VISMA DEAD 04.09.17", "VISMA DEAD 04.09.17", 
"XNEWCO", "XNEWCO", "XNEWCO"), r = c(0.04, 0, -0.02, 0.01, 0, -0.03, 
0, 0, 0, 0.01, 0, 0)), .Names = c("Date", "Company", "r"), row.names = c(NA, 
-12L), class = "data.frame")

(Edited to cover the case where a "live" company has zero returns at the end of the time series)

My data frame Returns looks something like this:

Date       Company                r
04.09.17   ORKLA                  0.04
05.09.17   ORKLA                  0.00
06.09.17   ORKLA                  -0.02
01.09.17   VISMA DEAD 04.09.17    0.01
02.09.17   VISMA DEAD 04.09.17    0.00
03.09.17   VISMA DEAD 04.09.17    -0.03
04.09.17   VISMA DEAD 04.09.17    0.00
05.09.17   VISMA DEAD 04.09.17    0.00
06.09.17   VISMA DEAD 04.09.17    0.00 
04.09.17   XNEWCO                 0.01
05.09.17   XNEWCO                 0.00
06.09.17   XNEWCO                 0.00

I would want it to be like this:

Date       Company                r
04.09.17   ORKLA                  0.04
05.09.17   ORKLA                  0.00
06.09.17   ORKLA                  -0.02
01.09.17   VISMA DEAD 04.09.17    0.01
02.09.17   VISMA DEAD 04.09.17    0.00
03.09.17   VISMA DEAD 04.09.17    -0.03
04.09.17   VISMA DEAD 04.09.17    NA
05.09.17   VISMA DEAD 04.09.17    NA
06.09.17   VISMA DEAD 04.09.17    NA
04.09.17   XNEWCO                 0.01
05.09.17   XNEWCO                 0.00
06.09.17   XNEWCO                 0.00

My current code (as you'll see above) wouldn't work, as it would replace the return of 0.00 for VISMA 02.09.17 with NA. I need it to remain 0.00, as this is before VISMA "died"

Upvotes: 4

Answers (3)

Uwe

Reputation: 42544

Edit: While preparing a benchmark, I noticed that a condition was missing to prevent the replacement of zero values at the end of a time series also for companies which are alive. Unfortunately, this case was not covered by the original sample data provided by the OP (before editing) so it went undiscovered. I have amended below solutions accordingly.

According to OP's words I have a data set of equity returns for some thousand companies for the past 30 years, the data set may contain several millions rows (a conservative estimate: 250 business days per year * 2000 companies * 5 years of average existence = 2.5 M rows))

Thus, we need to replace a few values without copying the whole data set. data.table allows us to update data in place.

The OP has requested to find all consecutive sequences of zeros at the end of the time series of each company and to replace these zeros by NA.

With data.table, there are two options here:

Using the `rleid()` function

library(data.table)
# coerce to data.table
setDT(Returns)
# convert character dates
Returns[, Date := as.IDate(Date, "%d.%m.%y")][]
# make sure data is ordered
setorder(Returns, Company, Date)[]

Returns[, Company %like% "DEAD" & r == 0 & rleid(r == 0) == max(rleid(r == 0)), 
        by = Company]

For each dead company, the last sequence of zero values is picked:

                Company    V1
 1:               ORKLA FALSE
 2:               ORKLA FALSE
 3:               ORKLA FALSE
 4: VISMA DEAD 04.09.17 FALSE
 5: VISMA DEAD 04.09.17 FALSE
 6: VISMA DEAD 04.09.17 FALSE
 7: VISMA DEAD 04.09.17  TRUE
 8: VISMA DEAD 04.09.17  TRUE
 9: VISMA DEAD 04.09.17  TRUE
10:              XNEWCO FALSE
11:              XNEWCO FALSE
12:              XNEWCO FALSE

The V1 column is used to subset and update DT in place:

Returns[Returns[,  Company %like% "DEAD" & r == 0 & rleid(r == 0) == max(rleid(r == 0)), 
                by = Company]$V1, r := NA_real_][]

          Date             Company     r
 1: 2017-09-04               ORKLA  0.04
 2: 2017-09-05               ORKLA  0.00
 3: 2017-09-06               ORKLA -0.02
 4: 2017-09-01 VISMA DEAD 04.09.17  0.01
 5: 2017-09-02 VISMA DEAD 04.09.17  0.00
 6: 2017-09-03 VISMA DEAD 04.09.17 -0.03
 7: 2017-09-04 VISMA DEAD 04.09.17    NA
 8: 2017-09-05 VISMA DEAD 04.09.17    NA
 9: 2017-09-06 VISMA DEAD 04.09.17    NA
10: 2017-09-04              XNEWCO  0.01
11: 2017-09-05              XNEWCO  0.00
12: 2017-09-06              XNEWCO  0.00

Finding the index of the last non-zero value

Returns[, {tmp <- last(which(r != 0)) 
           if (Company %like% "DEAD" & tmp < .N) .I[seq.int(tmp + 1L, .N)]}, by = Company]

Here, the position of the last non-zero value of each time series is picked which is used to construct the indices for the remaining zero values for dead companies. .I and .N are special symbols in data.table syntax. The check if (Company %like% "DEAD" & tmp < .N) is required in case there are no zero values at the end of a time series of a dead company.

               Company V1
1: VISMA DEAD 04.09.17  7
2: VISMA DEAD 04.09.17  8
3: VISMA DEAD 04.09.17  9

As above, V1 is used to subset and update Returns in place:

Returns[Returns[, {tmp <- last(which(r != 0))
                   if (Company %like% "DEAD" & tmp < .N) .I[seq.int(tmp + 1L, .N)]}, 
                by = Company]$V1, r := NA_real_][]

          Date             Company     r
 1: 2017-09-04               ORKLA  0.04
 2: 2017-09-05               ORKLA  0.00
 3: 2017-09-06               ORKLA -0.02
 4: 2017-09-01 VISMA DEAD 04.09.17  0.01
 5: 2017-09-02 VISMA DEAD 04.09.17  0.00
 6: 2017-09-03 VISMA DEAD 04.09.17 -0.03
 7: 2017-09-04 VISMA DEAD 04.09.17    NA
 8: 2017-09-05 VISMA DEAD 04.09.17    NA
 9: 2017-09-06 VISMA DEAD 04.09.17    NA
10: 2017-09-04              XNEWCO  0.01
11: 2017-09-05              XNEWCO  0.00
12: 2017-09-06              XNEWCO  0.00

Benchmark

Hack-R claimed that his solutions should perform well for up to perhaps 1M rows. So, I wanted to verify that claim with a benchmark.

Creating benchmark data

library(data.table)

# create benchmark data
n_days <- 100L
n_comp <- 100L
n_dead <- round(0.1 * n_comp) # 10 per cent of companies are dead
Date <- seq(from = as.IDate("2015-01-01"), length.out = n_days, by = "1 day")
# company "names" consist of 4 digits at least
Company <- sprintf("%04i", seq_len(n_comp)) 

# cross join to create all combinations
Returns <- CJ(Date = Date, Company = Company)

set.seed(1L) # reuired for reproducible result
Returns[, r := round(rnorm(.N)/10.0, 2L)][]

# dead companies
dead <- data.table(Company = sample(Company, n_dead),
                   dead.date = sample(Date, n_dead))
# modify Returns
Returns[dead, on = .(Company, Date >= dead.date), r := 0]
# modify compay names
Returns[dead, on = "Company", Company := paste(Company, "DEAD", dead.date)]

# IMPORTANT: set order
setorder(Returns, Company, Date)
# keep original version
R0 <- copy(Returns)

Benchmark code

microbenchmark::microbenchmark(
  copy = Returns <- copy(R0),
  hackr1 = {
    mydat <- setDF(copy(R0))
    for(i in 1:nrow(mydat)){
      if(i==nrow(mydat) & mydat$r[i]==0) {mydat$r[i] <- NA
      } else if(!is.na(mydat$r[i]) & mydat$r[i]==0 & mydat$r[i+1]==0) mydat$r[i] <- NA
    }
    res_hackr1 <- mydat
  },
  hackr2 = {
    mydat <- copy(R0)
    tmp0 <- mydat[0,]
    for(c in unique(mydat$Company)){
      tmp <- mydat[mydat$Company==c,]
      for(i in 1:nrow(tmp)){
        if(!is.na(tmp$r[i]) & tmp$r[i]==0 & tmp$r[i+1]==0){
          tmp$r[i:nrow(tmp)] <- NA
        }
      }
      tmp0 <- rbind(tmp0, tmp)
    }
    res_hackr2 <- tmp0
  },
  dt_rleid1 = {
    Returns <- copy(R0)
    Returns[Returns[,  Company %like% "DEAD" & r == 0 & rleid(r == 0) == max(rleid(r == 0)), 
                    by = Company]$V1, r := NA_real_]
    res_dt_rleid1 <- copy(Returns)
    },
  dt_rleid2 = {
    Returns <- copy(R0)
    Returns[Company %like% "DEAD" & Returns[,  r == 0 & rleid(r == 0) == max(rleid(r == 0)), 
                    by = Company]$V1, r := NA_real_]
    res_dt_rleid2 <- copy(Returns)
  },
  dt_last = {
    Returns <- copy(R0)
    Returns[Returns[, {
      tmp <- last(which(r != 0))
      if (Company %like% "DEAD") .I[tmp + seq_len(.N - tmp)]
    }, 
    by = Company]$V1, r := NA_real_]
    res_dt_last <- copy(Returns)
  },
  dt_last2 = {
    Returns <- copy(R0)
    Returns[Returns[, {
      tmp <- last(which(r != 0))
      if (Company %like% "DEAD" & tmp < .N) .I[seq.int(tmp + 1L, .N)]
    }, 
    by = Company]$V1, r := NA_real_]
    res_dt_last2 <- copy(Returns)
  },
  times = 11L
)

As the codes modify the data set in place, copy() is used to create a "fresh" unmodified data set before each run and also to store the result for later comparison. Therefore, copy() is timed as well.

dt_rleid1 and dt_rleid2 as well as dt_last and dt_last2 are code variations of the respective solutions.

Benchmark results

Unfortunately, hackr2 stopped execution with the error message:

Error in if (!is.na(tmp$r[i]) & tmp$r[i] == 0 & tmp$r[i + 1] == 0) { :
missing value where TRUE/FALSE needed

The timings for the remaining solutions are:

Unit: microseconds
      expr        min         lq         mean     median          uq        max neval cld
      copy     46.065     48.331     53.75427     52.485     58.1475     66.077    11  a 
    hackr1 267515.143 269559.179 277240.15827 271093.857 275196.8435 329919.874    11   b
 dt_rleid1   2203.942   2404.060   3130.73218   2690.267   3728.9925   4813.783    11  a 
 dt_rleid2   2577.370   2665.346   5750.63073   2700.839   2741.0510  36395.429    11  a 
   dt_last   1605.098   1627.564   1718.85318   1654.561   1724.6030   2036.296    11  a 
  dt_last2   1665.134   1718.372   1945.67645   1764.438   1769.5350   3909.476    11  a

The data.table solutions are two magnitudes faster than Hack-R's approach for a rather small problem size of 100 x 100 = 10 K rows. I tried to run Hack-R's solution for 1000 x 1000 = 1 M rows but I had not the patience to wait for the result.

For 1 M rows, the approach finding the last non-zero value is about 5 to 6 times faster the the rleid() approach.

Unit: milliseconds
      expr        min         lq      mean     median        uq      max neval cld
      copy   6.602008   6.843094  21.23383   7.297889  13.61614 141.5794    11 a  
 dt_rleid1  63.282609  70.239165 142.21568 193.972143 199.32077 224.5657    11  b 
 dt_rleid2 157.939571 281.185658 266.62148 288.184692 291.61445 309.5796    11   c
   dt_last  35.826792  39.198781 101.66298  48.387030 172.40187 182.2354    11  b 
  dt_last2  36.507194  43.754676 103.95414  48.879018 173.66035 183.1639    11  b

Upvotes: 2

Hack-R

Reputation: 23214

This can be done by modifying the data in place:

# Please use dput() or a reproducible way of sharing your data

mydat <-
read.table(text="Date       Company                r
           '04.09.17'   ORKLA                  0.04
           '05.09.17'   ORKLA                  0.00
           '06.09.17'   ORKLA                  -0.02
           '01.09.17'   VISMA    0.01
           '02.09.17'   VISMA    0.00
           '03.09.17'   VISMA    -0.03
           '04.09.17'   VISMA    0.00
           '05.09.17'   VISMA    0.00
           '06.09.17'   VISMA    0.00",header=T)

for(i in 1:nrow(mydat)){
  if(i==nrow(mydat) & mydat$r[i]==0) {mydat$r[i] <- NA
  } else if(!is.na(mydat$r[i]) & mydat$r[i]==0 & mydat$r[i+1]==0) mydat$r[i] <- NA
}

      Date Company     r
1 04.09.17   ORKLA  0.04
2 05.09.17   ORKLA  0.00
3 06.09.17   ORKLA -0.02
4 01.09.17   VISMA  0.01
5 02.09.17   VISMA  0.00
6 03.09.17   VISMA -0.03
7 04.09.17   VISMA    NA
8 05.09.17   VISMA    NA
9 06.09.17   VISMA    NA

The logic says this:

If the value r is not already NA (which we have to check because you can't do a logical evaluation on an NA) and the value is 0 now and in the next row, then the company is dead, so make r = NA.

If this is the last row of the data set and the value is 0 then I can't look in the future, so I'll assume it's dead. Change NA to 0 to reverse this assumption. We can also add some company-level logic to improve this, if we like:

# Same result as above, but handles the last row better by considering company

tmp0 <- mydat[0,]
for(c in unique(mydat$Company)){
  tmp <- mydat[mydat$Company==c,]
  for(i in 1:nrow(tmp)){
    if(!is.na(tmp$r[i]) & tmp$r[i]==0 & tmp$r[i+1]==0){
      tmp$r[i:nrow(tmp)] <- NA
    } 
  }
  tmp0 <- rbind(tmp0, tmp) 

}
tmp0

I like the logic of the 2nd way marginally better, but both should work and should perform well for up to perhaps 1M rows. If you want to go beyond that we can just sapply the same logic instead of using a loop, and/or use any number of big data types like tibble or data.table.

Upvotes: 0

Katerina

Reputation: 139

There might be a simpler solution, but I take it step by step the long way without a loop / function.

library( data.table )
library( stringr )


# Create a dummy variable **status_delisting** to show if the company is dead. 
df$status_delisting = ifelse( grepl( "DEAD", df$Company ), 1, 0 )

# Find names with numbers in it, check if the numbers are dates and convert to format. Sometimes the company has numbers in the name. 
df$Company = as.character( df$Company )
check_values = c( unique( df$Company ) )
setDT(check_values)
names( check_values ) = "check_memo"

# You might need this as well. 
# Sys.getlocale()
# Sys.setlocale(locale="C")
# Check if there are dates in the name
# The date format we check is N.N.N at least. 
check_values$Date_Flag = ifelse( grepl("([0-9]+)(.)([0-9]+)([0-9]+)", 
check_values$check_memo), 1, 0 )
# Create new column with the proposed format of date
# dd . mm . yy
pat <- "[0-9][0-9][.][0-9][0-9][.][0-9][0-9]"
check_values[,Date_Flag := str_count( check_memo, pat ) == 1 ]
check_values[(Date_Flag),  paste0( "Date", 1 ) := transpose( str_extract_all( check_memo, pat ))]

Remove companies that are not delisted (without a date of delisting).

setDF( check_values )
check_values = filter( check_values, !is.na(Date1))

Merge the dead companies with the data frame

df = left_join( x = df, y = check_values, by = c("Company" = "check_memo"))

Format both dates as dates

df$Date = as.Date( df$Date, format = "%d.%m.%y")
df$Date1 = as.Date( df$Date1, format = "%d.%m.%y")

Create a new column with the returns as desired. You can remove the redundant columns.

 df$returns = ifelse(
  df$status_delisting == 1, 
 ifelse(df$Date <= df$Date1, df$r, NA), df$r )