Reputation: 79
I have a data set of equity returns for some thousand companies for the past 30 years. Some of these companies are "DEAD" (usually delisted or bankrupt), and therefore they have returns = 0. I want to assign NAs to the returns of these companies, but only after the time they actually "died". In order to do so, I have tried to use the following code:
if(Returns$r == 0 & stri_detect_fixed(Returns$Company, "DEAD"), na.rm = TRUE){
Returns$r[Returns$r == 0 & stri_detect_fixed(Returns$Company, "DEAD")] <- NA
}
This works pretty well, but unfortunately there are on occasion return values equal to 0 for DEAD/delisted companies even before they "died", and these values I want to remain at 0.
Therefore, what I need is a command/if condition telling R that I only want to return NAs if the return in the next row is equal to 0 as well. Do you guys have any suggestions? I hope I made my problem clear, though I do know my explanation may be a bit confusing.
Returns <- structure(list(Date = c("04.09.17", "05.09.17", "06.09.17", "01.09.17",
"02.09.17", "03.09.17", "04.09.17", "05.09.17", "06.09.17", "04.09.17",
"05.09.17", "06.09.17"), Company = c("ORKLA", "ORKLA", "ORKLA",
"VISMA DEAD 04.09.17", "VISMA DEAD 04.09.17", "VISMA DEAD 04.09.17",
"VISMA DEAD 04.09.17", "VISMA DEAD 04.09.17", "VISMA DEAD 04.09.17",
"XNEWCO", "XNEWCO", "XNEWCO"), r = c(0.04, 0, -0.02, 0.01, 0, -0.03,
0, 0, 0, 0.01, 0, 0)), .Names = c("Date", "Company", "r"), row.names = c(NA,
-12L), class = "data.frame")
(Edited to cover the case where a "live" company has zero returns at the end of the time series)
My data frame Returns
looks something like this:
Date Company r
04.09.17 ORKLA 0.04
05.09.17 ORKLA 0.00
06.09.17 ORKLA -0.02
01.09.17 VISMA DEAD 04.09.17 0.01
02.09.17 VISMA DEAD 04.09.17 0.00
03.09.17 VISMA DEAD 04.09.17 -0.03
04.09.17 VISMA DEAD 04.09.17 0.00
05.09.17 VISMA DEAD 04.09.17 0.00
06.09.17 VISMA DEAD 04.09.17 0.00
04.09.17 XNEWCO 0.01
05.09.17 XNEWCO 0.00
06.09.17 XNEWCO 0.00
I would want it to be like this:
Date Company r
04.09.17 ORKLA 0.04
05.09.17 ORKLA 0.00
06.09.17 ORKLA -0.02
01.09.17 VISMA DEAD 04.09.17 0.01
02.09.17 VISMA DEAD 04.09.17 0.00
03.09.17 VISMA DEAD 04.09.17 -0.03
04.09.17 VISMA DEAD 04.09.17 NA
05.09.17 VISMA DEAD 04.09.17 NA
06.09.17 VISMA DEAD 04.09.17 NA
04.09.17 XNEWCO 0.01
05.09.17 XNEWCO 0.00
06.09.17 XNEWCO 0.00
My current code (as you'll see above) wouldn't work, as it would replace the return of 0.00 for VISMA 02.09.17 with NA. I need it to remain 0.00, as this is before VISMA "died"
Upvotes: 4
Views: 1165
Reputation: 42544
Edit: While preparing a benchmark, I noticed that a condition was missing to prevent the replacement of zero values at the end of a time series also for companies which are alive. Unfortunately, this case was not covered by the original sample data provided by the OP (before editing) so it went undiscovered. I have amended below solutions accordingly.
According to OP's words I have a data set of equity returns for some thousand companies for the past 30 years, the data set may contain several millions rows (a conservative estimate: 250 business days per year * 2000 companies * 5 years of average existence = 2.5 M rows))
Thus, we need to replace a few values without copying the whole data set. data.table
allows us to update data in place.
The OP has requested to find all consecutive sequences of zeros at the end of the time series of each company and to replace these zeros by NA
.
With data.table
, there are two options here:
rleid()
functionlibrary(data.table)
# coerce to data.table
setDT(Returns)
# convert character dates
Returns[, Date := as.IDate(Date, "%d.%m.%y")][]
# make sure data is ordered
setorder(Returns, Company, Date)[]
Returns[, Company %like% "DEAD" & r == 0 & rleid(r == 0) == max(rleid(r == 0)),
by = Company]
For each dead company, the last sequence of zero values is picked:
Company V1 1: ORKLA FALSE 2: ORKLA FALSE 3: ORKLA FALSE 4: VISMA DEAD 04.09.17 FALSE 5: VISMA DEAD 04.09.17 FALSE 6: VISMA DEAD 04.09.17 FALSE 7: VISMA DEAD 04.09.17 TRUE 8: VISMA DEAD 04.09.17 TRUE 9: VISMA DEAD 04.09.17 TRUE 10: XNEWCO FALSE 11: XNEWCO FALSE 12: XNEWCO FALSE
The V1
column is used to subset and update DT
in place:
Returns[Returns[, Company %like% "DEAD" & r == 0 & rleid(r == 0) == max(rleid(r == 0)),
by = Company]$V1, r := NA_real_][]
Date Company r 1: 2017-09-04 ORKLA 0.04 2: 2017-09-05 ORKLA 0.00 3: 2017-09-06 ORKLA -0.02 4: 2017-09-01 VISMA DEAD 04.09.17 0.01 5: 2017-09-02 VISMA DEAD 04.09.17 0.00 6: 2017-09-03 VISMA DEAD 04.09.17 -0.03 7: 2017-09-04 VISMA DEAD 04.09.17 NA 8: 2017-09-05 VISMA DEAD 04.09.17 NA 9: 2017-09-06 VISMA DEAD 04.09.17 NA 10: 2017-09-04 XNEWCO 0.01 11: 2017-09-05 XNEWCO 0.00 12: 2017-09-06 XNEWCO 0.00
Returns[, {tmp <- last(which(r != 0))
if (Company %like% "DEAD" & tmp < .N) .I[seq.int(tmp + 1L, .N)]}, by = Company]
Here, the position of the last non-zero value of each time series is picked which is used to construct the indices for the remaining zero values for dead companies. .I
and .N
are special symbols in data.table
syntax. The check if (Company %like% "DEAD" & tmp < .N)
is required in case there are no zero values at the end of a time series of a dead company.
Company V1 1: VISMA DEAD 04.09.17 7 2: VISMA DEAD 04.09.17 8 3: VISMA DEAD 04.09.17 9
As above, V1
is used to subset and update Returns
in place:
Returns[Returns[, {tmp <- last(which(r != 0))
if (Company %like% "DEAD" & tmp < .N) .I[seq.int(tmp + 1L, .N)]},
by = Company]$V1, r := NA_real_][]
Date Company r 1: 2017-09-04 ORKLA 0.04 2: 2017-09-05 ORKLA 0.00 3: 2017-09-06 ORKLA -0.02 4: 2017-09-01 VISMA DEAD 04.09.17 0.01 5: 2017-09-02 VISMA DEAD 04.09.17 0.00 6: 2017-09-03 VISMA DEAD 04.09.17 -0.03 7: 2017-09-04 VISMA DEAD 04.09.17 NA 8: 2017-09-05 VISMA DEAD 04.09.17 NA 9: 2017-09-06 VISMA DEAD 04.09.17 NA 10: 2017-09-04 XNEWCO 0.01 11: 2017-09-05 XNEWCO 0.00 12: 2017-09-06 XNEWCO 0.00
Hack-R claimed that his solutions should perform well for up to perhaps 1M rows. So, I wanted to verify that claim with a benchmark.
library(data.table)
# create benchmark data
n_days <- 100L
n_comp <- 100L
n_dead <- round(0.1 * n_comp) # 10 per cent of companies are dead
Date <- seq(from = as.IDate("2015-01-01"), length.out = n_days, by = "1 day")
# company "names" consist of 4 digits at least
Company <- sprintf("%04i", seq_len(n_comp))
# cross join to create all combinations
Returns <- CJ(Date = Date, Company = Company)
set.seed(1L) # reuired for reproducible result
Returns[, r := round(rnorm(.N)/10.0, 2L)][]
# dead companies
dead <- data.table(Company = sample(Company, n_dead),
dead.date = sample(Date, n_dead))
# modify Returns
Returns[dead, on = .(Company, Date >= dead.date), r := 0]
# modify compay names
Returns[dead, on = "Company", Company := paste(Company, "DEAD", dead.date)]
# IMPORTANT: set order
setorder(Returns, Company, Date)
# keep original version
R0 <- copy(Returns)
microbenchmark::microbenchmark(
copy = Returns <- copy(R0),
hackr1 = {
mydat <- setDF(copy(R0))
for(i in 1:nrow(mydat)){
if(i==nrow(mydat) & mydat$r[i]==0) {mydat$r[i] <- NA
} else if(!is.na(mydat$r[i]) & mydat$r[i]==0 & mydat$r[i+1]==0) mydat$r[i] <- NA
}
res_hackr1 <- mydat
},
hackr2 = {
mydat <- copy(R0)
tmp0 <- mydat[0,]
for(c in unique(mydat$Company)){
tmp <- mydat[mydat$Company==c,]
for(i in 1:nrow(tmp)){
if(!is.na(tmp$r[i]) & tmp$r[i]==0 & tmp$r[i+1]==0){
tmp$r[i:nrow(tmp)] <- NA
}
}
tmp0 <- rbind(tmp0, tmp)
}
res_hackr2 <- tmp0
},
dt_rleid1 = {
Returns <- copy(R0)
Returns[Returns[, Company %like% "DEAD" & r == 0 & rleid(r == 0) == max(rleid(r == 0)),
by = Company]$V1, r := NA_real_]
res_dt_rleid1 <- copy(Returns)
},
dt_rleid2 = {
Returns <- copy(R0)
Returns[Company %like% "DEAD" & Returns[, r == 0 & rleid(r == 0) == max(rleid(r == 0)),
by = Company]$V1, r := NA_real_]
res_dt_rleid2 <- copy(Returns)
},
dt_last = {
Returns <- copy(R0)
Returns[Returns[, {
tmp <- last(which(r != 0))
if (Company %like% "DEAD") .I[tmp + seq_len(.N - tmp)]
},
by = Company]$V1, r := NA_real_]
res_dt_last <- copy(Returns)
},
dt_last2 = {
Returns <- copy(R0)
Returns[Returns[, {
tmp <- last(which(r != 0))
if (Company %like% "DEAD" & tmp < .N) .I[seq.int(tmp + 1L, .N)]
},
by = Company]$V1, r := NA_real_]
res_dt_last2 <- copy(Returns)
},
times = 11L
)
As the codes modify the data set in place, copy()
is used to create a "fresh" unmodified data set before each run and also to store the result for later comparison. Therefore, copy()
is timed as well.
dt_rleid1
and dt_rleid2
as well as dt_last
and dt_last2
are code variations of the respective solutions.
Unfortunately, hackr2 stopped execution with the error message:
Error in if (!is.na(tmp$r[i]) & tmp$r[i] == 0 & tmp$r[i + 1] == 0) { :
missing value where TRUE/FALSE needed
The timings for the remaining solutions are:
Unit: microseconds expr min lq mean median uq max neval cld copy 46.065 48.331 53.75427 52.485 58.1475 66.077 11 a hackr1 267515.143 269559.179 277240.15827 271093.857 275196.8435 329919.874 11 b dt_rleid1 2203.942 2404.060 3130.73218 2690.267 3728.9925 4813.783 11 a dt_rleid2 2577.370 2665.346 5750.63073 2700.839 2741.0510 36395.429 11 a dt_last 1605.098 1627.564 1718.85318 1654.561 1724.6030 2036.296 11 a dt_last2 1665.134 1718.372 1945.67645 1764.438 1769.5350 3909.476 11 a
The data.table
solutions are two magnitudes faster than Hack-R's approach for a rather small problem size of 100 x 100 = 10 K rows. I tried to run Hack-R's solution for 1000 x 1000 = 1 M rows but I had not the patience to wait for the result.
For 1 M rows, the approach finding the last non-zero value is about 5 to 6 times faster the the rleid()
approach.
Unit: milliseconds expr min lq mean median uq max neval cld copy 6.602008 6.843094 21.23383 7.297889 13.61614 141.5794 11 a dt_rleid1 63.282609 70.239165 142.21568 193.972143 199.32077 224.5657 11 b dt_rleid2 157.939571 281.185658 266.62148 288.184692 291.61445 309.5796 11 c dt_last 35.826792 39.198781 101.66298 48.387030 172.40187 182.2354 11 b dt_last2 36.507194 43.754676 103.95414 48.879018 173.66035 183.1639 11 b
Upvotes: 2
Reputation: 23214
This can be done by modifying the data in place:
# Please use dput() or a reproducible way of sharing your data
mydat <-
read.table(text="Date Company r
'04.09.17' ORKLA 0.04
'05.09.17' ORKLA 0.00
'06.09.17' ORKLA -0.02
'01.09.17' VISMA 0.01
'02.09.17' VISMA 0.00
'03.09.17' VISMA -0.03
'04.09.17' VISMA 0.00
'05.09.17' VISMA 0.00
'06.09.17' VISMA 0.00",header=T)
for(i in 1:nrow(mydat)){
if(i==nrow(mydat) & mydat$r[i]==0) {mydat$r[i] <- NA
} else if(!is.na(mydat$r[i]) & mydat$r[i]==0 & mydat$r[i+1]==0) mydat$r[i] <- NA
}
Date Company r 1 04.09.17 ORKLA 0.04 2 05.09.17 ORKLA 0.00 3 06.09.17 ORKLA -0.02 4 01.09.17 VISMA 0.01 5 02.09.17 VISMA 0.00 6 03.09.17 VISMA -0.03 7 04.09.17 VISMA NA 8 05.09.17 VISMA NA 9 06.09.17 VISMA NA
The logic says this:
If the value r
is not already NA (which we have to check because you can't do a logical evaluation on an NA) and the value is 0 now and in the next row, then the company is dead, so make r
= NA.
If this is the last row of the data set and the value is 0 then I can't look in the future, so I'll assume it's dead. Change NA
to 0
to reverse this assumption. We can also add some company-level logic to improve this, if we like:
# Same result as above, but handles the last row better by considering company
tmp0 <- mydat[0,]
for(c in unique(mydat$Company)){
tmp <- mydat[mydat$Company==c,]
for(i in 1:nrow(tmp)){
if(!is.na(tmp$r[i]) & tmp$r[i]==0 & tmp$r[i+1]==0){
tmp$r[i:nrow(tmp)] <- NA
}
}
tmp0 <- rbind(tmp0, tmp)
}
tmp0
I like the logic of the 2nd way marginally better, but both should work and should perform well for up to perhaps 1M rows. If you want to go beyond that we can just sapply
the same logic instead of using a loop, and/or use any number of big data types like tibble
or data.table
.
Upvotes: 0
Reputation: 139
There might be a simpler solution, but I take it step by step the long way without a loop / function.
library( data.table )
library( stringr )
# Create a dummy variable **status_delisting** to show if the company is dead.
df$status_delisting = ifelse( grepl( "DEAD", df$Company ), 1, 0 )
# Find names with numbers in it, check if the numbers are dates and convert to format. Sometimes the company has numbers in the name.
df$Company = as.character( df$Company )
check_values = c( unique( df$Company ) )
setDT(check_values)
names( check_values ) = "check_memo"
# You might need this as well.
# Sys.getlocale()
# Sys.setlocale(locale="C")
# Check if there are dates in the name
# The date format we check is N.N.N at least.
check_values$Date_Flag = ifelse( grepl("([0-9]+)(.)([0-9]+)([0-9]+)",
check_values$check_memo), 1, 0 )
# Create new column with the proposed format of date
# dd . mm . yy
pat <- "[0-9][0-9][.][0-9][0-9][.][0-9][0-9]"
check_values[,Date_Flag := str_count( check_memo, pat ) == 1 ]
check_values[(Date_Flag), paste0( "Date", 1 ) := transpose( str_extract_all( check_memo, pat ))]
Remove companies that are not delisted (without a date of delisting).
setDF( check_values )
check_values = filter( check_values, !is.na(Date1))
Merge the dead companies with the data frame
df = left_join( x = df, y = check_values, by = c("Company" = "check_memo"))
Format both dates as dates
df$Date = as.Date( df$Date, format = "%d.%m.%y")
df$Date1 = as.Date( df$Date1, format = "%d.%m.%y")
Create a new column with the returns as desired. You can remove the redundant columns.
df$returns = ifelse(
df$status_delisting == 1,
ifelse(df$Date <= df$Date1, df$r, NA), df$r )
Upvotes: 0