Reputation: 73
My data.table looks like this:
+------------+--------+
| Date | CC RR |
+------------+--------+
| 31.01.1985 | C1 0.1|
| 28.02.1985 | C1 0.8|
| 29.03.1985 | C1 0.9|
| 30.04.1985 | C1 2 |
| 31.05.1985 | C1 0 |
| 28.06.1985 | C1 0 |
| 31.01.1985 | C2 NA |
| 28.02.1985 | C2 0 |
| 29.03.1985 | C2 0.1|
| 30.04.1985 | C2 0 |
| 31.05.1985 | C2 0.4|
| 28.06.1985 | C2 0 |
The CC
(CompanyCode column) column has unique company codes which identify the company, the Date
column has dates, and the RR
column has stock returns. My actual dataset has over 2k companies and a period of 30 years.
My problem is the following: for every company I want to convert the value for RR
in the last rows for each company from zeros to NAs. Specifically, I want to change the value of RR
for the following rows from 0 to NA:
31.05.1985 | c1
28.06.1985 | c1
28.06.1985 | c2
I do not want to change intermediate dates such as 28.02.1985 | c2
or 30.04.1985 c2
, and I want to keep existing NAs such as 31.01.1985 | C2
.
I have tried to use
length(Data[, CC])-match(unique(Data[,CC]),rev(Test3[,CC]))+1
to get a list of indexes of the last occurence of the respective company code and then iterate back until the first non-zero value in the RR column, but I dont know how to apply this on data.table in r...
Of course I could convert all zeros, but this doesn't solve my problem:
Date[RR==0, "RR"] <- NA
Could someone help me out on this? I would appreciate any help. Kind regards.
Upvotes: 0
Views: 111
Reputation: 368449
First off, please post real useable data and not ascii art. I quickly edited what you posted in to a space-delimited file and read it:
R> dt <- fread("/tmp/data.txt")
R> dt
Date CC RR
1: 31.01.1985 C1 0.1
2: 28.02.1985 C1 0.8
3: 29.03.1985 C1 0.9
4: 30.04.1985 C1 2.0
5: 31.05.1985 C1 0.0
6: 28.06.1985 C1 0.0
7: 31.01.1985 C2 NA
8: 28.02.1985 C2 0.0
9: 29.03.1985 C2 0.1
10: 30.04.1985 C2 0.0
11: 31.05.1985 C2 0.4
12: 28.06.1985 C2 0.0
R> dput(dt)
structure(list(Date = c("31.01.1985", "28.02.1985", "29.03.1985",
"30.04.1985", "31.05.1985", "28.06.1985", "31.01.1985", "28.02.1985",
"29.03.1985", "30.04.1985", "31.05.1985", "28.06.1985"), CC = c("C1",
"C1", "C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2", "C2", "C2"
), RR = c(0.1, 0.8, 0.9, 2, 0, 0, NA, 0, 0.1, 0, 0.4, 0)), row.names = c(NA,
-12L), class = c("data.table", "data.frame"),
.internal.selfref = <pointer: 0x5601c8da9cd0>)
R>
The dput()
output can be evaluated by R and re-creates the data structure.
Next, your data analysis. First, you want to work by group! One of the strengths of data.table is the grouping. Second, you probably want something like "max order indices while zero". The following should work:
R> dt[, rle:=rleid(RR), by="CC"]
R> dt
Date CC RR rle
1: 31.01.1985 C1 0.1 1
2: 28.02.1985 C1 0.8 2
3: 29.03.1985 C1 0.9 3
4: 30.04.1985 C1 2.0 4
5: 31.05.1985 C1 0.0 5
6: 28.06.1985 C1 0.0 5
7: 31.01.1985 C2 NA 1
8: 28.02.1985 C2 0.0 2
9: 29.03.1985 C2 0.1 3
10: 30.04.1985 C2 0.0 4
11: 31.05.1985 C2 0.4 5
12: 28.06.1985 C2 0.0 6
R>
We use the super-useful rleid()
function from data.table to get the index value of the rle()
repetition. Now, we "just" need to group again, and if the rle
value is equal to the max(rle)
value we have the position we want and set the value to NA
.
R> dt[, ind:=which.max(rle), by=CC] # find max index
R> dt[ ind==rle & RR==0.0, RR:=NA_real_, by=CC ] # at max ind AND zero set NA
R> dt[, ind:=NULL ] # remove index helper
R> dt
Date CC RR rle
1: 31.01.1985 C1 0.1 1
2: 28.02.1985 C1 0.8 2
3: 29.03.1985 C1 0.9 3
4: 30.04.1985 C1 2.0 4
5: 31.05.1985 C1 NA 5
6: 28.06.1985 C1 NA 5
7: 31.01.1985 C2 NA 1
8: 28.02.1985 C2 0.0 2
9: 29.03.1985 C2 0.1 3
10: 30.04.1985 C2 0.0 4
11: 31.05.1985 C2 0.4 5
12: 28.06.1985 C2 NA 6
R>
Again with just the code:
dt <- fread("/tmp/data.txt")
dt[, rle:=rleid(RR), by=CC]
dt[, ind:=which.max(rle), by=CC]
dt[ ind==rle & RR==0.0, RR:=NA_real_, by=CC ]
dt[, ind:=NULL ]
dt
Upvotes: 3
Reputation: 6073
One approach with data.table
:
Sort dates from old to new and then create a cumulative sum of rr
. Then all rows where the cumsum is 0, we change rr
to NA
. Note that because cumsum
doesn't have an na.rm
arg, there are a couple of extra steps.
# create example data
df <- data.frame(
date = rep(seq(as.Date("1985-01-31"), by="day", length.out=6), 2),
cc = rep(c("c1", "c2"), each=6),
rr = c(0.1, 0.8, 0.9, 2, 0, 0, NA, 0, 0.1, 0, 0.4, 0),
stringsAsFactors = FALSE
)
# change to data.table
library(data.table)
setDT(df)
# sort date old to new
df <- df[order(cc, -date)]
# save a copy of the rr col
df[ , rr_orig := rr]
# turn NAs into 0s because no "na.rm" arg in cumsum
df[is.na(rr), rr := 0]
# create cumsum
df[ , cumrr := cumsum(rr), by=cc]
# replace 0s with NAs in rr, wherever cumsum is 0
df[cumrr == 0, rr := NA]
# put the NAs back into rr from rr_orig
df[is.na(rr_orig), rr := NA]
# clean up by deleting rr_orig and cumrr cols, and re-sort date
df[ , c("rr_orig", "cumrr") := NULL]
df <- df[order(cc, date)]
Result
> df
date cc rr
1: 1985-01-31 c1 0.1
2: 1985-02-01 c1 0.8
3: 1985-02-02 c1 0.9
4: 1985-02-03 c1 2.0
5: 1985-02-04 c1 NA
6: 1985-02-05 c1 NA
7: 1985-01-31 c2 NA
8: 1985-02-01 c2 0.0
9: 1985-02-02 c2 0.1
10: 1985-02-03 c2 0.0
11: 1985-02-04 c2 0.4
12: 1985-02-05 c2 NA
Upvotes: 2