Delete zeros at end of date range for each group

Question

My data.table looks like this:

+------------+----+-----+
|    Date    | CC | RR  |
+------------+----+-----+
| 31.01.1985 | C1 | 0.1 |
| 28.02.1985 | C1 | 0.8 |
| 29.03.1985 | C1 | 0.9 |
| 30.04.1985 | C1 | 2   |
| 31.05.1985 | C1 | 0   |
| 28.06.1985 | C1 | 0   |
| 31.01.1985 | C2 | NA  |
| 28.02.1985 | C2 | 0   |
| 29.03.1985 | C2 | 0.1 |
| 30.04.1985 | C2 | 0   |
| 31.05.1985 | C2 | 0.4 |
| 28.06.1985 | C2 | 0   |
+------------+----+-----+

The CC (company code) column identifies the company, the Date column holds the dates, and the RR column holds stock returns. My actual dataset has over 2,000 companies and spans 30 years.

My problem is the following: for each company, I want to convert the trailing zeros in RR (the zeros in the last rows of that company's date range) to NA. Specifically, I want to change the value of RR from 0 to NA in the following rows:

31.05.1985 | C1
28.06.1985 | C1
28.06.1985 | C2

I do not want to change zeros at intermediate dates such as 28.02.1985 | C2 or 30.04.1985 | C2, and I want to keep existing NAs such as 31.01.1985 | C2.

I have tried to use

length(Data[, CC]) - match(unique(Data[, CC]), rev(Data[, CC])) + 1

to get the indexes of the last occurrence of each company code and then iterate backwards until the first non-zero value in the RR column, but I don't know how to apply this to a data.table in R.

Of course I could convert all zeros, but this doesn't solve my problem:

Data[RR==0, "RR"] <- NA

Could someone help me out on this? I would appreciate any help. Kind regards.

Dirk is no longer here · Accepted Answer

First off, please post real, usable data and not ASCII art. I quickly edited what you posted into a space-delimited file and read it:

R> dt <- fread("/tmp/data.txt")
R> dt
          Date CC  RR
 1: 31.01.1985 C1 0.1
 2: 28.02.1985 C1 0.8
 3: 29.03.1985 C1 0.9
 4: 30.04.1985 C1 2.0
 5: 31.05.1985 C1 0.0
 6: 28.06.1985 C1 0.0
 7: 31.01.1985 C2  NA
 8: 28.02.1985 C2 0.0
 9: 29.03.1985 C2 0.1
10: 30.04.1985 C2 0.0
11: 31.05.1985 C2 0.4
12: 28.06.1985 C2 0.0
R> dput(dt)
structure(list(Date = c("31.01.1985", "28.02.1985", "29.03.1985", 
"30.04.1985", "31.05.1985", "28.06.1985", "31.01.1985", "28.02.1985", 
"29.03.1985", "30.04.1985", "31.05.1985", "28.06.1985"), CC = c("C1", 
"C1", "C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2", "C2", "C2"
), RR = c(0.1, 0.8, 0.9, 2, 0, 0, NA, 0, 0.1, 0, 0.4, 0)), row.names = c(NA, 
-12L), class = c("data.table", "data.frame"), 
.internal.selfref = )
R>

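The dput() output can be evaluated by R and re-creates the data structure (drop the .internal.selfref attribute first, as it is a session-specific pointer and does not parse).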
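For anyone following along without the data file, the same table can also be typed in directly. This is only a minimal sketch, assuming the data.table package is installed; the values are copied from the question, and Date is kept as a character column, just as fread() read it above.

```r
library(data.table)

# Rebuild the example table from the question by hand.
# Date stays a plain character column, as in the fread() result.
dt <- data.table(
  Date = rep(c("31.01.1985", "28.02.1985", "29.03.1985",
               "30.04.1985", "31.05.1985", "28.06.1985"), times = 2),
  CC   = rep(c("C1", "C2"), each = 6),
  RR   = c(0.1, 0.8, 0.9, 2, 0, 0, NA, 0, 0.1, 0, 0.4, 0)
)
```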

Next, your data analysis. First, you want to work by group! One of the strengths of data.table is the grouping. Second, you probably want something like "max order indices while zero". The following should work:

R> dt[, rle:=rleid(RR), by="CC"]
R> dt
          Date CC  RR rle
 1: 31.01.1985 C1 0.1   1
 2: 28.02.1985 C1 0.8   2
 3: 29.03.1985 C1 0.9   3
 4: 30.04.1985 C1 2.0   4
 5: 31.05.1985 C1 0.0   5
 6: 28.06.1985 C1 0.0   5
 7: 31.01.1985 C2  NA   1
 8: 28.02.1985 C2 0.0   2
 9: 29.03.1985 C2 0.1   3
10: 30.04.1985 C2 0.0   4
11: 31.05.1985 C2 0.4   5
12: 28.06.1985 C2 0.0   6
R>

We use the super-useful rleid() function from data.table, which gives every run of consecutive equal values its own id (the same runs that rle() would report). Now, we "just" need to group again: if a row's rle value is equal to the max(rle) value of its group, it sits in the last run, and if RR is also zero there we set the value to NA.
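To make the difference between rle() and rleid() concrete, here is a small illustration on a made-up vector x (not taken from the question's data). It is just a sketch of how the run ids line up with the trailing run.

```r
library(data.table)

# Made-up vector: two leading duplicates, an interior zero, two trailing zeros.
x <- c(0.1, 0.1, 0, 0.4, 0, 0)

# rle() summarises the runs as lengths and values.
runs <- rle(x)
runs$lengths   # 2 1 1 2
runs$values    # 0.1 0.0 0.4 0.0

# rleid() labels every element with the id of the run it belongs to.
ids <- rleid(x)   # 1 1 2 3 4 4

# The trailing run is the one whose id equals max(ids), which is 4 here.
# (which.max(ids) would instead return 5, the position where that run starts.)
trailing <- ids == max(ids)   # FALSE FALSE FALSE FALSE TRUE TRUE
```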

R> dt[, ind:=max(rle), by=CC]                      # find max run id per group
R> dt[ ind==rle & RR==0.0, RR:=NA_real_, by=CC ]   # at max ind AND zero set NA
R> dt[, ind:=NULL ]                                # remove index helper
R> dt
          Date CC  RR rle
 1: 31.01.1985 C1 0.1   1
 2: 28.02.1985 C1 0.8   2
 3: 29.03.1985 C1 0.9   3
 4: 30.04.1985 C1 2.0   4
 5: 31.05.1985 C1  NA   5
 6: 28.06.1985 C1  NA   5
 7: 31.01.1985 C2  NA   1
 8: 28.02.1985 C2 0.0   2
 9: 29.03.1985 C2 0.1   3
10: 30.04.1985 C2 0.0   4
11: 31.05.1985 C2 0.4   5
12: 28.06.1985 C2  NA   6
R>

Again with just the code:

dt <- fread("/tmp/data.txt")
dt[, rle:=rleid(RR), by=CC]
dt[, ind:=max(rle), by=CC]
dt[ ind==rle & RR==0.0, RR:=NA_real_, by=CC ]
dt[, ind:=NULL ]
dt
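If you would rather not leave helper columns in the table, the same idea can be wrapped in a small function and applied per group. This is a sketch only: trailing_zeros_to_na() is a hypothetical helper name, not something data.table provides, and it assumes that RR is a double column and that the rows are already in date order within each company.

```r
library(data.table)

# Hypothetical helper (not part of data.table): turn the trailing run of
# zeros in a numeric vector into NA and leave everything else untouched.
trailing_zeros_to_na <- function(x) {
  id <- rleid(x)                 # run id for every element
  last_run <- id == max(id)      # TRUE for elements of the final run
  x[last_run & !is.na(x) & x == 0] <- NA_real_
  x
}

# Example data from the question (Date omitted; rows assumed date-ordered).
dt <- data.table(
  CC = rep(c("C1", "C2"), each = 6),
  RR = c(0.1, 0.8, 0.9, 2, 0, 0, NA, 0, 0.1, 0, 0.4, 0)
)

# Apply per company; interior zeros and existing NAs are kept as they are.
dt[, RR := trailing_zeros_to_na(RR), by = CC]
```

If the real table is not sorted, ordering it by company and a properly parsed date beforehand would be needed for "trailing" to mean what you expect.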
