AahuM
AahuM

Reputation: 73

Delete zeros at end of date range for each group

My data.table looks like this:

+------------+--------+
|    Date    | CC  RR |
+------------+--------+
| 31.01.1985 | C1  0.1|
| 28.02.1985 | C1  0.8|
| 29.03.1985 | C1  0.9|
| 30.04.1985 | C1  2  |
| 31.05.1985 | C1  0  |
| 28.06.1985 | C1  0  |
| 31.01.1985 | C2  NA |
| 28.02.1985 | C2  0  |
| 29.03.1985 | C2  0.1|
| 30.04.1985 | C2  0  |
| 31.05.1985 | C2  0.4|
| 28.06.1985 | C2  0  |

The CC (CompanyCode column) column has unique company codes which identify the company, the Date column has dates, and the RR column has stock returns. My actual dataset has over 2k companies and a period of 30 years.

My problem is the following: for every company I want to convert the value for RR in the last rows for each company from zeros to NAs. Specifically, I want to change the value of RR for the following rows from 0 to NA:

I do not want to change intermediate dates such as 28.02.1985 | c2 or 30.04.1985 c2, and I want to keep existing NAs such as 31.01.1985 | C2.

I have tried to use

length(Data[, CC])-match(unique(Data[,CC]),rev(Test3[,CC]))+1

to get a list of indexes of the last occurence of the respective company code and then iterate back until the first non-zero value in the RR column, but I dont know how to apply this on data.table in r...

Of course I could convert all zeros, but this doesn't solve my problem:

Date[RR==0, "RR"] <- NA

Could someone help me out on this? I would appreciate any help. Kind regards.

Upvotes: 0

Views: 111

Answers (2)

Dirk is no longer here
Dirk is no longer here

Reputation: 368449

First off, please post real useable data and not ascii art. I quickly edited what you posted in to a space-delimited file and read it:

R> dt <- fread("/tmp/data.txt")
R> dt
          Date CC  RR
 1: 31.01.1985 C1 0.1
 2: 28.02.1985 C1 0.8
 3: 29.03.1985 C1 0.9
 4: 30.04.1985 C1 2.0
 5: 31.05.1985 C1 0.0
 6: 28.06.1985 C1 0.0
 7: 31.01.1985 C2  NA
 8: 28.02.1985 C2 0.0
 9: 29.03.1985 C2 0.1
10: 30.04.1985 C2 0.0
11: 31.05.1985 C2 0.4
12: 28.06.1985 C2 0.0
R> dput(dt)
structure(list(Date = c("31.01.1985", "28.02.1985", "29.03.1985", 
"30.04.1985", "31.05.1985", "28.06.1985", "31.01.1985", "28.02.1985", 
"29.03.1985", "30.04.1985", "31.05.1985", "28.06.1985"), CC = c("C1", 
"C1", "C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2", "C2", "C2"
), RR = c(0.1, 0.8, 0.9, 2, 0, 0, NA, 0, 0.1, 0, 0.4, 0)), row.names = c(NA, 
-12L), class = c("data.table", "data.frame"), 
.internal.selfref = <pointer: 0x5601c8da9cd0>)
R> 

The dput() output can be evaluated by R and re-creates the data structure.

Next, your data analysis. First, you want to work by group! One of the strengths of data.table is the grouping. Second, you probably want something like "max order indices while zero". The following should work:

R> dt[, rle:=rleid(RR), by="CC"]
R> dt
          Date CC  RR rle
 1: 31.01.1985 C1 0.1   1
 2: 28.02.1985 C1 0.8   2
 3: 29.03.1985 C1 0.9   3
 4: 30.04.1985 C1 2.0   4
 5: 31.05.1985 C1 0.0   5
 6: 28.06.1985 C1 0.0   5
 7: 31.01.1985 C2  NA   1
 8: 28.02.1985 C2 0.0   2
 9: 29.03.1985 C2 0.1   3
10: 30.04.1985 C2 0.0   4
11: 31.05.1985 C2 0.4   5
12: 28.06.1985 C2 0.0   6
R> 

We use the super-useful rleid() function from data.table to get the index value of the rle() repetition. Now, we "just" need to group again, and if the rle value is equal to the max(rle) value we have the position we want and set the value to NA.

R> dt[, ind:=which.max(rle), by=CC]                # find max index
R> dt[ ind==rle & RR==0.0, RR:=NA_real_, by=CC ]   # at max ind AND zero set NA
R> dt[, ind:=NULL ]                                # remove index helper
R> dt
          Date CC  RR rle
 1: 31.01.1985 C1 0.1   1
 2: 28.02.1985 C1 0.8   2
 3: 29.03.1985 C1 0.9   3
 4: 30.04.1985 C1 2.0   4
 5: 31.05.1985 C1  NA   5
 6: 28.06.1985 C1  NA   5
 7: 31.01.1985 C2  NA   1
 8: 28.02.1985 C2 0.0   2
 9: 29.03.1985 C2 0.1   3
10: 30.04.1985 C2 0.0   4
11: 31.05.1985 C2 0.4   5
12: 28.06.1985 C2  NA   6
R> 

Again with just the code:

dt <- fread("/tmp/data.txt")
dt[, rle:=rleid(RR), by=CC]
dt[, ind:=which.max(rle), by=CC]
dt[ ind==rle & RR==0.0, RR:=NA_real_, by=CC ]
dt[, ind:=NULL ]
dt

Upvotes: 3

DanY
DanY

Reputation: 6073

One approach with data.table:

Sort dates from old to new and then create a cumulative sum of rr. Then all rows where the cumsum is 0, we change rr to NA. Note that because cumsum doesn't have an na.rm arg, there are a couple of extra steps.

# create example data
df <- data.frame(
    date = rep(seq(as.Date("1985-01-31"), by="day", length.out=6), 2),
    cc   = rep(c("c1", "c2"), each=6),
    rr   = c(0.1, 0.8, 0.9, 2, 0, 0, NA, 0, 0.1, 0, 0.4, 0),
    stringsAsFactors = FALSE
)

# change to data.table
library(data.table)
setDT(df)

# sort date old to new
df <- df[order(cc, -date)]

# save a copy of the rr col
df[ , rr_orig := rr]

# turn NAs into 0s because no "na.rm" arg in cumsum
df[is.na(rr), rr := 0]

# create cumsum
df[ , cumrr := cumsum(rr), by=cc]

# replace 0s with NAs in rr, wherever cumsum is 0
df[cumrr == 0, rr := NA]

# put the NAs back into rr from rr_orig
df[is.na(rr_orig), rr := NA]

# clean up by deleting rr_orig and cumrr cols, and re-sort date
df[ , c("rr_orig", "cumrr") := NULL]
df <- df[order(cc, date)]

Result

> df
          date cc  rr
 1: 1985-01-31 c1 0.1
 2: 1985-02-01 c1 0.8
 3: 1985-02-02 c1 0.9
 4: 1985-02-03 c1 2.0
 5: 1985-02-04 c1  NA
 6: 1985-02-05 c1  NA
 7: 1985-01-31 c2  NA
 8: 1985-02-01 c2 0.0
 9: 1985-02-02 c2 0.1
10: 1985-02-03 c2 0.0
11: 1985-02-04 c2 0.4
12: 1985-02-05 c2  NA

Upvotes: 2

Related Questions