Boris
Boris

Reputation: 131

How to generate number of prior year an individual had its current X score?

I want to simply generate a variable that counts number of prior years an individual had its current score.

For example, if we look at personID from this reproducible example

set.seed(987)
mydata <- data.frame(
  personID  = rep(c(1:10), each = 10),
  year= rep(c(1991:2000), each = 1),
  score = sample(c(0, 1, 2), 100, replace = TRUE)
)

Here is the personI== 5

personID    year    score
5           1991    2
5           1992    1
5           1993    0
5           1994    0
5           1995    0
5           1996    0
5           1997    2
5           1998    0
5           1999    1
5           2000    1

What I want to generate is variable Z and it should look like this

personID    year    score   Z
5           1991       2    0
5           1992       1    0
5           1993       0    0
5           1994       0    1
5           1995       0    2
5           1996       0    3
5           1997       2    0
5           1998       0    0
5           1999       1    0
5           2000       1    1

I have been trying do this with this code

mydata1 <- with(mydata, ave(score, personID, FUN= 
                      function(x) cumsum(c(TRUE, diff(x)<0))))
mydata$Z <- with(mydata, ave(mydata1, mydata1, personID, FUN= seq_along)-1)

It doesn't do the work, I need to specify somehow that what I want to count is the current score (number of years when an individual had score 0, 1 or 2). The diff(x)<0part is also wrong - I tried different things and at the end it was impossible to remove it somehow..

Upvotes: 3

Views: 81

Answers (4)

aichao
aichao

Reputation: 7455

With grouping by runs generated by rle found in this past answer, another possible solution using dplyr is:

library(dplyr)
result <- mydata %>% group_by(personID, 
                              run=with(rle(score), {rep(seq_along(lengths), lengths)})) %>%
                     mutate(Z=0:(n()-1)) %>% ungroup() %>% select(-run)
head(result[result$personID==5,], n=10)
##   personID  year score     Z
##      <int> <int> <dbl> <int>
##1         5  1991     2     0
##2         5  1992     1     0
##3         5  1993     0     0
##4         5  1994     0     1
##5         5  1995     0     2
##6         5  1996     0     3
##7         5  1997     2     0
##8         5  1998     0     0
##9         5  1999     1     0
##10        5  2000     1     1

Upvotes: 2

Bernhard
Bernhard

Reputation: 4427

Your example outcome does not appear to be correct. I think, the following does, what you describe in words:

tapply(mydata$score, INDEX = mydata$personID, 
       FUN = function(x){ 
               Z = numeric()
               for (line in 1:length(x)){
                  Z[line] <- sum(x[1:line]==x[line])
                  }
               return(Z-1)
               })

This will give a list, each entry of the list is a person. To change into a vector just unlist()

For ID == 5 this will return

$`5`
 [1] 0 0 0 1 2 3 1 4 1 2

Upvotes: 3

David Arenburg
David Arenburg

Reputation: 92302

Here's a possible solution using run-length encoding using data.table for convenience

library(data.table)
setDT(mydata)[, Z := 1:.N - 1L, by = .(personID, rleid(score))]

# Check results
mydata[personID == 5]
#     personID year score Z
#  1:        5 1991     2 0
#  2:        5 1992     1 0
#  3:        5 1993     0 0
#  4:        5 1994     0 1
#  5:        5 1995     0 2
#  6:        5 1996     0 3
#  7:        5 1997     2 0
#  8:        5 1998     0 0
#  9:        5 1999     1 0
# 10:        5 2000     1 1

Or using the development version (v>=1.9.7) you could enhance it using rowid

setDT(mydata)[, Z := rowid(score) - 1L, by = .(personID, rleid(score))]

Upvotes: 5

lmo
lmo

Reputation: 38510

If my interpretation is correct, you want to count the number of previous years with the same score for each ID and set this count to 0 each time the previous score differs. If this is the case, and your data is sorted as in the example, I believe the following base R code will work.

# calculate the length of repeated scores for each ID
reps <- unlist(with(mydata, tapply(score, personID,
                                   FUN=function(i) rle(i)$lengths)), use.names=FALSE)
# expand the lengths into a count
mydata$Z <- unlist(lapply(reps, function(i) 0:(i-1)))

To show the same output as in the question,

mydata[mydata$personID == 5,]
   personID year score z
41        5 1991     2 0
42        5 1992     1 0
43        5 1993     0 0
44        5 1994     0 1
45        5 1995     0 2
46        5 1996     0 3
47        5 1997     2 0
48        5 1998     0 0
49        5 1999     1 0
50        5 2000     1 1

Upvotes: 4

Related Questions