Reputation: 131
I want to simply generate a variable that counts number of prior years an individual had its current score.
For example, if we look at personID from this reproducible example
set.seed(987)
mydata <- data.frame(
personID = rep(c(1:10), each = 10),
year= rep(c(1991:2000), each = 1),
score = sample(c(0, 1, 2), 100, replace = TRUE)
)
Here is the personI== 5
personID year score
5 1991 2
5 1992 1
5 1993 0
5 1994 0
5 1995 0
5 1996 0
5 1997 2
5 1998 0
5 1999 1
5 2000 1
What I want to generate is variable Z and it should look like this
personID year score Z
5 1991 2 0
5 1992 1 0
5 1993 0 0
5 1994 0 1
5 1995 0 2
5 1996 0 3
5 1997 2 0
5 1998 0 0
5 1999 1 0
5 2000 1 1
I have been trying do this with this code
mydata1 <- with(mydata, ave(score, personID, FUN=
function(x) cumsum(c(TRUE, diff(x)<0))))
mydata$Z <- with(mydata, ave(mydata1, mydata1, personID, FUN= seq_along)-1)
It doesn't do the work, I need to specify somehow that what I want to count is the current score (number of years when an individual had score 0, 1 or 2). The diff(x)<0
part is also wrong - I tried different things and at the end it was impossible to remove it somehow..
Upvotes: 3
Views: 81
Reputation: 7455
With grouping by runs generated by rle
found in this past answer, another possible solution using dplyr
is:
library(dplyr)
result <- mydata %>% group_by(personID,
run=with(rle(score), {rep(seq_along(lengths), lengths)})) %>%
mutate(Z=0:(n()-1)) %>% ungroup() %>% select(-run)
head(result[result$personID==5,], n=10)
## personID year score Z
## <int> <int> <dbl> <int>
##1 5 1991 2 0
##2 5 1992 1 0
##3 5 1993 0 0
##4 5 1994 0 1
##5 5 1995 0 2
##6 5 1996 0 3
##7 5 1997 2 0
##8 5 1998 0 0
##9 5 1999 1 0
##10 5 2000 1 1
Upvotes: 2
Reputation: 4427
Your example outcome does not appear to be correct. I think, the following does, what you describe in words:
tapply(mydata$score, INDEX = mydata$personID,
FUN = function(x){
Z = numeric()
for (line in 1:length(x)){
Z[line] <- sum(x[1:line]==x[line])
}
return(Z-1)
})
This will give a list, each entry of the list is a person. To change into a vector just unlist()
For ID == 5
this will return
$`5`
[1] 0 0 0 1 2 3 1 4 1 2
Upvotes: 3
Reputation: 92302
Here's a possible solution using run-length encoding using data.table
for convenience
library(data.table)
setDT(mydata)[, Z := 1:.N - 1L, by = .(personID, rleid(score))]
# Check results
mydata[personID == 5]
# personID year score Z
# 1: 5 1991 2 0
# 2: 5 1992 1 0
# 3: 5 1993 0 0
# 4: 5 1994 0 1
# 5: 5 1995 0 2
# 6: 5 1996 0 3
# 7: 5 1997 2 0
# 8: 5 1998 0 0
# 9: 5 1999 1 0
# 10: 5 2000 1 1
Or using the development version (v>=1.9.7) you could enhance it using rowid
setDT(mydata)[, Z := rowid(score) - 1L, by = .(personID, rleid(score))]
Upvotes: 5
Reputation: 38510
If my interpretation is correct, you want to count the number of previous years with the same score for each ID and set this count to 0 each time the previous score differs. If this is the case, and your data is sorted as in the example, I believe the following base R code will work.
# calculate the length of repeated scores for each ID
reps <- unlist(with(mydata, tapply(score, personID,
FUN=function(i) rle(i)$lengths)), use.names=FALSE)
# expand the lengths into a count
mydata$Z <- unlist(lapply(reps, function(i) 0:(i-1)))
To show the same output as in the question,
mydata[mydata$personID == 5,]
personID year score z
41 5 1991 2 0
42 5 1992 1 0
43 5 1993 0 0
44 5 1994 0 1
45 5 1995 0 2
46 5 1996 0 3
47 5 1997 2 0
48 5 1998 0 0
49 5 1999 1 0
50 5 2000 1 1
Upvotes: 4