Reputation: 65
I have data in format similar to this:
ID Timestamp Status
1 1/1/2014 1
2 1/1/2014 1
3 1/2/2014 1
4 1/3/2014 1
1 1/3/2014 2
3 1/3/2014 2
4 1/5/2014 2
5 1/5/2014 1
1 1/6/2014 3
2 1/7/2014 3
3 1/8/2014 3
4 1/9/2014 3
5 1/10/2014 2
6 1/10/2014 1
3 1/10/2014 4
3 1/10/2014 5
3 1/10/2014 6
1 1/11/2014 4
2 1/11/2014 3
3 1/11/2014 4
3 1/11/2014 2
5 1/11/2014 3
6 1/12/2014 4
7 1/12/2014 5
5 1/12/2014 6
4 1/12/2014 7
2 1/13/2014 3
3 1/13/2014 4
1 1/14/2014 5
5 1/14/2014 3
6 1/14/2014 4
1 1/15/2014 6
1 1/16/2014 7
Each ID have to go through status from 1 to 7 in order 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
However, due to the data entry error, sometime the 4 happen before 3, 6 happen before 2, 7 happen before 1.
So can R help to identify the IDs that have status change not as the rule?
For above sample data - Only ID 1 have the correct status history 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
- Others are note correct.
Thank you very much!
Upvotes: 1
Views: 71
Reputation: 206401
Assuming it's OK for ID's to skip and duplicate statuses (just never go backwards) You could do something like this. Here's the data.frame I used to test
dd<-structure(list(ID = c(1L, 2L, 3L, 4L, 1L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L, 6L, 3L, 3L, 3L, 1L, 2L, 3L, 3L, 5L, 6L, 7L, 5L, 4L,
2L, 3L, 1L, 5L, 6L, 1L, 1L), Timestamp = structure(c(18262, 18262,
18263, 18264, 18264, 18264, 18266, 18266, 18267, 18268, 18269,
18270, 18271, 18271, 18271, 18271, 18271, 18272, 18272, 18272,
18272, 18272, 18273, 18273, 18273, 18273, 18274, 18274, 18275,
18275, 18275, 18276, 18277), class = "Date"), Status = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 3L, 3L, 3L, 2L, 1L, 4L, 5L, 6L,
4L, 3L, 4L, 2L, 3L, 4L, 5L, 6L, 7L, 3L, 4L, 5L, 3L, 4L, 6L, 7L
)), .Names = c("ID", "Timestamp", "Status"), row.names = c(NA,
-33L), class = "data.frame")
And then I define a helper function to test that all the difference in the status values are are all 1 and that we have all seven of them
isgoodseq<-function(x) {
length(x) ==7 & all(diff(x)==1) & min(x)==1
}
Now we run this for each ID (making sure the data is in the correct order)
with(dd[order(dd$Timestamp, dd$ID, dd$Status),],
tapply(Status, ID, isgoodseq))
And that gives us
1 2 3 4 5 6 7
TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Which means 1 is the only good ID
Upvotes: 1