Reputation: 11
I have a data table in R with 3 features as follow
DT_A <- data.table(sid=c(1,1,2,2,2,3,3,2,3,3), date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
"2014-06-23","2014-06-24","2014-06-25","2014-06-26")),
Status1 = c("A","B","A","A","B","A","A","A","B","B"))
The data looks like this
sid date Status1
1: 1 2014-06-22 A
2: 1 2014-06-23 B
3: 2 2014-06-22 A
4: 2 2014-06-23 A
5: 2 2014-06-24 B
6: 3 2014-06-22 A
7: 3 2014-06-23 A
8: 2 2014-06-24 A
9: 3 2014-06-25 B
10: 3 2014-06-26 B
How can i check the Status 1 and see if there are 3 rows in a row that has value A (like row 6,7,8) then we will delete these?
Upvotes: 0
Views: 98
Reputation: 42544
The question is tagged data.table
, so I'll try to give an appropriate answer:
DT_A[!DT_A[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
sid date Status1 1: 1 2014-06-22 A 2: 1 2014-06-23 B 3: 2 2014-06-22 A 4: 2 2014-06-23 A 5: 2 2014-06-24 B 6: 3 2014-06-25 B 7: 3 2014-06-26 B
As pointed out by Frank, my first answer (now edited) was working just for the given sample data set provided by the OP but failed for other test cases.
So, the edited code is applied to some other test cases.
Case B: 3 consecutive rows of letters A
and B
DT_B <- data.table(
sid=c(1,1,2,2,2,3,3,2,3,3,3),
date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
"2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")),
Status1 = c("A","B","A","A","B","A","A","A","B","B","B"))
DT_B
sid date Status1 1: 1 2014-06-22 A 2: 1 2014-06-23 B 3: 2 2014-06-22 A 4: 2 2014-06-23 A 5: 2 2014-06-24 B 6: 3 2014-06-22 A 7: 3 2014-06-23 A 8: 2 2014-06-24 A 9: 3 2014-06-25 B 10: 3 2014-06-26 B 11: 3 2014-06-26 B
DT_B[!DT_B[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
sid date Status1 1: 1 2014-06-22 A 2: 1 2014-06-23 B 3: 2 2014-06-22 A 4: 2 2014-06-23 A 5: 2 2014-06-24 B 6: 3 2014-06-25 B 7: 3 2014-06-26 B 8: 3 2014-06-26 B
Only the 3 consecutive rows containing letter A
(rows 6 to 8) are removed.
Case C: Nothing to remove
DT_C <- data.table(
sid=c(1,1,2,2,2,3,3,2,3,3,3),
date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
"2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")),
Status1 = c("A","B","A","A","B","A","A","C","B","B","C"))
DT_C
sid date Status1 1: 1 2014-06-22 A 2: 1 2014-06-23 B 3: 2 2014-06-22 A 4: 2 2014-06-23 A 5: 2 2014-06-24 B 6: 3 2014-06-22 A 7: 3 2014-06-23 A 8: 2 2014-06-24 C 9: 3 2014-06-25 B 10: 3 2014-06-26 B 11: 3 2014-06-26 C
DT_C[!DT_C[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
sid date Status1 1: 1 2014-06-22 A 2: 1 2014-06-23 B 3: 2 2014-06-22 A 4: 2 2014-06-23 A 5: 2 2014-06-24 B 6: 3 2014-06-22 A 7: 3 2014-06-23 A 8: 2 2014-06-24 C 9: 3 2014-06-25 B 10: 3 2014-06-26 B 11: 3 2014-06-26 C
No row is removed as there are no 3 consecutive rows containing A
.
Case D: Edge case: remove all rows
DT_D <- DT_A[6:8]
DT_D
sid date Status1 1: 3 2014-06-22 A 2: 3 2014-06-23 A 3: 2 2014-06-24 A
DT_D[!DT_D[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
Empty data.table (0 rows) of 3 cols: sid,date,Status1
All rows are removed and an empty data.table is returned because the input data.table consists only of 3 rows with letter A
.
Upvotes: 2
Reputation: 5673
I am supposing you are making a mistake in your sid definition, and that your 3 lines have all sid = 3. If not, sorry my answer will not work. If it is the case the solution can be one line:
DT_A[,.SD[.N < 3 | Status1 != "A",], by = .(sid,Status1)]
Is a simple line that does what you want : it select the data where the number of line is less than 3 or different than B in column Status1 (that is the negation of your selection you want to make to delete : at least 3 A) when grouping by sid and Status1. Hope it helps
Upvotes: 1
Reputation: 32548
with(rle(DT_A$Status1 == "A"), {
unlist(lapply(which(lengths >= 3), function(i)
(1+cumsum(lengths)[i-1]):cumsum(lengths)[i]))
})
#[1] 6 7 8
Upvotes: 1