Reputation: 3441
The head of my data looks like:
IndID Event Date
148 P01 1 2011-03-04
147 P01 1 2011-03-11
149 P01 0 2011-03-13
150 P01 0 2011-03-14
151 P01 0 2011-03-14
152 P01 0 2011-03-14
There are 3 individuals (P01, P03, P06). "Event" specifies whether an event occurred (0=no, 1=yes) "Date" is the monitoring date.
My goal is to remove duplicate dates based on the Event column for each individual. Where duplicates exist and no Event has occurred (all Event values are 0 for a given Date), then duplicates should be removed and Event should remain 0. When duplicate dates are present and there has been an Event (meaning at least one of the Event values is 1), then I am trying to removing duplicates and leave a single row for the given date with 1 in the Event column.
While this post was useful, I am looking for more specifics. Currently I am struggling with a nested for()
loop that is likely overly complicated and not necessary.
Any suggestions would be greatly appreciated! My data is below:
If it is relevant, once I have a single Date for each individual, I then hope to add the dates that are missing by merging a sequence (from the first to the last date of each individual) with my existing data and add a 0 within the Event column to the new entries.
DP <- structure(list(IndID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L), .Label = c("P01", "P02", "P03", "P05", "P06", "P07",
"P08", "P09", "P10", "P11", "P12", "P13"), class = "factor"),
Event = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L,
1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L
), Date = structure(c(1299196800, 1299801600, 1299974400,
1300060800, 1300060800, 1300060800, 1300147200, 1300320000,
1300406400, 1300406400, 1300492800, 1300492800, 1300665600,
1300924800, 1301097600, 1302048000, 1302134400, 1302220800,
1302652800, 1302825600, 1302998400, 1303084800, 1303171200,
1303257600, 1303689600, 1303862400, 1304208000, 1304380800,
1304553600, 1304640000, 1305072000, 1305590400, 1305676800,
1305676800, 1305676800, 1305763200, 1305936000, 1306022400,
1306022400, 1306108800, 1306368000, 1323129600, 1323388800,
1323648000, 1323993600, 1324080000, 1324166400, 1324339200,
1327622400, 1327622400, 1327795200, 1327881600, 1327881600,
1328486400, 1328659200, 1328832000, 1329177600), class = c("POSIXct",
"POSIXt"), tzone = "GMT")), .Names = c("IndID", "Event",
"Date"), row.names = c(148L, 147L, 149L, 150L, 151L, 152L, 153L,
51L, 52L, 154L, 53L, 155L, 145L, 156L, 157L, 373L, 382L, 384L,
361L, 385L, 349L, 387L, 386L, 388L, 341L, 360L, 339L, 334L, 366L,
340L, 363L, 352L, 351L, 353L, 355L, 397L, 336L, 337L, 354L, 335L,
371L, 475L, 417L, 472L, 492L, 494L, 493L, 473L, 468L, 497L, 495L,
486L, 499L, 470L, 487L, 496L, 469L), class = "data.frame")
Upvotes: 0
Views: 99
Reputation: 206232
Really you're not so much removing duplicates are you are simply aggregating. It sounds like you just want the maximum Event value for each date for each individual (either 1 if there was an event or 0 if not).
aggregate(Event~Date+IndID, DP, max)
That should do the trick. If you have other columns of data to keep track of as well, try this method
inddate <- with(DP, ave(Event, Date, IndID,
FUN=function(x)seq_along(x)==which.max(x)))
DP[inddate==1, ]
Upvotes: 1