Remove duplicate dates based on a separate column

Question

The head of my data looks like:

     IndID Event       Date
148   P01     1 2011-03-04
147   P01     1 2011-03-11
149   P01     0 2011-03-13
150   P01     0 2011-03-14
151   P01     0 2011-03-14
152   P01     0 2011-03-14

There are 3 individuals (P01, P03, P06). "Event" specifies whether an event occurred (0=no, 1=yes) "Date" is the monitoring date.

My goal is to remove duplicate dates based on the Event column for each individual. Where duplicates exist and no Event has occurred (all Event values are 0 for a given Date), then duplicates should be removed and Event should remain 0. When duplicate dates are present and there has been an Event (meaning at least one of the Event values is 1), then I am trying to removing duplicates and leave a single row for the given date with 1 in the Event column.

While this post was useful, I am looking for more specifics. Currently I am struggling with a nested for() loop that is likely overly complicated and not necessary.

Any suggestions would be greatly appreciated! My data is below:

If it is relevant, once I have a single Date for each individual, I then hope to add the dates that are missing by merging a sequence (from the first to the last date of each individual) with my existing data and add a 0 within the Event column to the new entries.

DP <- structure(list(IndID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L), .Label = c("P01", "P02", "P03", "P05", "P06", "P07", 
"P08", "P09", "P10", "P11", "P12", "P13"), class = "factor"), 
    Event = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 
    0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 
    1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L
    ), Date = structure(c(1299196800, 1299801600, 1299974400, 
    1300060800, 1300060800, 1300060800, 1300147200, 1300320000, 
    1300406400, 1300406400, 1300492800, 1300492800, 1300665600, 
    1300924800, 1301097600, 1302048000, 1302134400, 1302220800, 
    1302652800, 1302825600, 1302998400, 1303084800, 1303171200, 
    1303257600, 1303689600, 1303862400, 1304208000, 1304380800, 
    1304553600, 1304640000, 1305072000, 1305590400, 1305676800, 
    1305676800, 1305676800, 1305763200, 1305936000, 1306022400, 
    1306022400, 1306108800, 1306368000, 1323129600, 1323388800, 
    1323648000, 1323993600, 1324080000, 1324166400, 1324339200, 
    1327622400, 1327622400, 1327795200, 1327881600, 1327881600, 
    1328486400, 1328659200, 1328832000, 1329177600), class = c("POSIXct", 
    "POSIXt"), tzone = "GMT")), .Names = c("IndID", "Event", 
"Date"), row.names = c(148L, 147L, 149L, 150L, 151L, 152L, 153L, 
51L, 52L, 154L, 53L, 155L, 145L, 156L, 157L, 373L, 382L, 384L, 
361L, 385L, 349L, 387L, 386L, 388L, 341L, 360L, 339L, 334L, 366L, 
340L, 363L, 352L, 351L, 353L, 355L, 397L, 336L, 337L, 354L, 335L, 
371L, 475L, 417L, 472L, 492L, 494L, 493L, 473L, 468L, 497L, 495L, 
486L, 499L, 470L, 487L, 496L, 469L), class = "data.frame")

MrFlick · Accepted Answer

Really you're not so much removing duplicates are you are simply aggregating. It sounds like you just want the maximum Event value for each date for each individual (either 1 if there was an event or 0 if not).

aggregate(Event~Date+IndID, DP, max)

That should do the trick. If you have other columns of data to keep track of as well, try this method

inddate <- with(DP, ave(Event, Date, IndID, 
    FUN=function(x)seq_along(x)==which.max(x)))
DP[inddate==1, ]

Remove duplicate dates based on a separate column

Answers (1)

Related Questions