Reputation: 7733
I have a data frame like the one shown below:
DF = structure(list(Age_visit = c(48, 48, 48, 49, 49, 77), Date_1 = c("8/6/2169 9:40", "8/6/2169 9:40",
"8/6/2169 9:41", "8/6/2169 9:42", "24/7/2169 8:31", "12/9/2169 10:30",
"19/6/2237 12:15"), Date_2 = c("NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA",
"NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA",
"NA-NA-NA NA:NA:NA"), person_id = c("21",
"21",
"21",
"21",
"21",
"21",
"31"
), enc_id = c("A21BC","A21BC",
"A22BC",
"A23BC",
"A24BC",
"A25BC",
"A31BC"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
dataframe
  Age_visit Date_1          Date_2            person_id enc_id
      <dbl> <chr>           <chr>             <chr>     <chr>
1        48 8/6/2169 9:40   NA-NA-NA NA:NA:NA 21        A21BC
2        48 8/6/2169 9:40   NA-NA-NA NA:NA:NA 21        A21BC
3        48 8/6/2169 9:41   NA-NA-NA NA:NA:NA 21        A22BC
4        49 8/6/2169 9:42   NA-NA-NA NA:NA:NA 21        A23BC
5        49 24/7/2169 8:31  NA-NA-NA NA:NA:NA 21        A24BC
6        77 12/9/2169 10:30 NA-NA-NA NA:NA:NA 31        A31BC
I have two rules/steps to be implemented.
Rule-1 (step-1)
First, remove duplicates based on three columns: Date_1, person_id, and enc_id.
DF[!duplicated(DF[,c('Date_1','person_id','enc_id')]),] # keeps the first of each set of exact duplicates, so row 2 (a plain copy of row 1) is dropped
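To illustrate what duplicated() flags, here is a minimal sketch on a toy data frame. The object names toy, key_cols, dup and toy_unique are made up for this example, and the values only mimic the first rows of my data.

```r
# Toy data frame (hypothetical values) with one exact repeat in the key columns
toy <- data.frame(
  Date_1    = c("8/6/2169 9:40", "8/6/2169 9:40", "8/6/2169 9:41"),
  person_id = c("21", "21", "21"),
  enc_id    = c("A21BC", "A21BC", "A22BC"),
  stringsAsFactors = FALSE
)

key_cols <- c("Date_1", "person_id", "enc_id")

# duplicated() marks every repeat after the first occurrence
dup <- duplicated(toy[, key_cols])   # FALSE TRUE FALSE

# Negating the flags keeps the first copy and drops the later one
toy_unique <- toy[!dup, ]
```

Row 3 survives this step because its Date_1 and enc_id differ slightly, which is why the second rule is needed.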
Rule-2 (step-2)
From the output of step 1, collapse near-duplicate records (notice the small differences in the Date_1 and enc_id columns) into one single record if the time difference between them is less than an hour.
For example, for person_id = 21, after step 1 the Date_1 values on 8/6/2169 are only a minute apart (9:40 --> 9:41 --> 9:42). Since the difference is less than an hour (60 mins), we collapse all of them into one single record and retain only the first one (the 9:40 record). This check is done for each subject in the dataframe.
I have already removed the exact duplicates on these columns, as shown below:
DF[!duplicated(DF[,c('Date_1','person_id','enc_id')]),]
I expect my output to look like this:
  Age_visit Date_1          Date_2            person_id enc_id
      <dbl> <chr>           <chr>             <chr>     <chr>
1        48 8/6/2169 9:40   NA-NA-NA NA:NA:NA 21        A21BC
4        49 24/7/2169 8:31  NA-NA-NA NA:NA:NA 21        A24BC
5        77 12/9/2169 10:30 NA-NA-NA NA:NA:NA 31        A31BC
Upvotes: 3
Views: 641
Reputation: 25225
A rolling join option using data.table:
DT[, c("rn", "hrago") := .(.I, Date_1 - 60 * 60)]
DT[DT[DT, on=.(person_id, Date_1=hrago), roll=-Inf, unique(rn)]]
output:
   Age_visit              Date_1 person_id enc_id rn               hrago
1:        48 2169-06-08 09:40:00        21  A21BC  1 2169-06-08 08:40:00
2:        49 2169-07-24 08:31:00        21  A24BC  5 2169-07-24 07:31:00
3:        77 2169-09-12 10:30:00        31  A31BC  6 2169-09-12 09:30:00
data:
library(data.table)
DT <- fread("Age_visit Date_1 person_id enc_id
48 8/6/2169-9:40 21 A21BC
48 8/6/2169-9:40 21 A21BC
48 8/6/2169-9:41 21 A22BC
49 8/6/2169-9:42 21 A23BC
49 24/7/2169-8:31 21 A24BC
77 12/9/2169-10:30 31 A31BC")
DT[, Date_1 := as.POSIXct(Date_1, format="%d/%m/%Y-%H:%M")]
Explanation:
1) DT[DT, on=.(person_id, Date_1=hrago), ...] is a self-join on person_id from both tables, matching Date_1 from the right table against hrago from the left table.
2) roll=-Inf rolls the observation in the right table backwards when no identical match is found for the observation in the left table.
3) unique(rn) takes the unique matched row numbers from the right table, which are then used to filter the table.
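To see what roll=-Inf does in isolation, here is a small sketch on toy tables. The tables x and i, their columns, and the values are all made up for illustration and are not the question's data.

```r
library(data.table)

# Toy tables (hypothetical values): x holds observations, i holds lookup times
x <- data.table(id = "a", t = c(10, 20, 30), val = c("p", "q", "r"))
i <- data.table(id = "a", t = c(5, 20, 25, 35))

# For each row of i, take the x row with the same id whose t is the
# nearest one at or after i's t (next observation carried backward)
res <- x[i, on = .(id, t), roll = -Inf]

res$val
# "p" "q" "r" NA  (35 has no later observation, so it stays unmatched)
```

In the answer above, each lookup time is one hour before the record's own Date_1, so every record finds at least itself and the join returns the earliest record within the preceding hour.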
Upvotes: 3
Reputation: 1724
Your question can be solved using a dplyr pipeline. Exact duplicates are removed with distinct(), and each timestamp is then compared with the previous one obtained via lag(). This must be done inside a group_by() on person_id to make sure that timestamps are not shifted to other people. It is also important to make sure the dates are sorted properly (using arrange()).
library(dplyr)
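DF %>%
  # step 1: drop exact duplicates on the three key columns
  distinct(Date_1, person_id, enc_id, .keep_all = TRUE) %>%
  mutate(Date_1 = as.POSIXct(Date_1, format = '%d/%m/%Y %H:%M')) %>%
  group_by(person_id) %>%
  arrange(Date_1) %>%
  # previous timestamp for the same person (NA for the first record)
  mutate(Date_lag = lag(Date_1)) %>%
  ungroup() %>%
  mutate(Date_diff = difftime(Date_1, Date_lag, units = 'secs')) %>%
  # step 2: keep a person's first record, or one at least an hour after the previous
  filter(is.na(Date_diff) | Date_diff >= 3600) %>%
  select(Age_visit, Date_1, Date_2, person_id, enc_id)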
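One caveat with comparing each record only to the previous one: a chain of records, each less than an hour after the one before, is collapsed completely even if the chain spans more than an hour. The question does not say which reading is intended. If the hour should instead be measured from the last retained record, a base R sketch might look like this (keep_first_in_window is a hypothetical helper written for this example, not a dplyr function):

```r
# Hypothetical helper: flag a record as kept only if it is at least
# `gap_secs` seconds after the most recently kept record
keep_first_in_window <- function(times, gap_secs = 3600) {
  keep <- logical(length(times))
  last_kept <- NA
  for (k in order(times)) {              # walk through records in time order
    secs <- as.numeric(times[k])         # POSIXct as seconds since the epoch
    if (is.na(last_kept) || secs - last_kept >= gap_secs) {
      keep[k] <- TRUE
      last_kept <- secs
    }
  }
  keep
}

# Toy timestamps (assumed values): each is 40 minutes after the previous one
times <- as.POSIXct(c("2169-06-08 09:40", "2169-06-08 10:20", "2169-06-08 11:00"),
                    format = "%Y-%m-%d %H:%M", tz = "UTC")
kept <- keep_first_in_window(times)
kept
# TRUE FALSE TRUE  (11:00 is 80 minutes after the retained 9:40 record)
```

The lag-based pipeline would keep only the 9:40 record here. If this variant is the one you want, it could probably be applied per person with group_by(person_id) followed by filter(keep_first_in_window(Date_1)).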
Upvotes: 2
Reputation: 5673
You can do both in the same step by checking the time difference between successive records. Exact duplicates have a time difference of 0:
library(dplyr)
library(lubridate)
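DF %>%
  group_by(person_id) %>%
  mutate(Date_1 = dmy_hm(Date_1)) %>%
  arrange(Date_1) %>%
  # 5000 is a dummy gap so that each person's first record is always kept;
  # the gaps are converted to seconds explicitly because diff() picks its own units
  filter(c(5000, as.numeric(diff(Date_1), units = "secs")) >= 3600)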
  Age_visit Date_1              Date_2            person_id enc_id
      <dbl> <dttm>              <chr>             <chr>     <chr>
1        48 2169-06-08 09:40:00 NA-NA-NA NA:NA:NA 21        A21BC
2        49 2169-07-24 08:31:00 NA-NA-NA NA:NA:NA 21        A24BC
3        77 2169-09-12 10:30:00 NA-NA-NA NA:NA:NA 31        A25BC
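One thing to watch when comparing diff() of date-times against a number of seconds: the result is a difftime whose units are chosen automatically from the smallest gap, so the same threshold can mean minutes or hours depending on the data. A small base R sketch with made-up timestamps (t1, t2 and the other names are only for illustration):

```r
# Toy timestamps (assumed values): one pair a minute apart, one pair three hours apart
t1 <- as.POSIXct(c("2169-06-08 09:40", "2169-06-08 09:41"),
                 format = "%Y-%m-%d %H:%M", tz = "UTC")
t2 <- as.POSIXct(c("2169-06-08 09:40", "2169-06-08 12:40"),
                 format = "%Y-%m-%d %H:%M", tz = "UTC")

gap_short <- diff(t1)
gap_long  <- diff(t2)

units(gap_short)  # "mins"
units(gap_long)   # "hours"

# Combining with a plain number drops the units, so 3 hours becomes the number 3
naive <- c(5000, gap_long) > 3600   # TRUE FALSE: the 3-hour gap looks "too short"

# Converting explicitly gives seconds regardless of the automatic choice
secs <- as.numeric(gap_long, units = "secs")   # 10800
```

With the question's data the smallest gap for person 21 is 0, so the units happen to come out as seconds, but an explicit conversion is the safer choice in general.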
There was a mistake in your data (person_id 31 was missing). Here is the one I used:
DF = structure(list(Age_visit = c(48, 48, 48, 49, 49, 77), Date_1 = c("8/6/2169 9:40", "8/6/2169 9:40",
"8/6/2169 9:41", "8/6/2169 9:42", "24/7/2169 8:31", "12/9/2169 10:30"
), Date_2 = c("NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA",
"NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA", "NA-NA-NA NA:NA:NA"
), person_id = c("21",
"21",
"21",
"21",
"21",
"31"
), enc_id = c("A21BC","A21BC",
"A22BC",
"A23BC",
"A24BC",
"A25BC"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
Upvotes: 1