Reputation: 51
Dear Stackflow Community,
My dataset consists of Patient ID and Dates where a specific laboratory value was above the threshold.
Now I would like to know how to get the minimum difference of all those dates per patient.
My goal is, to identify patients, with two or more dates which should be in a 2-year window.
According to my research plan, I would like to calculate a column to state if the minimum difference is not more than two years.
My dataset looks now similar to this:
Patient_ID Date1 Date2 Date3 Date4 Date5 Date6 Date7 Date8 Date9 Date10 Date11 Date12 Date13 Date14
1 1 13.05.19 13.09.07 13.05.09 13.01.11 13.09.12 13.05.14 13.01.16 13.09.17 14.09.17 15.09.17 16.09.17 17.09.17 18.09.17 19.09.17
2 2 13.07.18 13.11.06 13.07.08 13.03.10 13.11.11 13.07.13 13.03.15 13.11.16 13.09.07 13.07.98 13.05.89 13.03.80 13.01.71 13.11.61
3 3 13.09.17 13.01.06 13.09.07 13.05.09 13.01.11 13.09.12 13.05.14 13.01.16 13.11.06 13.09.97 13.07.88 13.05.79 13.03.70 13.01.61
4 4 13.11.16 13.03.05 13.11.06 13.07.08 13.03.10 13.11.11 13.07.13 13.03.15 13.01.06 13.11.96 13.09.87 13.07.78 13.05.69 13.03.60
5 5 13.01.16 13.05.04 13.01.06 13.09.07 13.05.09 13.01.11 13.09.12 13.05.14 13.03.05 13.01.96 13.11.86 13.09.77 13.07.68
6 6 13.03.15 13.07.03 13.03.05 13.11.06 13.07.08 13.03.10 13.11.11 13.09.07 13.05.04 13.01.01 13.09.97 13.05.94 13.01.91
7 7 13.05.14 13.09.02 13.05.04 13.01.06 13.09.07 13.05.09 13.01.11 13.11.06 13.07.03
8 8 13.07.13 13.11.01 13.07.03 13.03.05 13.11.06 13.07.08 13.03.10 13.01.06 13.09.02
9 9 13.09.12 13.01.01 13.09.02 13.09.07 13.09.12 13.09.17 13.09.22 13.03.05 13.11.01
10 10 13.11.11 13.03.00 13.11.01 13.11.06 13.11.11 13.11.16 13.11.21 13.05.04 13.01.01
11 11 13.01.11 13.05.99 13.01.01 13.01.06 13.01.11 13.01.16 13.01.21 13.07.03 13.03.00
12 12 13.03.10 13.07.98 13.03.00 13.03.05 13.03.10 13.03.15 13.05.19 13.09.02 13.05.99
13 13 13.05.09 13.09.97 13.05.99 13.05.04 13.09.07 13.05.14 13.07.18 13.11.01
14 14 13.07.08 13.11.96 14.05.99 13.07.03 13.11.06 13.07.13 13.09.17 13.01.01
15 15 13.09.07 13.01.96 15.05.99 13.09.02 13.01.06 13.09.12 13.11.16 13.03.00
16 16 13.11.06 13.03.95 16.05.99 13.11.01 13.03.05 13.11.11 13.01.16 13.05.99
17 17 13.01.06 13.05.94 17.05.99 13.01.01 13.05.04 13.01.11 13.03.15
18 18 13.03.05 13.07.93 18.05.99 13.03.00 13.07.03 13.03.10 13.05.14
19 19 13.05.04 13.09.92 19.05.99 13.05.99 13.09.02 13.05.09 13.07.13
20 20 13.07.03 13.11.91 20.05.99 14.05.99 13.11.01 13.07.08 13.09.12
21 21 13.09.02 13.01.91 21.05.99 15.05.99 13.01.01 13.09.07 13.11.11
22 22 13.11.01 13.03.90 22.05.99 16.05.99 13.03.00 13.11.06 13.01.11
23 23 13.01.01 13.05.89 23.05.99 17.05.99 13.05.99 13.01.06 13.03.10
24 24 13.03.00 13.07.88 24.05.99 18.05.99 14.05.99 13.03.05
25 25 13.05.99 13.09.87 25.05.99 19.05.99
Best wishes and thank you very much for your help, Caro
Here is the code to access the small data set at the top.
structure(list(...1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25), Patient_ID = c(1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25), Date1 = structure(c(1557705600, 1531440000,
1505260800, 1478995200, 1452643200, 1426204800, 1399939200, 1373673600,
1347494400, 1321142400, 1294876800, 1268438400, 1242172800, 1215907200,
1189641600, 1163376000, 1137110400, 1110672000, 1084406400, 1058054400,
1031875200, 1005609600, 979344000, 952905600, 926553600), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date2 = structure(c(1189641600, 1163376000,
1137110400, 1110672000, 1084406400, 1058054400, 1031875200, 1005609600,
979344000, 952905600, 926553600, 900288000, 874108800, 847843200,
821491200, 795052800, 768787200, 742521600, 716342400, 689990400,
663724800, 637286400, 611020800, 584755200, 558489600), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date3 = structure(c(1242172800, 1215907200,
1189641600, 1163376000, 1137110400, 1110672000, 1084406400, 1058054400,
1031875200, 1005609600, 979344000, 952905600, 926553600, 926640000,
926726400, 926812800, 926899200, 926985600, 927072000, 927158400,
927244800, 927331200, 927417600, 927504000, 927590400), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date4 = structure(c(1294876800, 1268438400,
1242172800, 1215907200, 1189641600, 1163376000, 1137110400, 1110672000,
1189641600, 1163376000, 1137110400, 1110672000, 1084406400, 1058054400,
1031875200, 1005609600, 979344000, 952905600, 926553600, 926640000,
926726400, 926812800, 926899200, 926985600, 927072000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date5 = structure(c(1347494400, 1321142400,
1294876800, 1268438400, 1242172800, 1215907200, 1189641600, 1163376000,
1347494400, 1321142400, 1294876800, 1268438400, 1189641600, 1163376000,
1137110400, 1110672000, 1084406400, 1058054400, 1031875200, 1005609600,
979344000, 952905600, 926553600, 926640000, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date6 = structure(c(1399939200, 1373673600,
1347494400, 1321142400, 1294876800, 1268438400, 1242172800, 1215907200,
1505260800, 1478995200, 1452643200, 1426204800, 1399939200, 1373673600,
1347494400, 1321142400, 1294876800, 1268438400, 1242172800, 1215907200,
1189641600, 1163376000, 1137110400, 1110672000, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date7 = structure(c(1452643200, 1426204800,
1399939200, 1373673600, 1347494400, 1321142400, 1294876800, 1268438400,
1663027200, 1636761600, 1610496000, 1557705600, 1531440000, 1505260800,
1478995200, 1452643200, 1426204800, 1399939200, 1373673600, 1347494400,
1321142400, 1294876800, 1268438400, NA, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date8 = structure(c(1505260800, 1478995200,
1452643200, 1426204800, 1399939200, 1189641600, 1163376000, 1137110400,
1110672000, 1084406400, 1058054400, 1031875200, 1005609600, 979344000,
952905600, 926553600, NA, NA, NA, NA, NA, NA, NA, NA, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date9 = structure(c(1505347200, 1189641600,
1163376000, 1137110400, 1110672000, 1084406400, 1058054400, 1031875200,
1005609600, 979344000, 952905600, 926553600, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Date10 = structure(c(1505433600, 900288000,
874108800, 847843200, 821491200, 979344000, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date11 = structure(c(1505520000, 611020800,
584755200, 558489600, 532224000, 874108800, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date12 = structure(c(1505606400, 321753600,
295401600, 269136000, 242956800, 768787200, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date13 = structure(c(1505692800, 32572800,
6134400, -20131200, -46396800, 663724800, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Date14 = structure(c(1505779200, -256694400,
-282960000, -309398400, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -25L), class = c("tbl_df",
"tbl", "data.frame"))
Upvotes: 1
Views: 331
Reputation: 826
If you're not concerned with efficiency, you can use the outer function. Lets say your dataset is call d
cols = colnames(d)[-1] #remove id column
resultNames = outer(cols, cols, paste, sep="_") #column names
result = outer(2:ncol(d), 2:ncol(d), function(x, y) d[,x] - d[,y]) #do pairwise differences, ignoring first column
colnames(result) <- resultNames
Now result
will have your pairwise differences. I'm not sure if you want to flag all occurrences or the first, but you can use the which function.
This isn't the most efficient however as the running time is O(n^2)
. A more efficient algorithm is using divide and conquer but you would need to code an implementation of this
Upvotes: 1
Reputation: 30559
Here is one way to approach this using tidyverse
.
First, put dates into long format (tidier). Then put the dates in chronological order with arrange
. Then remove the NA
.
After grouping by Patient_ID
, you can calculate differences between dates, and take the minimum difference. Your diff
column should be minimum difference between dates.
Does this match the expected outcome?
library(tidyverse)
df %>%
pivot_longer(cols = -c(...1, Patient_ID), names_to = "number", values_to = "date", names_pattern = "Date(\\d+)") %>%
arrange(Patient_ID, date) %>%
drop_na() %>%
group_by(Patient_ID) %>%
mutate(diff = c(NA, diff(date))) %>%
slice_min(diff, with_ties = FALSE)
Output
# A tibble: 25 x 5
# Groups: Patient_ID [25]
...1 Patient_ID number date diff
<dbl> <dbl> <chr> <dttm> <dbl>
1 1 1 9 2017-09-14 00:00:00 1
2 2 2 9 2007-09-13 00:00:00 304
3 3 3 9 2006-11-13 00:00:00 304
4 4 4 3 2006-11-13 00:00:00 304
5 5 5 9 2005-03-13 00:00:00 304
6 6 6 3 2005-03-13 00:00:00 304
7 7 7 9 2003-07-13 00:00:00 303
8 8 8 3 2003-07-13 00:00:00 303
9 9 9 5 2012-09-13 00:00:00 0
10 10 10 5 2011-11-13 00:00:00 0
11 11 11 5 2011-01-13 00:00:00 0
12 12 12 5 2010-03-13 00:00:00 0
13 13 13 3 1999-05-13 00:00:00 607
14 14 14 1 2008-07-13 00:00:00 608
15 15 15 8 2000-03-13 00:00:00 303
16 16 16 3 1999-05-16 00:00:00 3
17 17 17 4 2001-01-13 00:00:00 607
18 18 18 4 2000-03-13 00:00:00 300
19 19 19 3 1999-05-19 00:00:00 6
20 20 20 3 1999-05-20 00:00:00 6
21 21 21 3 1999-05-21 00:00:00 6
22 22 22 3 1999-05-22 00:00:00 6
23 23 23 4 1999-05-17 00:00:00 4
24 24 24 4 1999-05-18 00:00:00 4
25 25 25 4 1999-05-19 00:00:00 6
Upvotes: 2