Reputation: 11
I'm new to R and need some advise solving the following problem: I have a .dbf table in a data frame with dates on which a certain threshold is passed on different spatial locations which are indicated by a "PointID" in that table. It looks like this:
time PointID
04/07/71 X10Y11
04/25/71 X10Y11
04/26/71 X10Y11
05/02/71 X10Y11
05/03/71 X10Y11
05/04/71 X10Y11
05/05/71 X10Y11
05/09/71 X10Y11
05/12/71 X10Y11
05/13/71 X10Y11
05/14/71 X10Y11
05/15/71 X10Y11
05/16/71 X10Y11
05/17/71 X10Y11
05/18/71 X10Y11
05/19/71 X10Y11
05/20/71 X10Y11
05/21/71 X10Y11
05/22/71 X10Y11
05/23/71 X10Y11
05/26/71 X10Y11
10/07/71 X10Y11
10/08/71 X10Y11
10/09/71 X10Y11
10/10/71 X10Y11
10/11/71 X10Y11
10/12/71 X10Y11
10/23/71 X10Y11
10/24/71 X10Y11
10/25/71 X10Y11
10/26/71 X10Y11
10/27/71 X10Y11
10/28/71 X10Y11
11/04/71 X10Y11
03/30/72 X10Y11
04/07/72 X10Y11
04/08/72 X10Y11
04/10/72 X10Y11
04/20/72 X10Y11
04/22/72 X10Y11
04/23/72 X10Y11
04/24/72 X10Y11
04/25/72 X10Y11
04/26/72 X10Y11
04/27/72 X10Y11
04/28/72 X10Y11
04/29/72 X10Y11
04/30/72 X10Y11
05/01/72 X10Y11
05/02/72 X10Y11
05/03/72 X10Y11
05/08/72 X10Y11
05/09/72 X10Y11
05/10/72 X10Y11
05/11/72 X10Y11
10/09/72 X10Y11
10/10/72 X10Y11
10/11/72 X10Y11
10/12/72 X10Y11
10/13/72 X10Y11
10/14/72 X10Y11
10/15/72 X10Y11
10/16/72 X10Y11
10/17/72 X10Y11
10/18/72 X10Y11
10/19/72 X10Y11
01/15/73 X10Y11
01/21/73 X10Y11
03/19/73 X10Y11
03/20/73 X10Y11
03/21/73 X10Y11
03/31/73 X10Y11
04/01/73 X10Y11
04/02/73 X10Y11
04/03/73 X10Y11
04/15/73 X10Y11
03/01/71 X10Y12
04/04/71 X10Y12
04/07/71 X10Y12
04/08/71 X10Y12
04/09/71 X10Y12
04/10/71 X10Y12
04/11/71 X10Y12
04/18/71 X10Y12
04/19/71 X10Y12
04/20/71 X10Y12
04/21/71 X10Y12
04/22/71 X10Y12
04/23/71 X10Y12
04/25/71 X10Y12
04/26/71 X10Y12
04/28/71 X10Y12
05/02/71 X10Y12
05/03/71 X10Y12
05/04/71 X10Y12
05/05/71 X10Y12
05/06/71 X10Y12
05/07/71 X10Y12
05/08/71 X10Y12
05/09/71 X10Y12
05/10/71 X10Y12
07/08/71 X10Y12
07/09/71 X10Y12
07/10/71 X10Y12
07/11/71 X10Y12
07/12/71 X10Y12
11/02/71 X10Y12
11/03/71 X10Y12
11/04/71 X10Y12
02/10/72 X10Y12
02/11/72 X10Y12
03/30/72 X10Y12
04/05/72 X10Y12
04/06/72 X10Y12
04/07/72 X10Y12
04/08/72 X10Y12
04/10/72 X10Y12
04/23/72 X10Y12
04/24/72 X10Y12
04/25/72 X10Y12
04/26/72 X10Y12
04/27/72 X10Y12
04/28/72 X10Y12
04/29/72 X10Y12
04/30/72 X10Y12
05/01/72 X10Y12
05/02/72 X10Y12
05/03/72 X10Y12
05/04/72 X10Y12
05/07/72 X10Y12
05/08/72 X10Y12
05/09/72 X10Y12
05/10/72 X10Y12
05/11/72 X10Y12
05/12/72 X10Y12
05/13/72 X10Y12
05/14/72 X10Y12
05/15/72 X10Y12
05/16/72 X10Y12
05/17/72 X10Y12
08/30/72 X10Y12
08/31/72 X10Y12
09/01/72 X10Y12
09/02/72 X10Y12
09/03/72 X10Y12
09/04/72 X10Y12
09/05/72 X10Y12
09/06/72 X10Y12
Now I'm looking for a way to find the first day of the first group of six consecutive days per year and PointID. The result for the table above would ideally look something like this, with only the dates and PointIDs remaining:
time PointID
05/12/71 X10Y11
04/22/72 X10Y11
04/18/71 X10Y12
04/23/72 X10Y12
The solution should work on large datasets of 7+ millions of rows per table. Does anyone know a solution to this problem and can help me out?
Thank you!
EDIT: variables are as follows
'data.frame': 21071 obs. of 2 variables:
$ time : Date, format: "1971-03-01" "1971-04-04" "1971-04-04" "1971-04-04" ...
$ PointID: Factor w/ 5 levels "X10Y11","X10Y12",..: 2 2 3 4 5 1 2 3 5 2 ...
Upvotes: 0
Views: 69
Reputation: 57210
I don't know if efficient enough, but this is a possible solution in base R :
DF$year <- as.integer(format(DF$time,format='%Y'))
findFirstConsecutiveSixDays <- function(dates){
dates <- sort(dates)
RLE <- rle(as.numeric(dates[-length(dates)] - dates[-1]))
groupOfSixConsec <- which(RLE$values == -1 & RLE$lengths >= 5)
if(length(groupOfSixConsec) == 0)
return(as.Date(NA))
D <- dates[sum(RLE$lengths[1:groupOfSixConsec[1]])-RLE$lengths[groupOfSixConsec[1]]+1]
return(D)
}
Grouped <- aggregate(time ~ year + PointID, DF, FUN=findFirstConsecutiveSixDays)
> Grouped[complete.cases(Grouped),c('time','PointID')]
time PointID
1 1971-05-12 X10Y11
2 1972-04-22 X10Y11
4 1971-04-18 X10Y12
5 1972-04-23 X10Y12
Code to replicate the DF :
DF <- structure(list(time = structure(c(461, 479, 480, 486, 487, 488,
489, 493, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506,
507, 510, 644, 645, 646, 647, 648, 649, 660, 661, 662, 663, 664,
665, 672, 819, 827, 828, 830, 840, 842, 843, 844, 845, 846, 847,
848, 849, 850, 851, 852, 853, 858, 859, 860, 861, 1012, 1013,
1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1110, 1116,
1173, 1174, 1175, 1185, 1186, 1187, 1188, 1200, 424, 458, 461,
462, 463, 464, 465, 472, 473, 474, 475, 476, 477, 479, 480, 482,
486, 487, 488, 489, 490, 491, 492, 493, 494, 553, 554, 555, 556,
557, 670, 671, 672, 770, 771, 819, 825, 826, 827, 828, 830, 843,
844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 857, 858,
859, 860, 861, 862, 863, 864, 865, 866, 867, 972, 973, 974, 975,
976, 977, 978, 979), class = "Date"), PointID = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("X10Y11", "X10Y12"), class = "factor")), .Names = c("time",
"PointID"), row.names = c(NA, -148L), class = "data.frame")
Upvotes: 1
Reputation: 51582
Here is an idea using tidyverse
,
library(tidyverse)
df %>%
group_by(PointID) %>%
mutate(new = c(NA, diff.difftime(time)), new1 = data.table::rleid(new)) %>%
filter(new == 1) %>%
group_by(PointID, new1) %>%
summarise(cnt = n(), time = first(time)) %>%
filter(cnt >= 5) %>%
mutate(time = time - 1) %>%
group_by(PointID, time1 = format(time, format = '%Y')) %>%
slice(1L) %>%
ungroup() %>%
select(-c(new1, cnt, time1))
# A tibble: 4 x 2
# PointID time
# <fctr> <date>
#1 X10Y11 1971-05-12
#2 X10Y11 1972-04-22
#3 X10Y12 1971-04-18
#4 X10Y12 1972-04-23
Upvotes: 0