Chr.Hnm
Chr.Hnm

Reputation: 11

Find first day of first group of six consecutive days per year and coordinate?

I'm new to R and need some advise solving the following problem: I have a .dbf table in a data frame with dates on which a certain threshold is passed on different spatial locations which are indicated by a "PointID" in that table. It looks like this:

time    PointID
04/07/71    X10Y11
04/25/71    X10Y11
04/26/71    X10Y11
05/02/71    X10Y11
05/03/71    X10Y11
05/04/71    X10Y11
05/05/71    X10Y11
05/09/71    X10Y11
05/12/71    X10Y11
05/13/71    X10Y11
05/14/71    X10Y11
05/15/71    X10Y11
05/16/71    X10Y11
05/17/71    X10Y11
05/18/71    X10Y11
05/19/71    X10Y11
05/20/71    X10Y11
05/21/71    X10Y11
05/22/71    X10Y11
05/23/71    X10Y11
05/26/71    X10Y11
10/07/71    X10Y11
10/08/71    X10Y11
10/09/71    X10Y11
10/10/71    X10Y11
10/11/71    X10Y11
10/12/71    X10Y11
10/23/71    X10Y11
10/24/71    X10Y11
10/25/71    X10Y11
10/26/71    X10Y11
10/27/71    X10Y11
10/28/71    X10Y11
11/04/71    X10Y11
03/30/72    X10Y11
04/07/72    X10Y11
04/08/72    X10Y11
04/10/72    X10Y11
04/20/72    X10Y11
04/22/72    X10Y11
04/23/72    X10Y11
04/24/72    X10Y11
04/25/72    X10Y11
04/26/72    X10Y11
04/27/72    X10Y11
04/28/72    X10Y11
04/29/72    X10Y11
04/30/72    X10Y11
05/01/72    X10Y11
05/02/72    X10Y11
05/03/72    X10Y11
05/08/72    X10Y11
05/09/72    X10Y11
05/10/72    X10Y11
05/11/72    X10Y11
10/09/72    X10Y11
10/10/72    X10Y11
10/11/72    X10Y11
10/12/72    X10Y11
10/13/72    X10Y11
10/14/72    X10Y11
10/15/72    X10Y11
10/16/72    X10Y11
10/17/72    X10Y11
10/18/72    X10Y11
10/19/72    X10Y11
01/15/73    X10Y11
01/21/73    X10Y11
03/19/73    X10Y11
03/20/73    X10Y11
03/21/73    X10Y11
03/31/73    X10Y11
04/01/73    X10Y11
04/02/73    X10Y11
04/03/73    X10Y11
04/15/73    X10Y11
03/01/71    X10Y12
04/04/71    X10Y12
04/07/71    X10Y12
04/08/71    X10Y12
04/09/71    X10Y12
04/10/71    X10Y12
04/11/71    X10Y12
04/18/71    X10Y12
04/19/71    X10Y12
04/20/71    X10Y12
04/21/71    X10Y12
04/22/71    X10Y12
04/23/71    X10Y12
04/25/71    X10Y12
04/26/71    X10Y12
04/28/71    X10Y12
05/02/71    X10Y12
05/03/71    X10Y12
05/04/71    X10Y12
05/05/71    X10Y12
05/06/71    X10Y12
05/07/71    X10Y12
05/08/71    X10Y12
05/09/71    X10Y12
05/10/71    X10Y12
07/08/71    X10Y12
07/09/71    X10Y12
07/10/71    X10Y12
07/11/71    X10Y12
07/12/71    X10Y12
11/02/71    X10Y12
11/03/71    X10Y12
11/04/71    X10Y12
02/10/72    X10Y12
02/11/72    X10Y12
03/30/72    X10Y12
04/05/72    X10Y12
04/06/72    X10Y12
04/07/72    X10Y12
04/08/72    X10Y12
04/10/72    X10Y12
04/23/72    X10Y12
04/24/72    X10Y12
04/25/72    X10Y12
04/26/72    X10Y12
04/27/72    X10Y12
04/28/72    X10Y12
04/29/72    X10Y12
04/30/72    X10Y12
05/01/72    X10Y12
05/02/72    X10Y12
05/03/72    X10Y12
05/04/72    X10Y12
05/07/72    X10Y12
05/08/72    X10Y12
05/09/72    X10Y12
05/10/72    X10Y12
05/11/72    X10Y12
05/12/72    X10Y12
05/13/72    X10Y12
05/14/72    X10Y12
05/15/72    X10Y12
05/16/72    X10Y12
05/17/72    X10Y12
08/30/72    X10Y12
08/31/72    X10Y12
09/01/72    X10Y12
09/02/72    X10Y12
09/03/72    X10Y12
09/04/72    X10Y12
09/05/72    X10Y12
09/06/72    X10Y12

Now I'm looking for a way to find the first day of the first group of six consecutive days per year and PointID. The result for the table above would ideally look something like this, with only the dates and PointIDs remaining:

time    PointID
05/12/71    X10Y11
04/22/72    X10Y11
04/18/71    X10Y12
04/23/72    X10Y12

The solution should work on large datasets of 7+ millions of rows per table. Does anyone know a solution to this problem and can help me out?

Thank you!

EDIT: variables are as follows

'data.frame':   21071 obs. of  2 variables:
 $ time   : Date, format: "1971-03-01" "1971-04-04" "1971-04-04" "1971-04-04" ...
 $ PointID: Factor w/ 5 levels "X10Y11","X10Y12",..: 2 2 3 4 5 1 2 3 5 2 ...

Upvotes: 0

Views: 69

Answers (2)

digEmAll
digEmAll

Reputation: 57210

I don't know if efficient enough, but this is a possible solution in base R :

DF$year <- as.integer(format(DF$time,format='%Y'))

findFirstConsecutiveSixDays <- function(dates){
  dates <- sort(dates)
  RLE <- rle(as.numeric(dates[-length(dates)] - dates[-1]))
  groupOfSixConsec <- which(RLE$values == -1 & RLE$lengths >= 5)
  if(length(groupOfSixConsec) == 0)
    return(as.Date(NA))
  D <- dates[sum(RLE$lengths[1:groupOfSixConsec[1]])-RLE$lengths[groupOfSixConsec[1]]+1]
  return(D)
}

Grouped <- aggregate(time ~ year + PointID, DF, FUN=findFirstConsecutiveSixDays)

> Grouped[complete.cases(Grouped),c('time','PointID')]
        time PointID
1 1971-05-12  X10Y11
2 1972-04-22  X10Y11
4 1971-04-18  X10Y12
5 1972-04-23  X10Y12

Code to replicate the DF :

DF <- structure(list(time = structure(c(461, 479, 480, 486, 487, 488, 
489, 493, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 
507, 510, 644, 645, 646, 647, 648, 649, 660, 661, 662, 663, 664, 
665, 672, 819, 827, 828, 830, 840, 842, 843, 844, 845, 846, 847, 
848, 849, 850, 851, 852, 853, 858, 859, 860, 861, 1012, 1013, 
1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1110, 1116, 
1173, 1174, 1175, 1185, 1186, 1187, 1188, 1200, 424, 458, 461, 
462, 463, 464, 465, 472, 473, 474, 475, 476, 477, 479, 480, 482, 
486, 487, 488, 489, 490, 491, 492, 493, 494, 553, 554, 555, 556, 
557, 670, 671, 672, 770, 771, 819, 825, 826, 827, 828, 830, 843, 
844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 857, 858, 
859, 860, 861, 862, 863, 864, 865, 866, 867, 972, 973, 974, 975, 
976, 977, 978, 979), class = "Date"), PointID = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L), .Label = c("X10Y11", "X10Y12"), class = "factor")), .Names = c("time", 
"PointID"), row.names = c(NA, -148L), class = "data.frame")

Upvotes: 1

Sotos
Sotos

Reputation: 51582

Here is an idea using tidyverse,

library(tidyverse)

df %>% 
  group_by(PointID) %>% 
  mutate(new = c(NA, diff.difftime(time)), new1 = data.table::rleid(new)) %>% 
  filter(new == 1) %>% 
  group_by(PointID, new1) %>% 
  summarise(cnt = n(), time = first(time)) %>% 
  filter(cnt >= 5) %>% 
  mutate(time = time - 1) %>% 
  group_by(PointID, time1 = format(time, format = '%Y')) %>% 
  slice(1L) %>% 
  ungroup() %>% 
  select(-c(new1, cnt, time1))

# A tibble: 4 x 2
#  PointID       time
#   <fctr>     <date>
#1  X10Y11 1971-05-12
#2  X10Y11 1972-04-22
#3  X10Y12 1971-04-18
#4  X10Y12 1972-04-23

Upvotes: 0

Related Questions