Reputation: 453
I’m working with a dataset that contains GPS locations for a small group of polar bears. For every bear, there should theoretically be one location every 4 hours, but unfortunately the radio collars don’t always work perfectly and there are gaps in my data.
My goal is to produce a csv that subsets the maximum number of locations between gaps for each bear. For example, if a bear’s data is composed of 100 locations, then has one gap, and then 50 locations, I only want to subset the first 100 locations in the final csv.
Here is a code to generate the kind of dataset I would use:
bears<-as.character(c(rep("bear1",times=5),rep("bear2",times=5)))
time<-c("2007-09-08 13:00:00","NA","2007-09-08 21:00:00","2007-09-09 1:00:00","NA","NA","2007-10-09 17:00:00","2007-10-09 1:00:00","NA","2007-10-09 9:00:00")
bear.data<-data.frame(bears,time)
Where:
bears refers to the individual bear.
time refers to the time at which a particular location is transmitted. When the collar fails to transmit a GPS location, this column has a value of NA.
Any help would be appreciated!!
Upvotes: 0
Views: 87
Reputation: 79238
if you were to do this in Base R,
first write a Mode function(Returns the most occurring element):
Mode <- function(x){
y <- unique(x)
y[which.max(tabulate(match(x,y)))]
}
Now write a logical function that will give the maximum ids:
max_ids <- function(x){
id <- with(rle(x),rep(seq_along(values),lengths))
id == Mode(id) # Uses the mode function above
}
Use the two functions as follows:
subset(bear.data, ave(is.na(as.Date(time)), bears, FUN = max_ids))
bears time
3 bear1 2007-09-08 21:00:00
4 bear1 2007-09-09 1:00:00
7 bear2 2007-10-09 17:00:00
8 bear2 2007-10-09 1:00:00
Upvotes: 0
Reputation: 189
You can create a function that calculates the rows of the longest non-NA sequence for a bear. This function is based on rle() and is.na() :
seq_max <- function(x) {
r <- rle(!is.na(x))
rd <- as.data.frame(unclass(r))
rd$ends <- cumsum(rd$lengths)
rd$starts <- c(1, rd$ends[-length(rd$ends)] + 1)
rd <- rd[rd$values, ]
rd <- rd[which.max(rd$lengths)[1], ]
seq(rd$starts, rd$ends)
}
Then you apply it to each bear. This is very convenient with dplyr :
library(dplyr)
bear.data %>%
group_by(bears) %>%
slice(seq_max(time))
Upvotes: 1
Reputation: 4233
The problem can be thought of as finding the maximum length of blocks of boolean values per group:
bear.data$time <- as.Date(bear.data$time)
bear.data$not_na <- !is.na(bear.data$time)
bear.data$gap <- ave(bear.data$not_na, cumsum(!bear.data$not_na), FUN = cumsum)
aggregate(gap ~ bears, FUN = max, data=bear.data)
Output
> aggregate(gap ~ bears, FUN = max, data=bear.data)
bears gap
1 bear1 2
2 bear2 3
Data
bears time
1 bear1 2007-09-08 13:00:00
2 bear1 NA
3 bear1 2007-09-08 21:00:00
4 bear1 2007-09-09 1:00:00
5 bear1 NA
6 bear2 NA
7 bear2 2007-10-09 17:00:00
8 bear2 2007-10-09 17:00:00
9 bear2 2007-10-09 1:00:00
10 bear2 NA
11 bear2 2007-10-09 9:00:00
Upvotes: 1
Reputation: 3047
bear.data <- data.frame(bears, time) %>%
mutate(time = ymd_hms(time),
helper = floor_date(time, unit = "year"),
seq = rleid(helper)) %>%
filter(!is.na(helper)) %>%
group_by(bears, seq) %>%
add_tally() %>% ungroup() %>%
group_by(bears) %>%
slice_max(n)
Upvotes: 1