Reputation: 453

R: finding the maximum number of rows between NA values

I’m working with a dataset that contains GPS locations for a small group of polar bears. For every bear, there should theoretically be one location every 4 hours, but unfortunately the radio collars don’t always work perfectly and there are gaps in my data.

My goal is to produce a csv that subsets the maximum number of locations between gaps for each bear. For example, if a bear’s data is composed of 100 locations, then has one gap, and then 50 locations, I only want to subset the first 100 locations in the final csv.

Here is a code to generate the kind of dataset I would use:

bears<-as.character(c(rep("bear1",times=5),rep("bear2",times=5)))
time<-c("2007-09-08 13:00:00","NA","2007-09-08 21:00:00","2007-09-09 1:00:00","NA","NA","2007-10-09 17:00:00","2007-10-09 1:00:00","NA","2007-10-09 9:00:00")
bear.data<-data.frame(bears,time)

Where:

bears refers to the individual bear.
time refers to the time at which a particular location is transmitted. When the collar fails to transmit a GPS location, this column has a value of NA.

Any help would be appreciated!!

Upvotes: 0

Answers (4)

Onyambu

Reputation: 79238

if you were to do this in Base R,

first write a Mode function(Returns the most occurring element):

Mode <- function(x){
  y <- unique(x)
  y[which.max(tabulate(match(x,y)))]
}

Now write a logical function that will give the maximum ids:

max_ids <- function(x){
  id <- with(rle(x),rep(seq_along(values),lengths))
  id == Mode(id) # Uses the mode function above
}

Use the two functions as follows:

subset(bear.data, ave(is.na(as.Date(time)), bears, FUN = max_ids))
  bears                time
3 bear1 2007-09-08 21:00:00
4 bear1  2007-09-09 1:00:00
7 bear2 2007-10-09 17:00:00
8 bear2  2007-10-09 1:00:00

Upvotes: 0

py_b

Reputation: 189

You can create a function that calculates the rows of the longest non-NA sequence for a bear. This function is based on rle() and is.na() :

seq_max <- function(x) {

  r <- rle(!is.na(x))
  rd <- as.data.frame(unclass(r))
  
  rd$ends <- cumsum(rd$lengths)
  rd$starts <- c(1, rd$ends[-length(rd$ends)] + 1)
  rd <- rd[rd$values, ]
  rd <- rd[which.max(rd$lengths)[1], ]
  
  seq(rd$starts, rd$ends)
  
}

Then you apply it to each bear. This is very convenient with dplyr :

library(dplyr)

bear.data %>%
  group_by(bears) %>%
  slice(seq_max(time))

Upvotes: 1

slava-kohut

Reputation: 4233

The problem can be thought of as finding the maximum length of blocks of boolean values per group:

bear.data$time <- as.Date(bear.data$time)
bear.data$not_na <- !is.na(bear.data$time)
bear.data$gap <- ave(bear.data$not_na, cumsum(!bear.data$not_na), FUN = cumsum)
aggregate(gap ~ bears, FUN = max, data=bear.data)

Output

> aggregate(gap ~ bears, FUN = max, data=bear.data)
  bears gap
1 bear1   2
2 bear2   3

Data

 bears                time
1  bear1 2007-09-08 13:00:00
2  bear1                  NA
3  bear1 2007-09-08 21:00:00
4  bear1  2007-09-09 1:00:00
5  bear1                  NA
6  bear2                  NA
7  bear2 2007-10-09 17:00:00
8  bear2 2007-10-09 17:00:00
9  bear2  2007-10-09 1:00:00
10 bear2                  NA
11 bear2  2007-10-09 9:00:00

Upvotes: 1

Jakub.Novotny

Reputation: 3047

bear.data <- data.frame(bears, time) %>%
  mutate(time = ymd_hms(time),
         helper = floor_date(time, unit = "year"),
         seq = rleid(helper)) %>%
  filter(!is.na(helper)) %>%
  group_by(bears, seq) %>%
  add_tally() %>% ungroup() %>%
  group_by(bears) %>%
  slice_max(n)

Upvotes: 1

R: finding the maximum number of rows between NA values

Answers (4)

Related Questions