Reputation: 23
I want to count the number of rows before and including the first non-zero value per site x date. I have managed to import and sort the data, and can return the value of the first non-zero row per site x date, but I cannot calculate the number of rows up to and including that first non-zero row. Ecologically, this analysis is trying to determine how many surveys one would need to do (site x date) to record our focal species (values).
I have tried to use the tidyverse/dplyr environment to do this, trying summarise() and n(), with little success. Any pointers would be appreciated.
Below is an example of data that I have been trying to write this code for:
test_df <- structure(list(site = c("a", "a", "a", "a", "a", "a",
                                   "b", "b", "b", "b", "b", "b",
                                   "c", "c", "c", "c", "c", "c"),
                          Date = structure(c(17167, 17198, 17226, 17257, 17287,
                                             17318, 17167, 17198, 17226, 17257,
                                             17287, 17318, 17167, 17198,
                                             17226, 17257, 17287, 17318),
                                           class = "Date"),
                          values = c(0, 0, 0, 3, 4, 5, 10, 11, 12, 13, 14, 15,
                                     0, 0, 0, 0, 45, 50)),
                     row.names = c(NA, -18L), class = "data.frame",
                     .Names = c("site", "Date", "values"))
This is the code to return the value of the first non-zero row (by site x date):
library(dplyr)
library(tidyr)  # for complete()

test_df %>%
  # Convert site to factor, so we can use complete() later.
  # We do this within group_by, because we want to operate by level of site
  group_by(site = factor(site)) %>%
  # Remove all rows of a site if it has no rows with values == 0
  filter(any(values == 0)) %>%
  # Remove all rows with values == 0
  filter(values != 0) %>%
  # Keep the first row of each site, after sorting by date
  # This gives us the first non-zero row
  arrange(Date) %>%
  slice(1) %>%
  # Use complete() to bring back a row for any level of site that
  # didn't have any rows with values == 0
  ungroup() %>%
  complete(site)
Instead of the resulting table looking like this:
# A tibble: 3 x 3
site Date values
<fct> <date> <dbl>
1 a 2017-04-01 3
2 b NA NA
3 c 2017-05-01 45
I want it to return a table whose values column gives the number of rows before and including the first non-zero row, rather than the value of that row as in the table above. That is, for site 'a' we had to survey 4 months (rows) to record our focal species for the first time, site 'b' recorded the focal species during the 1st survey, and site 'c' recorded it on the 5th survey:
# A tibble: 3 x 3
site Date values
<fct> <date> <dbl>
1 a 2017-04-01 4
2 b 2017-01-01 1
3 c 2017-05-01 5
Upvotes: 1
Views: 313
Reputation: 40141
Another dplyr possibility:
test_df %>%
  group_by(site) %>%
  mutate(val = ifelse((values != 0 & lag(values, default = 0) == 0) | values == 0, 1, 0)) %>%
  summarise(Date = first(Date[values != 0]),
            values = sum(val))
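One caveat: the sum above counts every zero in a group, so it relies on there being no zeros after the first detection (which holds for the example data). A minimal sketch of a position-based alternative, assuming rows are already in date order within each site; first_detection is a hypothetical helper name, not a dplyr function, and with_gap is made-up data for illustration:

```r
# Hypothetical helper: position of the first non-zero element
# (NA if the vector contains no non-zero value).
first_detection <- function(x) which(x != 0)[1]

# Values for site 'a' from the question
site_a <- c(0, 0, 0, 3, 4, 5)
# Made-up series with a zero after the first detection
with_gap <- c(0, 3, 0, 4)

first_detection(site_a)    # 4
first_detection(with_gap)  # 2
```

Inside the pipeline, this could replace the mutate() step, e.g. summarise(Date = Date[first_detection(values)], values = first_detection(values)).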
Upvotes: 0
Reputation: 12084
A bit more verbose than Jaap's answer. First, I define a function that counts leading zeroes and adds one. It uses the rle (Run Length Encoding) function.
count0 <- function(x){
  tmp <- rle(x)
  ifelse(!tmp$values[1], tmp$lengths[1] + 1, 1)
}
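To see why this works, it may help to look at what rle returns for the values of site 'a' from the question. This is a small illustration in base R only; the names runs and leading_zeros are chosen here for the example:

```r
# Run-length encoding of the values recorded at site 'a'
runs <- rle(c(0, 0, 0, 3, 4, 5))

runs$values   # 0 3 4 5  (the value of each run)
runs$lengths  # 3 1 1 1  (how many times in a row each value occurs)

# The first run is the run of leading zeros (length 3), so the first
# non-zero value sits at position 3 + 1 = 4. If the series does not
# start with a zero, there are no leading zeros.
leading_zeros <- if (runs$values[1] == 0) runs$lengths[1] else 0
leading_zeros + 1  # 4
```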
Here, I find the date of the first non-zero element, then I apply count0 to count leading zeroes.
test_df %>%
  group_by(site) %>%
  summarise(Date = Date[(values > 0)][1],
            values = count0(values))
This gives the required output.
# # A tibble: 3 x 3
# site Date values
# <chr> <date> <dbl>
# 1 a 2017-04-01 4
# 2 b 2017-01-01 1
# 3 c 2017-05-01 5
Upvotes: 0
Reputation: 83255
Using:
test_df %>%
  group_by(site) %>%
  mutate(n = row_number()) %>%
  filter(values != 0) %>%
  slice(1)
gives:
# A tibble: 3 x 4
# Groups: site [3]
site Date values n
<chr> <date> <dbl> <int>
1 a 2017-04-01 3 4
2 b 2017-01-01 10 1
3 c 2017-05-01 45 5
Upvotes: 2