Reputation: 23
New to r and I'm having difficulty getting the counts I'm after. I have a dataset that contains several columns of various counts per year. Here is an example:
huc_code_eight | year | count_1 | count_2 |
---|---|---|---|
6010105 | 1946 | 4 | 4 |
6010105 | 1947 | 6 | 0 |
6010105 | 1948 | 2 | 0 |
6010105 | 1957 | 4 | 4 |
6020001 | 1957 | 2 | 0 |
8010203 | 1957 | 0 | 0 |
I want to aggregate these counts based upon consecutive years, grouped by huc_code_eight
. The expected output would look like:
huc_code_eight | year | count_1 | count_2 |
---|---|---|---|
6010105 | 1946 - 1948 | 12 | 4 |
6010105 | 1957 | 4 | 4 |
6020001 | 1957 | 2 | 0 |
8010203 | 1957 | 0 | 0 |
I would like to avoid iterating through the data and summing these manually, but, though I've found many examples of aggregating in r, I've been unable to successfully refactor them to fit my use case.
Any help would be greatly appreciated!
Upvotes: 2
Views: 62
Reputation: 24722
Here is a data.table approach
dat <- setDT(dat)[, yr:= year-shift(year),by=huc_code_eight][is.na(yr), yr:=1][,grp:=rleid(huc_code_eight,yr)]
dat[,.(
year = fifelse(.N>1,paste0(min(year),"-",max(year)),paste0(year, collapse="")),
count_1=sum(count_1),count_2=sum(count_2)),
by=.(grp,huc_code_eight)][,grp:=NULL][]
Output:
huc_code_eight year count_1 count_2
1: 6010105 1946-1948 12 4
2: 6010105 1957 4 4
3: 6020001 1957 2 0
4: 8010203 1957 0 0
Upvotes: 2
Reputation: 887078
We can create a grouping column based on diff
erence of adjacent elements in 'year' along with 'huc_code_eight' and then summarise
library(dplyr)
library(stringr)
df1 %>%
group_by(huc_code_eight) %>%
mutate(year_grp = cumsum(c(TRUE, diff(year) != 1))) %>%
group_by(year_grp, .add = TRUE) %>%
summarise(year = if(n() > 1)
str_c(range(year), collapse = ' - ') else as.character(year),
across(starts_with('count'), sum, na.rm = TRUE), .groups = 'drop') %>%
dplyr::select(-year_grp)
-output
# A tibble: 4 × 4
huc_code_eight year count_1 count_2
<int> <chr> <int> <int>
1 6010105 1946 - 1948 12 4
2 6010105 1957 4 4
3 6020001 1957 2 0
4 8010203 1957 0 0
df1 <- structure(list(huc_code_eight = c(6010105L, 6010105L, 6010105L,
6010105L, 6020001L, 8010203L), year = c(1946L, 1947L, 1948L,
1957L, 1957L, 1957L), count_1 = c(4L, 6L, 2L, 4L, 2L, 0L), count_2 = c(4L,
0L, 0L, 4L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Upvotes: 0