Gabe Wright
Gabe Wright

Reputation: 13

Sum values in R where values in column between two numbers

I'm trying to take a data.frame and aggregate the values of one column, sorted by values in other columns, and where the values in a final column are between certain criteria. In SQL I'd do a simple Group By and write a loop, but I'm just getting started in R and am having a hard time figuring out syntax. Basically I have a dataset that looks like this:

Type    Type2   Bucket  Value
   A    1       1       1
   A    2       1       2
   A    3       1       1
   A    4       1       3
   A    5       1       1
   A    1       2       1
   A    2       2       2
   A    3       2       1
   A    4       2       3

I want the output to be something like this:

Type    Type2   Bucket  Value
A       <4      1       4
A       >=4     1       4
A       <4      2       5
A       >=4     2       3

In my head, it's easy, but I'm coming from a SQL background, and trying to do it in R. I've messed with some functions like split and ddply with bits of success, but can't quite put it all together. Thanks.

Upvotes: 1

Views: 1518

Answers (1)

acylam
acylam

Reputation: 18681

You can do this with dplyr. Assuming you have more than one Type:

library(dplyr)

df %>%
  group_by(Type, Bucket, Type2 = ifelse(Type2 < 4, "<4", ">=4")) %>%
  summarize(Value = sum(Value)) %>%
  select(Type, Type2, Bucket, Value)

Result:

# A tibble: 4 x 4
# Groups:   Type, Bucket [2]
    Type Type2 Bucket Value
  <fctr> <chr>  <int> <int>
1      A    <4      1     4
2      A   >=4      1     4
3      A    <4      2     4
4      A   >=4      2     3

Since you mentioned you have SQL background, here is a sqldf solution for you:

library(sqldf)

sqldf("select Type, 
              case when Type2 < 4 then '<4' else '>=4' end as Type_2,
              Bucket, 
              sum(Value) as Value
          from df
          group by Type, Bucket, Type_2")

Result:

  Type Type_2 Bucket Value
1    A     <4      1     4
2    A    >=4      1     4
3    A     <4      2     4
4    A    >=4      2     3

Data:

df = structure(list(Type = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L), .Label = "A", class = "factor"), Type2 = c(1L, 2L, 3L, 
4L, 5L, 1L, 2L, 3L, 4L), Bucket = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L), Value = c(1L, 2L, 1L, 3L, 1L, 1L, 2L, 1L, 3L)), .Names = c("Type", 
"Type2", "Bucket", "Value"), class = "data.frame", row.names = c(NA, 
-9L))

Upvotes: 1

Related Questions