Bruno Avila
Bruno Avila

Reputation: 296

Create ranges by accumulating values

I have the DF1:

KEY <- c(11,12,22,33,44,55,66,77,88,99,1010,1111,1212,1313,1414,1515,1616,1717,1818,1919,2020)
PRICE <- c(0,0,1,5,7,10,20,80,110,111,200,1000,2500,2799,3215,4999,7896,8968,58914,78422,96352)
DF1 <- data.frame(KEY,PRICE)

I want to group DF1 into ranges, accumulating the values of the two columns (count the KEY column and sum the PRICE column). This is the result I hope for:

INTERVAL <-c('0','UP_TO_10','UP_TO_100','UP_TO_1000','UP_TO_5000','UP_TO_10000','UP_TO_100000')
COUNT_KEY <-c(2,6,8,12,16,18,21)
SUM_PRICE <- c(0,23,123,1544,15057,31921,265609)
DF2 <- data.frame(INTERVAL,COUNT_KEY,SUM_PRICE)

How do I make this table?

Upvotes: 0

Views: 90

Answers (3)

StupidWolf
StupidWolf

Reputation: 47008

You have to manually define you boundaries first:

X = c(-Inf,0,10,100,1000,5000,10000,100000)

Then you use cut to assign to entries to your labels. And we first summarize the counts and total price within the intervals.

library(dplyr)

DF1 %>% 
mutate(LABELS = cut(DF1$PRICE,X,INTERVAL,include.lowest =TRUE)) %>%
group_by(LABELS) %>% 
summarise(COUNT_KEY=n(),SUM_PRICE=sum(PRICE)) 

    # A tibble: 7 x 3
  LABELS       COUNT_KEY SUM_PRICE
  <fct>            <int>     <dbl>
1 0                    2         0
2 UP_TO_10             4        23
3 UP_TO_100            2       100
4 UP_TO_1000           4      1421
5 UP_TO_5000           4     13513
6 UP_TO_10000          2     16864
7 UP_TO_100000         3    233688

This is close to what you want, except the sum_price and counts, should be cumulative. So this can be achieved by doing mutate_if(is.numeric,cumsum):

DF1 %>% 
mutate(LABELS = cut(DF1$PRICE,X,INTERVAL,include.lowest =TRUE)) %>% group_by(LABELS) %>% 
summarise(COUNT_KEY=n(),SUM_PRICE=sum(PRICE)) %>% 
mutate_if(is.numeric,cumsum)

To give:

    # A tibble: 7 x 3
  LABELS       COUNT_KEY SUM_PRICE
  <fct>            <int>     <dbl>
1 0                    2         0
2 UP_TO_10             6        23
3 UP_TO_100            8       123
4 UP_TO_1000          12      1544
5 UP_TO_5000          16     15057
6 UP_TO_10000         18     31921
7 UP_TO_100000        21    265609

Upvotes: 2

cjtexas
cjtexas

Reputation: 41

Okay, here's an all-in-one, tidy way to handle this using dplyr ;)

library(dplyr)

DF1 %>%
  mutate(                                 
    INTERVAL =
      factor(
        case_when(                          # create discrete variable 
          PRICE == 0      ~ '0',
          PRICE <= 10     ~ 'UP_TO_10',
          PRICE <= 100    ~ 'UP_TO_100',
          PRICE <= 1000   ~ 'UP_TO_1000',
          PRICE <= 5000   ~ 'UP_TO_5000',
          PRICE <= 10000  ~ 'UP_TO_10000',
          PRICE <= 100000 ~ 'UP_TO_100000'
        ),
        levels =                            # set the factor levels
          c(
            '0',
            'UP_TO_10',
            'UP_TO_100',
            'UP_TO_1000',
            'UP_TO_5000',
            'UP_TO_10000',
            'UP_TO_100000'
            )
        )
  ) %>% 
  group_by(INTERVAL) %>%                    # create desired group
  summarise(                                # and summary variables
    COUNT_KEY = n(),
    SUM_PRICE = sum(PRICE)
  ) %>%
  mutate(                                   # cumulative totals
    COUNT_KEY_CUM = cumsum(COUNT_KEY),
    SUM_PRICE_CUM = cumsum(SUM_PRICE)
  )

Upvotes: 1

Ben
Ben

Reputation: 30559

If you have a vector of limits or thresholds, such as:

LIMITS <- c(0, 10, 100, 1000, 5000, 10000, 100000)

You could obtain a count of rows where PRICE is below each limit:

unlist(lapply(LIMITS, function(x) sum(DF1$PRICE <= x)))
[1]  2  6  8 12 16 18 21

And a sum of these prices as well:

unlist(lapply(LIMITS, function(x) sum(DF1$PRICE[DF1$PRICE <= x])))
[1]      0     23    123   1544  15057  31921 265609

Is this what you had in mind?

This is everything all together:

LIMITS <- c(0, 10, 100, 1000, 5000, 10000, 100000)
COUNT_KEY <- unlist(lapply(LIMITS, function(x) sum(DF1$PRICE <= x)))
SUM_PRICE <- unlist(lapply(LIMITS, function(x) sum(DF1$PRICE[DF1$PRICE <= x])))
data.frame(INTERVAL = c(0, paste("UP_TO", LIMITS[-1], sep="_")), COUNT_KEY, SUM_PRICE)

      INTERVAL COUNT_KEY SUM_PRICE
1            0         2         0
2     UP_TO_10         6        23
3    UP_TO_100         8       123
4   UP_TO_1000        12      1544
5   UP_TO_5000        16     15057
6  UP_TO_10000        18     31921
7 UP_TO_100000        21    265609

Upvotes: 2

Related Questions