prdel99

Reputation: 363

R - cut function duplicates values

I have this code:

library(dplyr)  # provides %>% and mutate()

bucket <- seq(0, 100000, by = 5000)

dt <-
  data.frame(sold_amount = bucket) %>% 
  mutate(bucket = cut(bucket, breaks = bucket, include.lowest = TRUE, dig.lab = 10))

If I execute it, the bucket [0,5000] is duplicated: it is assigned to both sold amount 0 and sold amount 5000. Without include.lowest = T, the bucket for amount 0 is NA. How can I get bin [0,5000] for sold amount 0 and (5000,10000] for sold amount 5000?
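For reference, the behaviour described above can be reproduced in base R alone. This is a minimal sketch, not part of the original post; the names default_bins and closed_bins are made up for illustration.

```r
bucket <- seq(0, 100000, by = 5000)

# Default: right-closed intervals (a, b] and the lowest break is excluded,
# so the value 0 falls outside every interval and becomes NA.
default_bins <- cut(bucket, breaks = bucket, dig.lab = 10)

# include.lowest = TRUE closes the first interval on the left, so both
# 0 and 5000 land in [0,5000].
closed_bins <- cut(bucket, breaks = bucket, include.lowest = TRUE, dig.lab = 10)

is.na(default_bins[1])          # TRUE
as.character(closed_bins[1:3])  # "[0,5000]" "[0,5000]" "(5000,10000]"
```

Because the intervals are right-closed, any value that sits exactly on a break belongs to the interval ending at that break, which is why 5000 shares a bin with 0.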

Upvotes: 0

Views: 139

Answers (4)

dash2

Reputation: 2262

An approach with my santoku package:

library(santoku)

dt$bucket <- chop_width(dt$sold_amount, 5000, labels = lbl_intervals("%d"))
dt
   sold_amount          bucket
1            0       [0, 5000)
2         5000   [5000, 10000)
3        10000  [10000, 15000)
4        15000  [15000, 20000)
5        20000  [20000, 25000)
6        25000  [25000, 30000)
7        30000  [30000, 35000)
8        35000  [35000, 40000)
9        40000  [40000, 45000)
10       45000  [45000, 50000)
11       50000  [50000, 55000)
12       55000  [55000, 60000)
13       60000  [60000, 65000)
14       65000  [65000, 70000)
15       70000  [70000, 75000)
16       75000  [75000, 80000)
17       80000  [80000, 85000)
18       85000  [85000, 90000)
19       90000  [90000, 95000)
20       95000 [95000, 100000)
21      100000        {100000}

Upvotes: 0

Pedro Alencar

Reputation: 1079

Maybe just remove the first row

dt <-
  data.frame(sold_amount = bucket) %>% 
  mutate(bucket = cut(bucket, breaks = bucket, include.lowest = TRUE, dig.lab = 10)) %>%
  .[-1, ]

dt  

Upvotes: 0

ThomasIsCoding

Reputation: 101044

Maybe this?

cut(bucket, breaks = c(bucket, Inf), include.lowest = TRUE, right = FALSE, dig.lab = 10)

such that

> dt <-
+   data.frame(sold_amount = bucket) %>%
+   mutate(bucket = cut(bucket, breaks = c(bucket, Inf), include.lowest = T, right = FALSE, dig.lab = .... [TRUNCATED]

> dt
   sold_amount         bucket
1            0       [0,5000)
2         5000   [5000,10000)
3        10000  [10000,15000)
4        15000  [15000,20000)
5        20000  [20000,25000)
6        25000  [25000,30000)
7        30000  [30000,35000)
8        35000  [35000,40000)
9        40000  [40000,45000)
10       45000  [45000,50000)
11       50000  [50000,55000)
12       55000  [55000,60000)
13       60000  [60000,65000)
14       65000  [65000,70000)
15       70000  [70000,75000)
16       75000  [75000,80000)
17       80000  [80000,85000)
18       85000  [85000,90000)
19       90000  [90000,95000)
20       95000 [95000,100000)
21      100000   [100000,Inf]
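A small base R check of why this works, offered as a sketch rather than as part of the answer (the name bins is made up for illustration). With right = FALSE the intervals are left-closed, [a, b), so every value that sits exactly on a break opens its own bin, and the extra Inf break gives 100000 a bin as well. With right = FALSE, include.lowest only closes the topmost interval on the right, so here it changes the last label from [100000,Inf) to [100000,Inf] and nothing else.

```r
bucket <- seq(0, 100000, by = 5000)

# Left-closed intervals [a, b); Inf is appended so the last value gets a bin.
bins <- cut(bucket, breaks = c(bucket, Inf), right = FALSE, dig.lab = 10)

anyNA(bins)                      # FALSE: every amount is assigned a bin
anyDuplicated(bins)              # 0: no two amounts share a bin
as.character(bins[c(1, 2, 21)])  # "[0,5000)" "[5000,10000)" "[100000,Inf)"
```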

Upvotes: 2

nevrome

Reputation: 1550

A pragmatic approach would be to just filter out the offending line:

library(tidyverse)

bucket <- seq(0, 100000, by = 5000)

dt <-
  data.frame(sold_amount = bucket) %>% 
  mutate(bucket = cut(bucket, breaks = bucket, include.lowest = T, dig.lab = 10)) %>%
  dplyr::filter(sold_amount != 0)

> dt
   sold_amount         bucket
1         5000       [0,5000]
2        10000   (5000,10000]
3        15000  (10000,15000]
4        20000  (15000,20000]
5        25000  (20000,25000]
6        30000  (25000,30000]
7        35000  (30000,35000]
8        40000  (35000,40000]
9        45000  (40000,45000]
10       50000  (45000,50000]
11       55000  (50000,55000]
12       60000  (55000,60000]
13       65000  (60000,65000]
14       70000  (65000,70000]
15       75000  (70000,75000]
16       80000  (75000,80000]
17       85000  (80000,85000]
18       90000  (85000,90000]
19       95000  (90000,95000]
20      100000 (95000,100000]

Upvotes: 0
