KVSEA

Reputation: 109

cut function produces uneven first break

I'm exploring the cut function and am trying to cut the following basic vector into 10 intervals. It works, but I'm confused as to why the first break occurs at -0.01 rather than 0:

test_vec <- 0:10
test_vec2 <- cut(test_vec, breaks = 10)
test_vec2

yields:

(-0.01,1] (-0.01,1] (1,2]     (2,3]     (3,4]     (4,5]     (5,6]     (6,7]     (7,8]     (8,9]    (9,10]

Why are there two instances of (-0.01,1], and why does the lowest interval not start at 0?

Upvotes: 3

Views: 455

Answers (1)

Ben Bolker

Reputation: 226057

tl;dr to get what you might want, you'll probably need to specify breaks explicitly, and include.lowest=TRUE:

cut(x,breaks=0:10,include.lowest=TRUE)
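As a quick check of what that call returns, here is a minimal sketch, assuming x is the 0:10 vector from the question (the name binned is only for illustration). Note that 0 and 1 still share the first interval, since 11 values are going into 10 bins; the difference is that the interval is now labelled [0,1].

```r
x <- 0:10

# explicit breaks at 0, 1, ..., 10; include.lowest = TRUE closes the
# first interval on the left so that the value 0 is not dropped as NA
binned <- cut(x, breaks = 0:10, include.lowest = TRUE)

levels(binned)      # first level is "[0,1]", last is "(9,10]"
as.integer(binned)  # bin index of each value: 1 1 2 3 4 5 6 7 8 9 10
```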

The issue is probably this, from the "Details" of ?cut:

When ‘breaks’ is specified as a single number, the range of the data is divided into ‘breaks’ pieces of equal length, and then the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals.

Since the data run from 0 to 10, the range is 10 and 0.1% of it is 0.01, so the outer limits become (-0.01, 10.01); as @Onyambu suggests, the results are asymmetric because the value at 0 lies on the left-hand boundary (not included) whereas the value at 10 lies on the right-hand boundary (included).
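One way to see the effect of the closed boundary is to flip it. This is only a sketch on the question's 0:10 vector (the names right_closed and left_closed are made up for the example): with the default right = TRUE, the value 1 falls into the first interval along with 0, while with right = FALSE the doubled-up interval moves to the top end instead.

```r
x <- 0:10

# default: intervals are closed on the right, so 1 joins 0 in the first bin
right_closed <- cut(x, breaks = 10)

# closed on the left instead: 1 starts the second bin, and 10 joins 9 in the last
left_closed <- cut(x, breaks = 10, right = FALSE)

as.integer(right_closed)  # 1 1 2 3 4 5 6 7 8 9 10
as.integer(left_closed)   # 1 2 3 4 5 6 7 8 9 10 10
```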

The (apparent) asymmetry is due to formatting; if you follow the code below (the core of base:::cut.default()), you'll see that the top break is actually at 10.01, but gets formatted as "10" because the default number of digits is 3 ...

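x <- 0:10
breaks <- 10
dig <- 3   # default value of dig.lab in cut()
nb <- as.integer(breaks+1)   # 10 intervals need 11 break points
dx <- diff(rx <- range(x, na.rm = TRUE))   # rx is c(0, 10), dx is 10
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)   # 0, 1, ..., 10
## move the outer limits away by 0.1% of the range: -0.01 and 10.01
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] +  dx/1000)
## format the breaks for the labels using 'dig' significant digits
ch.br <- formatC(0 + breaks, digits = dig, width = 1L)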
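To make the formatting point concrete, the sketch below rebuilds the same adjusted breaks and formats the top one with 3 and then 4 significant digits; it also passes dig.lab = 4 to cut(), which should make the hidden 10.01 show up in the last label. The variable names top_3, top_4 and lv are just for this example.

```r
x <- 0:10
nb <- 11L
rx <- range(x)
dx <- diff(rx)

# same construction as above: equal-width breaks, outer limits moved by dx/1000
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] + dx/1000)

top_3 <- formatC(0 + breaks, digits = 3, width = 1L)[nb]  # "10"
top_4 <- formatC(0 + breaks, digits = 4, width = 1L)[nb]  # "10.01"

# asking cut() for one more significant digit in the labels
lv <- levels(cut(x, breaks = 10, dig.lab = 4))
lv[c(1L, length(lv))]  # "(-0.01,1]" "(9,10.01]"
```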

Upvotes: 3
