Reputation: 2663

Binning variable by set number of observations

Quick question. I am binning a variable in a number of different ways for exploratory data analysis. Let's say I have a variable called var in data.frame df.

df <- data.frame(var = c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0))

So far, I've employed the following approaches (code below):

#Divide into quartiles
df$var_quartile <- with(df, cut(var, breaks=quantile(var, probs=seq(0,1, by=.25)), include.lowest=TRUE))
# Values of var_quartile
> [0,3],[0,3],(7.25,9],(7.25,9],(3,5],(3,5],(5,7.25],[0,3],(5,7.25],(7.25,9],[0,3],(3,5],(3,5],(5,7.25],(5,7.25],(7.25,9],(7.25,9],[0,3],[0,3],(3,5],(5,7.25],[0,3],[0,3],[0,3]

# Bin into 2 equal-width intervals
df$var_bin<- cut(df[['var']],2, include.lowest=TRUE, labels=1:2)
# Values of var_bin
> 1 1 2 2 1 2 2 1 2 2 1 1 2 2 2 2 2 1 1 1 2 1 1 1 2 2 2 1

The last thing I'd like to do is bin the variable into sections of 10 observations after it has been sorted in ascending order. This is the same idea as splitting at the median (counting up to the middle observation), except that I want to count in 10-observation increments.

Using my example, this would split var into the following sections:

0,1,1,2,2,2,3,3,3,3
4,4,4,5,5,6,6,6,6,7
7,8,8,8,9,9,9,9

N.B. -- I need to run this operation on very large datasets (usually 3-6 million observations in wide form).

How do I do this? Thanks!

Upvotes: 4

Answers (5)

Sam

Reputation: 518

I created groups of equal size without using cut.

# number_of_groups_wanted = nrow(df) / divisor, so the divisor passed to
# ceiling() is nrow(df) / number_of_groups_wanted.
# ties.method = "min" assigns every tied element the lowest rank of the tie.
number_of_groups_wanted <- 100 # set the number of groups you want
df$group <- ceiling(rank(df$var_to_group, ties.method = "min") / (nrow(df) / number_of_groups_wanted))

df$rank <- rank(df$var_to_group, ties.method = "min") # only used to check the data
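As a quick check, here is a minimal sketch of the same idea applied to the vector from the question. The column name var and the choice of 3 groups are assumptions made for illustration. Because ties.method = "min" gives tied values the same rank, tied values always land in the same group, so group sizes are only approximately equal in general.

```r
# Data from the question, stored in a column named var (assumed name)
df <- data.frame(var = c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0))

number_of_groups_wanted <- 3                       # assumed for illustration
divisor <- nrow(df) / number_of_groups_wanted      # observations per group

# Tied values share the lowest rank of the tie, so they share a group
df$group <- ceiling(rank(df$var, ties.method = "min") / divisor)

# Count how many observations ended up in each group
group_sizes <- table(df$group)
print(group_sizes)
```

If strictly equal-sized groups matter more than keeping ties together, ties.method = "first" could be used instead.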

Upvotes: 1

Josh O'Brien

Reputation: 162321

cut_number() from ggplot2 is designed to cut a numeric vector into intervals containing equal numbers of points. In your case, you might use it like so:

library(ggplot2)
split(df$var, cut_number(df$var, n=3, labels=1:3))
# $`1`
#  [1] 1 2 3 3 2 3 1 2 3 0
# 
# $`2`
# [1] 4 5 6 6 4 5 6 4 6
# 
# $`3`
# [1] 8 9 9 7 8 9 7 8 9

Upvotes: 8

Sven Hohenstein

Reputation: 81683

vec <- c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0) # your vector

nObs <- 10 # number of observations per bin

# create data labels
datLabels <- ceiling(seq_along(vec)/nObs)[rank(vec, ties.method = "first")] 


# test data labels:
split(vec, datLabels)

$`1`
 [1] 1 2 3 3 2 3 1 2 3 0

$`2`
 [1] 4 5 6 6 4 5 6 7 4 6

$`3`
 [1] 8 9 9 8 9 7 8 9
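To use this on a data frame column, the labelling step can be wrapped in a small function. The sketch below is one way to do that; the helper name bin_by_count and the column name var_bin10 are hypothetical, not part of the original answer. It relies on rank() with ties.method = "first" returning a unique position for every element.

```r
# Hypothetical helper: label each element with the index of its bin,
# where each bin holds n_obs observations of the sorted data
bin_by_count <- function(x, n_obs) {
  # ties.method = "first" gives unique ranks 1..length(x), even with ties
  ceiling(rank(x, ties.method = "first") / n_obs)
}

# Data from the question
df <- data.frame(var = c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0))

# Add the bin label as a new column (assumed column name)
df$var_bin10 <- bin_by_count(df$var, 10)

# Bins 1 and 2 hold 10 observations each, bin 3 holds the remaining 8
bin_sizes <- table(df$var_bin10)
print(bin_sizes)
```

The labels stay aligned with the original row order, so no sorting of the data frame itself is needed.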

Upvotes: 4

Green Demon

Reputation: 4208

This should do it.

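# Size() is not a base R function; length() is presumably what was meant.
# Note: an integer `breaks` gives equal-width intervals, not equal-count bins.
df$var_bin <- cut(df[['var']], breaks = ceiling(length(df$var) / 10),
                  include.lowest = TRUE, labels = FALSE)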

Upvotes: 0

Jetse

Reputation: 1816

Do you mean something like this?

x <- sample(100)
binSize <- 10
table(floor(x/binSize)*binSize)
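Note that this groups values into intervals of fixed width rather than into groups with a fixed number of observations. A small sketch on the vector from the question shows the effect; the width of 3 is an assumption chosen for illustration only.

```r
# Data from the question
vec <- c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0)

bin_width <- 3  # assumed interval width, for illustration only

# Each value is mapped to the lower edge of its width-3 interval,
# then the number of observations per interval is counted
width_counts <- table(floor(vec / bin_width) * bin_width)
print(width_counts)
```

The intervals starting at 0, 3, 6 and 9 contain 6, 9, 9 and 4 observations respectively, so the bin counts depend on how the values are distributed.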

Upvotes: 1
