roody

Reputation: 2663

Binning variable by set number of observations

Quick question. I am binning a variable in a number of different ways for exploratory data analysis. Let's say I have a variable called var in data.frame df.

df <- data.frame(var = c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0))

So far, I've employed the following approaches (code below):

#Divide into quartiles
df$var_quartile <- with(df, cut(var, breaks=quantile(var, probs=seq(0,1, by=.25)), include.lowest=TRUE))
# Values of var_quartile
> [0,3],[0,3],(7.25,9],(7.25,9],(3,5],(3,5],(5,7.25],[0,3],(5,7.25],(7.25,9],[0,3],(3,5],(3,5],(5,7.25],(5,7.25],(7.25,9],(7.25,9],[0,3],[0,3],(3,5],(5,7.25],[0,3],[0,3],[0,3],(5,7.25],(7.25,9],(7.25,9],[0,3]

#Bin into 2 equal-width intervals
df$var_bin<- cut(df[['var']],2, include.lowest=TRUE, labels=1:2)
# Values of var_bin
> 1 1 2 2 1 2 2 1 2 2 1 1 2 2 2 2 2 1 1 1 2 1 1 1 2 2 2 1

The last thing that I'd like to do is bin the variable into sections of 10 observations after it has been sorted in ascending order. This is the same idea as splitting at the median (counting up to the middle observation), except that I want to count in 10-observation increments.

Using my example, this would split var into the following sections:

0,1,1,2,2,2,3,3,3,3
4,4,4,5,5,6,6,6,6,7
7,8,8,8,9,9,9,9

N.B. -- I need to run this operation in very large datasets (usually 3-6 million observations in wide form).

How do I do this? Thanks!

Upvotes: 4

Views: 3818

Answers (5)

Sam

Reputation: 518

I created groups of equal size without using cut.

# The divisor inside ceiling() is the group size:
#   group size = nrow(df) / number_of_groups_wanted
# ties.method = "min" gives every tied element the lowest rank in its tie,
# so tied values always fall into the same group.
number_of_groups_wanted = 100 # put in the number of groups you want
df$group = ceiling(rank(df$var_to_group, ties.method = "min")/(nrow(df)/number_of_groups_wanted)) 

df$rank = rank(df$var_to_group, ties.method = "min") # this line is just used to check data  
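
To relate this to the question, here is a minimal sketch on the question's 28 values. It assumes the group size (10) is known and is used directly as the divisor; `group_size`, `group_min` and `group_first` are names made up for the illustration. It also shows how the choice of `ties.method` changes the group sizes.

```r
# Sketch only: the question's 28 values, with an assumed group size of 10.
df <- data.frame(var_to_group = c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,
                                  6,1,2,3,7,8,9,0))
group_size <- 10  # assumption: about 10 observations per group

# "min": tied values share the lowest rank, so they stay in one group.
df$group_min <- ceiling(rank(df$var_to_group, ties.method = "min") / group_size)

# "first": ties are broken by position, so every rank 1..n is used exactly once.
df$group_first <- ceiling(rank(df$var_to_group, ties.method = "first") / group_size)

table(df$group_min)    # group sizes: 10, 11, 7
table(df$group_first)  # group sizes: 10, 10, 8
```

If the groups must hold exactly 10 observations each (apart from the last), "first" is probably the better fit. "min" keeps equal values together at the cost of uneven groups.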

Upvotes: 1

Josh O'Brien

Reputation: 162321

cut_number() from ggplot2 is designed to cut a numeric vector into intervals containing equal numbers of points. In your case, you might use it like so:

library(ggplot2)
split(df$var, cut_number(df$var, n=3, labels=1:3))
# $`1`
#  [1] 1 2 3 3 2 3 1 2 3 0
# 
# $`2`
# [1] 4 5 6 6 4 5 6 4 6
# 
# $`3`
# [1] 8 9 9 7 8 9 7 8 9
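
Note that this approach takes the number of bins rather than the bin size, so for bins of roughly 10 observations the bin count has to be derived from the length of the vector. The sketch below does that in base R, under the assumption that cut_number() places its breaks at equally spaced quantiles. `bin_size` and `n_bins` are illustrative names, not part of any package.

```r
# Base-R sketch; assumes cut_number() puts its breaks at equally spaced quantiles.
x <- c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0)
bin_size <- 10                            # desired observations per bin
n_bins <- ceiling(length(x) / bin_size)   # 3 bins for 28 values

# Breaks at the 0, 1/3, 2/3 and 1 quantiles.
breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1), names = FALSE)
bins <- cut(x, breaks = breaks, include.lowest = TRUE, labels = seq_len(n_bins))

table(bins)  # bin sizes: 10, 9, 9
```

The bins come out as 10, 9, 9 rather than 10, 10, 8, because quantile breaks never split equal values across bins. With heavily tied data the quantile breaks can coincide, in which case cut() stops with an error saying the breaks are not unique.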

Upvotes: 8

Sven Hohenstein

Reputation: 81683

vec <- c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0) # your vector

nObs <- 10 # number of observations per bin

# create data labels
datLabels <- ceiling(seq_along(vec)/nObs)[rank(vec, ties.method = "first")] 


# test data labels:
split(vec, datLabels)

$`1`
 [1] 1 2 3 3 2 3 1 2 3 0

$`2`
 [1] 4 5 6 6 4 5 6 7 4 6

$`3`
 [1] 8 9 9 8 9 7 8 9
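
For anyone puzzling over the one-liner, here is a step-by-step sketch of the same idea, plus an equivalent formulation using order() that may be worth benchmarking on very large vectors (I have not timed it). `binOfRank` and `viaOrder` are names introduced only for this illustration.

```r
vec <- c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0)
nObs <- 10

# Bin label for each sorted position: 1 x10, 2 x10, 3 x8.
binOfRank <- ceiling(seq_along(vec) / nObs)

# rank(..., "first") gives each element its position in the sorted vector,
# so indexing binOfRank by it looks up that element's bin.
datLabels <- binOfRank[rank(vec, ties.method = "first")]

# Equivalent: write the labels back into the original positions via order().
viaOrder <- numeric(length(vec))
viaOrder[order(vec)] <- binOfRank

all(datLabels == viaOrder)  # TRUE
table(datLabels)            # bin sizes: 10, 10, 8
```

Because ties are broken by position, equal values can end up in different bins (here one 7 lands in bin 2 and the other in bin 3), which is what gives exactly 10 observations per bin.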

Upvotes: 4

Green Demon

Reputation: 4208

This should do it.

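n_bins <- ceiling(length(df$var) / 10)  # about 10 observations per bin on average
# An integer `breaks` gives equal-width intervals, so bin counts can differ from 10.
df$var_bin <- cut(df[['var']], breaks = n_bins,
                  include.lowest = TRUE, labels = seq_len(n_bins))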

Upvotes: 0

Jetse

Reputation: 1816

Do you mean something like this?

x <- sample(100)
binSize <- 10
table(floor(x/binSize)*binSize)

Upvotes: 1
