Reputation: 2663
Quick question. I am binning a variable in a number of different ways for exploratory data analysis. Let's say I have a variable called var in data.frame df.
df$var<-c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0)
So far, I've employed the following approaches (code below):
#Divide into quartiles
df$var_quartile <- with(df, cut(var, breaks=quantile(var, probs=seq(0,1, by=.25)), include.lowest=TRUE))
# Values of var_quartile
> [0,3],[0,3],(7.25,9],(7.25,9],(3,5],(3,5],(5,7.25],[0,3],(5,7.25],(7.25,9],[0,3],(3,5],(3,5],(5,7.25],(5,7.25],(7.25,9],(7.25,9],[0,3],[0,3],(3,5],(5,7.25],[0,3],[0,3],[0,3]
#Bin into 2 equal-width intervals
df$var_bin<- cut(df[['var']],2, include.lowest=TRUE, labels=1:2)
# Values of var_bin
> 1 1 2 2 1 2 2 1 2 2 1 1 2 2 2 2 2 1 1 1 2 1 1 1 2 2 2 1
The last thing that I'd like to do is bin the variable into sections of 10 observations after it has been sorted in ascending order. This is the same idea as splitting at the median (counting up to the middle observation), only I want to count in 10-observation increments.
Using my example, this would split var into the following sections:
0,1,1,2,2,2,3,3,3,3
4,4,4,5,5,6,6,6,6,7
7,8,8,8,9,9,9,9
N.B. -- I need to run this operation in very large datasets (usually 3-6 million observations in wide form).
How do I do this? Thanks!
Upvotes: 4
Views: 3818
Reputation: 518
I created groups of equal size without using cut.
# The divisor inside ceiling() is the group size: nrow(df) / number_of_groups_wanted.
# Dividing each rank by that group size and rounding up gives the group index.
# ties.method = "min" assigns every tied element the lowest rank of its tie group,
# so tied values always land in the same group (group sizes can then be unequal).
number_of_groups_wanted = 100 # put in the number of groups you want
df$group = ceiling(rank(df$var_to_group, ties.method = "min")/(nrow(df)/number_of_groups_wanted))
df$rank = rank(df$var_to_group, ties.method = "min") # this line is just used to check data
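For what it's worth, here is a minimal sketch of the same rank-and-ceiling idea applied to the vector from the question, using a fixed bin size of 10 rather than a number of groups. The names vec, bin_size and grp are just placeholders for this illustration. Because ties.method = "min" gives tied values the same rank, ties stay together and the bins are not guaranteed to hold exactly 10 observations.

```r
# Vector from the question
vec <- c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0)
bin_size <- 10  # target number of observations per bin (illustrative name)

# Tied values share the lowest rank of their tie group,
# so they always end up in the same bin
grp <- ceiling(rank(vec, ties.method = "min") / bin_size)

# Observations per bin: 10, 11 and 7 for this vector
table(grp)
```

If the bins have to contain exactly 10 observations each (apart from the last one), ties need to be broken some other way, for example with ties.method = "first".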
Upvotes: 1
Reputation: 162321
cut_number() from ggplot2 is designed to cut a numeric vector into intervals containing equal numbers of points. In your case, you might use it like so:
library(ggplot2)
split(df$var, cut_number(df$var, n=3, labels=1:3))
# $`1`
# [1] 1 2 3 3 2 3 1 2 3 0
#
# $`2`
# [1] 4 5 6 6 4 5 6 4 6
#
# $`3`
# [1] 8 9 9 7 8 9 7 8 9
Upvotes: 8
Reputation: 81683
vec <- c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0) # your vector
nObs <- 10 # number of observations per bin
# create data labels
datLabels <- ceiling(seq_along(vec)/nObs)[rank(vec, ties.method = "first")]
# test data labels:
split(vec, datLabels)
$`1`
[1] 1 2 3 3 2 3 1 2 3 0
$`2`
[1] 4 5 6 6 4 5 6 7 4 6
$`3`
[1] 8 9 9 8 9 7 8 9
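Since the question mentions several million rows, a variant that might be worth trying is to sort once with order() and write the bin numbers back through the resulting index. This is only a sketch under the assumption that ties should be broken by original position (the same result as ties.method = "first" above); ord and bin are illustrative names, and I have not benchmarked it on data of that size.

```r
vec <- c(1,2,8,9,4,5,6,3,6,9,3,4,5,6,7,8,9,2,3,4,6,1,2,3,7,8,9,0)
nObs <- 10L  # number of observations per bin

# Positions of vec in ascending order; ties keep their original order
ord <- order(vec)

# The i-th smallest value gets bin number ceiling(i / nObs)
bin <- integer(length(vec))
bin[ord] <- (seq_along(vec) - 1L) %/% nObs + 1L

# Bin sizes: 10, 10 and 8 for this vector
table(bin)
```

The labels come out in the original order of vec, so bin can be stored directly as a new column next to the data.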
Upvotes: 4
Reputation: 4208
This should do it.
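# One bin per 10 observations, rounded up; note that cut() with a single
# number of breaks makes equal-width intervals, so bin counts can differ
n_bins <- ceiling(length(df$var) / 10)
df$var_bin <- cut(df[['var']], breaks = n_bins,
include.lowest=TRUE, labels=seq_len(n_bins))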
Upvotes: 0
Reputation: 1816
Do you mean something like this?
x <- sample(100)
binSize <- 10
table(floor(x/binSize)*binSize)
Upvotes: 1