MetabO

Reputation: 11

Factorizing a variable

I am trying to factorize a variable, which is basically a count of how many food species one has eaten. My code produces an error and I am not sure how to fix it.

Totals is a continuous variable with the food count:

Foods$count_quintile <- factor(Foods$Totals, levels = 1:5,
                                      labels = c("Q1", "Q2", "Q3", "Q4", "Q5"))

After I run this, count_quintile is still turning up empty.

Thoughts?

Upvotes: 0

Views: 54

Answers (2)

jay.sf

Reputation: 72593

factor() isn't quite the right function for this. You are probably looking for cut(), where you can specify the quintiles by passing quantile() (or any other set of cut points) to the breaks= argument.

Foods <- transform(Foods,
                   count_quintile=cut(Totals, 
                                      breaks=quantile(Totals, seq.int(0, 1, length.out=6)), 
                                      include.lowest=TRUE,
                                      labels=paste0('Q', 1:5)))

str(Foods$count_quintile)
# Factor w/ 5 levels "Q1","Q2","Q3",..: 5 3 4 3 1 1 5 5 3 1 ...

head(Foods)
#          foo Totals       bar count_quintile
# 1  1.3709584     17 0.5131505             Q5
# 2 -0.5646982     11 0.4687138             Q3
# 3  0.3631284     13 0.4058770             Q4
# 4  0.6328626     11 0.7304523             Q3
# 5  0.4042683      6 0.6039375             Q1
# 6 -0.1061245      6 0.8713164             Q1

We can do a cross-check:

with(Foods, tapply(Totals, count_quintile, max))
# Q1 Q2 Q3 Q4 Q5 
#  7  9 11 13 21 

with(Foods, quantile(Totals, seq.int(0, 1, length.out=6)))
# 0%  20%  40%  60%  80% 100% 
#  2    7    9   11   13   21 
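One caveat worth checking before relying on this approach: cut() requires unique breaks, so heavily tied count data can make the quantile() breaks collide. The following is a minimal sketch on a made-up vector (x and brks are hypothetical names, not part of the question's data) showing how to detect that situation:

```r
# Hypothetical data: 8 of 10 values are zero, so several quintile breaks coincide
x <- c(rep(0, 8), 1, 2)
brks <- quantile(x, seq.int(0, 1, length.out = 6))

# TRUE means at least two breaks are identical
has_dups <- anyDuplicated(brks) > 0
has_dups
# [1] TRUE

# cut() stops with an error on duplicated breaks; capture the message instead
res <- tryCatch(cut(x, breaks = brks, include.lowest = TRUE),
                error = function(e) conditionMessage(e))
res
```

If has_dups is TRUE for your Totals column, five equally populated bins are not possible with these breaks, and you would need to decide how to merge or redefine the bins.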

Data:

n <- 1000; set.seed(42); Foods <- data.frame(foo=rnorm(n), Totals=rpois(n, 10), bar=runif(n))

Upvotes: 0

Mark

Reputation: 12518

Short answer:

You need to round the Totals column data, so it fits into the bins of the factor variable you're trying to create.

Longer answer:

In your question you stated that Totals is a continuous variable with the food count. Here is some sample data that is continuous:

set.seed(0)

Foods <- data.frame(
    Totals = rnorm(7, 2.5, 1)
)

    Totals
1 3.762954
2 2.173767
3 3.829799
4 3.772429
5 2.914641
6 0.960050
7 1.571433

As you can see, every value is a non-integer between 0 and 5, and with this many decimal places it is very unlikely that any of them is exactly a whole number.

In factor(Foods$Totals, levels = 1:5, labels = c("Q1", "Q2", "Q3", "Q4", "Q5")), the levels = 1:5 part is saying that the input data is coded with the numbers 1 to 5; the first category is 1, the second is 2, third 3, etc.

So when factor() looks at the Totals column shown above and finds none of those level values in it, it returns NA for each row, creating a column of NAs.

To make it work (assuming that rounding Totals is appropriate for your data!), you can round to the nearest whole number, and then the code will work:
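To see this matching behaviour in isolation, here is a small sketch on made-up vectors (x, f_na and f_ok are hypothetical names used only for illustration):

```r
# Non-integer values never match the integer levels 1:5, so every entry is NA
x <- c(3.76, 2.17, 0.96)
f_na <- factor(x, levels = 1:5, labels = paste0("Q", 1:5))
all(is.na(f_na))
# [1] TRUE

# Values that equal a level exactly are coded with the corresponding label
f_ok <- factor(c(1, 5, 3), levels = 1:5, labels = paste0("Q", 1:5))
as.character(f_ok)
# [1] "Q1" "Q5" "Q3"
```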

Foods$count_quintile <- factor(round(Foods$Totals), levels = 1:5, labels = c("Q1", "Q2", "Q3", "Q4", "Q5"))

    Totals count_quintile
1 3.762954             Q4
2 2.173767             Q2
3 3.829799             Q4
4 3.772429             Q4
5 2.914641             Q3
6 0.960050             Q1
7 1.571433             Q2
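Keep in mind that rounding only helps for values that land on 1 to 5. As a sketch with made-up numbers (totals, rounded and f are hypothetical names), anything that rounds to 0 or to 6 and above still ends up as NA:

```r
# Hypothetical values chosen so that some round outside the levels 1:5
totals <- c(0.4, 1.2, 4.9, 5.6)
rounded <- round(totals)
rounded
# [1] 0 1 5 6

f <- factor(rounded, levels = 1:5, labels = paste0("Q", 1:5))
is.na(f)
# [1]  TRUE FALSE FALSE  TRUE
```

Note also that the labels here simply rename the rounded values 1 to 5; they are not quintiles computed from the distribution of Totals.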

Upvotes: 0
