Andreas

Reputation: 6728

Is there a canonical 'correct' way to make calculations based on factor levels?

OK, so I've read the question "Confusion between factor levels and factor labels", but I still feel like I am missing a lot. So this is maybe not a question per se, more a presentation of my frustration.

Sample data

sample <- dput(structure(list(Logistik_1 = structure(c(3L, 2L, 3L, 3L, 3L, 4L), .Label = c("I meget ringe grad", "I ringe grad", "I nogen grad", "I høj grad", "I meget høj grad"), class = "factor"),
                              Logistik_2 = structure(c(4L, 4L, 4L, 3L, 3L, 4L), .Label = c("I meget ringe grad", "I ringe grad", "I nogen grad", "I høj grad", "I meget høj grad"), class = "factor"),
                              Logistik_3 = structure(c(3L, 4L, 3L, 4L, 3L, 4L), .Label = c("I meget ringe grad", "I ringe grad", "I nogen grad", "I høj grad", "I meget høj grad"), class = "factor"),
                              Logistik_4 = structure(c(4L, 2L, 3L, 4L, 2L, 3L), .Label = c("I meget ringe grad", "I ringe grad", "I nogen grad", "I høj grad", "I meget høj grad"), class = "factor")),
                         .Names = c("Logistik_1","Logistik_2", "Logistik_3", "Logistik_4"), row.names = c(NA, 6L), class = "data.frame"))

The output of sample shows me the labels.

    Logistik_1   Logistik_2   Logistik_3   Logistik_4
1 I nogen grad   I høj grad I nogen grad   I høj grad
2 I ringe grad   I høj grad   I høj grad I ringe grad
3 I nogen grad   I høj grad I nogen grad I nogen grad
4 I nogen grad I nogen grad   I høj grad   I høj grad
5 I nogen grad I nogen grad I nogen grad I ringe grad
6   I høj grad   I høj grad   I høj grad I nogen grad

I cannot make calculations with these nominal data. Calling rowSums(sample) gives:

Error in rowSums(sample) : 'x' must be numeric

I can convert every single variable to numeric. E.g. if I want to add all the integer values I can do this: sample$test <- as.numeric(sample[[1]])+as.numeric(sample[[2]])+as.numeric(sample[[3]])+as.numeric(sample[[4]]) which works. But that is a lot of typing, I think?

However, if I cbind the columns, the output returns the levels. Output of with(sample, cbind(Logistik_1, Logistik_2)):

     Logistik_1 Logistik_2
[1,]          3          4
[2,]          2          4
[3,]          3          4
[4,]          3          3
[5,]          3          3
[6,]          4          4

And I can make calculations on these levels. E.g. if I want to add all the integer values I can do this: sample$total_score <- with(sample, rowSums(cbind(Logistik_1, Logistik_2, Logistik_3, Logistik_4))) [a]

    Logistik_1   Logistik_2   Logistik_3   Logistik_4 total_score
1 I nogen grad   I høj grad I nogen grad   I høj grad          14
2 I ringe grad   I høj grad   I høj grad I ringe grad          12
3 I nogen grad   I høj grad I nogen grad I nogen grad          13
4 I nogen grad I nogen grad   I høj grad   I høj grad          14
5 I nogen grad I nogen grad I nogen grad I ringe grad          11
6   I høj grad   I høj grad   I høj grad I nogen grad          15

But I am confused, and think I am making something simple too complicated. Is there a canonical 'correct' way to make calculations on factor levels? Is as.numeric more correct than cbind? And why does cbind work like this to begin with?

My hope was that something like this would work: sum(as.numeric(sample[1:4])) - but that returns Error: (list) object cannot be coerced to type 'double' (because I am calling as.numeric on a data frame).

[a] I am aware that most statisticians will frown upon the common practice of assigning integer values to survey responses (e.g. "Highly agree" = 5, "Agree somewhat" = 4 etc.) - but please just accept that's how we do it in the social sciences :-). The labels are responses in a survey and the levels are the integer values assigned to those responses.

Upvotes: 2

Views: 235

Answers (3)

IRTFM

Reputation: 263372

The other respondents have clearly laid out the case against doing arithmetic on factors, but if such coercion were meaningful (say, by having some ordinal interpretation), then this code, which coerces to a matrix, would be reasonably compact:

> rowSums(data.matrix(sample))
 1  2  3  4  5  6 
14 12 13 14 11 15 

It would not alter the value of sample. BTW, there is a very useful function named sample, so it would be better to avoid using that particular name in your code.
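As a minimal sketch of the above, assuming a small made-up data frame called survey (a hypothetical name, chosen so it does not mask the sample function) with the same five-level scale as in the question: data.matrix replaces each factor column by its internal integer codes, and rowSums then adds them per row.

```r
# Same five-level scale as in the question, lowest to highest
lvls <- c("I meget ringe grad", "I ringe grad", "I nogen grad",
          "I høj grad", "I meget høj grad")

# 'survey' is a made-up example data frame, not the asker's full data
survey <- data.frame(
  Logistik_1 = factor(lvls[c(3, 2, 3)], levels = lvls),
  Logistik_2 = factor(lvls[c(4, 4, 4)], levels = lvls)
)

codes <- data.matrix(survey)    # integer matrix of the factor codes
row_totals <- rowSums(codes)    # one total per respondent

print(codes)
print(row_totals)
```

The columns of survey are still factors afterwards; only the copy returned by data.matrix holds integers.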

Upvotes: 4

Alexander Hanysz

Reputation: 801

The theory is that if you're storing something as a factor, then you don't want to do calculations on it! What does it mean to add the numbers? Why should "Highly agree"+"Neither agree nor disagree" equal 8?


Instead of

sample$total_score <-with(sample, rowSums(cbind(Logistik_1, Logistik_2, Logistik_3, Logistik_4)))

you might prefer to use something like

sample$total_score <- sapply(1:nrow(sample),function(n) sum(as.numeric(sample[n,])))

so that you don't have to type the names of all the columns.
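A column-wise variant of the same idea, as a rough sketch: convert each item column with as.integer and add the resulting vectors with Reduce. The data frame name survey and the selection of columns by the "Logistik_" prefix are assumptions for illustration. Selecting the item columns explicitly keeps a previously added total_score column out of the sum if the line is run twice.

```r
lvls <- c("I meget ringe grad", "I ringe grad", "I nogen grad",
          "I høj grad", "I meget høj grad")

# 'survey' is a made-up example data frame
survey <- data.frame(
  Logistik_1 = factor(lvls[c(3, 2)], levels = lvls),
  Logistik_2 = factor(lvls[c(4, 4)], levels = lvls),
  Logistik_3 = factor(lvls[c(3, 5)], levels = lvls)
)

# Pick the item columns by name so other columns are ignored
item_cols <- grep("^Logistik_", names(survey), value = TRUE)

# Turn each factor into its integer codes, then add the vectors element-wise
survey$total_score <- Reduce(`+`, lapply(survey[item_cols], as.integer))

print(survey$total_score)
```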

Upvotes: 3

Hong Ooi

Reputation: 57686

The fact that you can convert factor variables to integer isn't something you should consider as useful for analytical purposes. R stores factors internally as integers, with each number corresponding to a different level: this is simply more efficient than replicating the factor labels for every observation. But those numbers don't necessarily correspond to anything that makes sense in the outside world, and by default they're assigned simply by sorting the labels in alphabetical order.

So yes, you can do arithmetic on factors by converting them to integers. That doesn't mean you should do it. If you want to analyse ordinal data like Likert scales, use functions designed for the purpose.
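To illustrate the point about alphabetical ordering, here is a small sketch with three made-up responses from the question's scale. Built with the defaults, the factor gets its codes from the sorted labels, which does not match the intended scale. Passing the levels explicitly (and, as one option, ordered = TRUE) makes the codes follow the scale.

```r
lvls <- c("I meget ringe grad", "I ringe grad", "I nogen grad",
          "I høj grad", "I meget høj grad")

# Three made-up responses, for illustration only
responses <- c("I nogen grad", "I ringe grad", "I meget ringe grad")

# Default: levels are the sorted unique labels
f_default <- factor(responses)
print(levels(f_default))
print(as.integer(f_default))   # codes follow the sorted labels

# Explicit levels: codes follow the intended scale
f_scale <- factor(responses, levels = lvls, ordered = TRUE)
print(as.integer(f_scale))
```

With the default, "I ringe grad" gets a higher code than "I nogen grad", which is the reverse of the intended order.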

Upvotes: 4
