ddaya11

Reputation: 3

How to display the factor names in my descriptive statistics instead of the numbers in R

In the DataAnalyst data (from Kaggle), I am trying to show descriptive statistics of Rating (a numeric variable) by state (a categorical factor). Everything displays correctly except the state names, which show up as numbers:

m<-aggregate(Rating~state, data=df,mean)
sd<-aggregate(Rating~state, data=df,sd)
n<-aggregate(Rating~state, data=df,length)
##summary descriptive table
(df.des <- cbind(n[,1], n=n[,2], mean=m[,2], sd=round(sd[,2],3),se=round(sd[,2]/sqrt(n[,2]),3)))

For df.des, I understand that n[,1] selects the column to display by its number. I have also tried n[,2], which brings up the count per state. How can I get the table to display the state names and not the numbers? P.S. "State" is stored as characters (e.g. CA, NY, IL), not numbers.

descriptive statistics with numbers instead of state categories

What n looks like

Upvotes: 0

Views: 191

Answers (2)

akrun

Reputation: 887621

When none of its arguments is a data frame, cbind uses its default method and returns a matrix instead of dispatching to cbind.data.frame, and a matrix can hold only a single type. So, if there is any character column, the whole matrix is converted to character. If the column is a factor, it is stored internally as integer codes, so it gets coerced to those integers. To prevent that, use data.frame or cbind.data.frame

(df.des <- data.frame(n[1], n=n[,2], mean=m[,2], 
         sd=round(sd[,2],3),se=round(sd[,2]/sqrt(n[,2]),3)))

As a reproducible example

cbind(factor(letters[1:3]), 4:6) # returns a matrix
#     [,1] [,2]
#[1,]    1    4
#[2,]    2    5
#[3,]    3    6

This happens because the factor is coerced to its integer storage values

as.integer(factor(letters[1:3]))
#[1] 1 2 3

If the column were character, the whole matrix would become character

cbind(letters[1:3], 4:6)
#    [,1] [,2]
#[1,] "a"  "4" 
#[2,] "b"  "5" 
#[3,] "c"  "6" 

Neither result is what we want. So, either use data.frame or call the data.frame method of cbind directly

cbind.data.frame(a1 = factor(letters[1:3]), b = 4:6) 
#   a1 b
#1  a 4
#2  b 5
#3  c 6
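As a small sketch of the dispatch rule, assuming made-up ratings in place of the Kaggle data (the names as_matrix and as_df are only illustrative): plain cbind also returns a data frame once at least one of its arguments is a data frame, which is why n[1] (a one-column data frame) behaves differently from n[,1] (a bare vector).

```r
# Toy data standing in for the Kaggle set (values are made up)
df <- data.frame(Rating = c(3.5, 4.0, 2.5, 4.5, 3.0, 5.0),
                 state  = c("CA", "CA", "NY", "NY", "IL", "IL"))

n <- aggregate(Rating ~ state, data = df, FUN = length)
m <- aggregate(Rating ~ state, data = df, FUN = mean)

# n[, 1] is a bare vector, so the default method builds a matrix
as_matrix <- cbind(n[, 1], n = n[, 2], mean = m[, 2])

# n[1] is a one-column data frame, so cbind dispatches to the data.frame method
as_df <- cbind(n[1], n = n[, 2], mean = m[, 2])

is.matrix(as_matrix)
is.data.frame(as_df)
as_df
```

In as_df the state column keeps its labels and the numeric columns stay numeric, because each column of a data frame can have its own type.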

NOTE: This answers the OP's original question about why it didn't work


Multiple functions can be applied more flexibly in the tidyverse. In base R, aggregate is not ideal for applying more than one function because it returns the results as a matrix column. In addition, if there are NA values, we may have to handle them by specifying na.action as well as the na.rm argument of mean or sd

library(dplyr)
df %>%
     group_by(state) %>%
     summarise(n = n(), across(Rating, 
         list(mean = ~ mean(., na.rm = TRUE), sd = ~ sd(., na.rm = TRUE))))
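For comparison, the NA handling mentioned above can be sketched in base R. This assumes a hypothetical toy data frame, df_na, in which one state has only missing ratings. The formula interface of aggregate drops rows with NA by default (na.action = na.omit), so that state vanishes unless na.pass is used together with na.rm = TRUE.

```r
# Hypothetical toy data: IL has only missing ratings
df_na <- data.frame(Rating = c(4, NA, 5, 3, NA, NA),
                    state  = c("CA", "CA", "NY", "NY", "IL", "IL"))

# Default na.action = na.omit removes NA rows first, so IL is dropped
dropped <- aggregate(Rating ~ state, data = df_na, FUN = mean)

# na.pass keeps those rows; na.rm = TRUE then removes NAs within each group
kept <- aggregate(Rating ~ state, data = df_na,
                  FUN = function(x) mean(x, na.rm = TRUE),
                  na.action = na.pass)

dropped
kept
```

Which behaviour is preferable depends on whether a group with no valid ratings should appear in the summary at all.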

Upvotes: 0

jay.sf

Reputation: 73397

You may want to compute all statistics at once in aggregate; this will save some pain.

f <- function(x) c(mean=mean(x), sd=sd(x), se=sd(x)/sqrt(length(x)), n=length(x))

r <- do.call(data.frame, aggregate(Rating ~ state, data=df, FUN=f))
r
#   state Rating.mean Rating.sd Rating.se Rating.n
# 1     A    4.000000  3.000000 1.7320508        3
# 2     B    3.666667  1.527525 0.8819171        3
# 3     C    6.666667  4.932883 2.8480012        3
# 4     D    5.000000  4.000000 2.3094011        3
# 5     E    7.333333  3.055050 1.7638342        3

Note: To learn why we need do.call(data.frame, .) here, see this answer.
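In short, and as a sketch on made-up data (df_toy and stats_fun are illustrative names only): when FUN returns a vector, aggregate stores all of the statistics in a single matrix column, and do.call(data.frame, .) spreads that matrix into ordinary columns.

```r
# Made-up data; two ratings per state so that sd() is defined
df_toy <- data.frame(Rating = c(3, 5, 4, 8, 6, 7),
                     state  = rep(c("A", "B", "C"), 2))

stats_fun <- function(x) c(mean = mean(x), sd = sd(x), n = length(x))

a <- aggregate(Rating ~ state, data = df_toy, FUN = stats_fun)
ncol(a)              # counts state plus one matrix column named Rating
is.matrix(a$Rating)  # the statistics live inside that matrix

# Passing the columns of a to data.frame() expands the matrix
r <- do.call(data.frame, a)
names(r)
```

Without the do.call step, a column such as Rating.mean cannot be selected directly with $, because only the matrix column Rating exists in the result.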


Data:

set.seed(42)
df <- data.frame(Rating=sample(1:10, 15, replace=TRUE),
                 state=rep(LETTERS[1:5], 3))

Upvotes: 1
