Reputation: 3
In the DataAnalyst data (from Kaggle), I am trying to show descriptive statistics of Rating (a numeric value) by state (a categorical factor). Everything displays correctly except the state names, which show up as numbers:
m<-aggregate(Rating~state, data=df,mean)
sd<-aggregate(Rating~state, data=df,sd)
n<-aggregate(Rating~state, data=df,length)
##summary descriptive table
(df.des <- cbind(n[,1], n=n[,2], mean=m[,2], sd=round(sd[,2],3),se=round(sd[,2]/sqrt(n[,2]),3)))
For df.des, I understand that n[,1] selects the column to display. I have also tried n[,2], which gives the count per state. How can I get the table to display the state names instead of numbers? P.S. "State" is stored as character values (e.g. CA, NY, IL), not numbers.
[Image: descriptive statistics with numbers instead of state categories]
Upvotes: 0
Views: 191
Reputation: 887621
By default, cbind builds a matrix; its data frame method, cbind.data.frame, is used only when at least one argument is a data frame. A matrix can hold only a single type. So, if any column is character, the whole matrix is converted to character. If a column is a factor, it is coerced to its integer codes, because a factor is stored as integers. To prevent that, use data.frame or cbind.data.frame:
(df.des <- data.frame(n[1], n=n[,2], mean=m[,2],
sd=round(sd[,2],3),se=round(sd[,2]/sqrt(n[,2]),3)))
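To make the difference concrete, here is a minimal sketch on toy data. The data frame toy and its values are hypothetical stand-ins for the Kaggle data, and the names n_tab, m_tab, sd_tab, as_matrix and as_df are chosen only for illustration:

```r
# Toy data standing in for the Kaggle data set (hypothetical values)
toy <- data.frame(
  state  = factor(c("CA", "CA", "NY", "NY", "IL", "IL")),
  Rating = c(3.5, 4.5, 2.0, 4.0, 3.0, 5.0)
)

n_tab  <- aggregate(Rating ~ state, data = toy, FUN = length)
m_tab  <- aggregate(Rating ~ state, data = toy, FUN = mean)
sd_tab <- aggregate(Rating ~ state, data = toy, FUN = sd)

# cbind() on plain vectors builds a matrix, so the factor is
# reduced to its integer codes and the state labels are lost
as_matrix <- cbind(n_tab[, 1], n = n_tab[, 2], mean = m_tab[, 2])

# data.frame() keeps each column's own type, so the labels survive
as_df <- data.frame(
  state = n_tab[, 1],
  n     = n_tab[, 2],
  mean  = m_tab[, 2],
  sd    = round(sd_tab[, 2], 3),
  se    = round(sd_tab[, 2] / sqrt(n_tab[, 2]), 3)
)

print(as_matrix)
print(as_df)
```

Naming the first column explicitly (state = n_tab[, 1]) is a small variation on n[1]; both keep the state labels.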
As a reproducible example:
cbind(factor(letters[1:3]), 4:6)  # returns a matrix
# [,1] [,2]
#[1,] 1 4
#[2,] 2 5
#[3,] 3 6
The reason is that the factor is coerced to its integer storage values:
as.integer(factor(letters[1:3]))
#[1] 1 2 3
If the column is character, the whole matrix becomes character:
cbind(letters[1:3], 4:6)
# [,1] [,2]
#[1,] "a" "4"
#[2,] "b" "5"
#[3,] "c" "6"
Neither result is what we want. So, either use data.frame or call the data.frame method of cbind directly:
cbind.data.frame(a1 = factor(letters[1:3]), b = 4:6)
# a1 b
#1 a 4
#2 b 5
#3 c 6
NOTE: This answers the OP's original question about why it didn't work.
Multiple functions can be applied in a more flexible way with the tidyverse. In base R, aggregate is not ideal for applying more than one function, as it returns the results as a matrix column. In addition, if there are NA values, we may have to handle them by specifying na.action as well as the na.rm argument of mean or sd:
library(dplyr)
df %>%
group_by(state) %>%
summarise(n = n(), across(Rating,
list(mean = ~ mean(., na.rm = TRUE), sd = ~ sd(., na.rm = TRUE))))
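For comparison, the point about NA handling in base R can be sketched as follows. The toy data and the object names are hypothetical. The sketch relies on the formula interface of aggregate defaulting to na.action = na.omit and forwarding extra arguments such as na.rm to FUN:

```r
# Toy data with one missing rating (hypothetical values)
toy <- data.frame(
  state  = c("CA", "CA", "CA", "NY", "NY", "NY"),
  Rating = c(4, NA, 2, 5, 3, 1)
)

# Default na.action = na.omit: rows with NA are dropped before grouping
default_mean <- aggregate(Rating ~ state, data = toy, FUN = mean)
n_default    <- aggregate(Rating ~ state, data = toy, FUN = length)

# na.pass keeps the NA rows, so FUN has to deal with them itself;
# na.rm = TRUE is passed through to mean()
kept_mean <- aggregate(Rating ~ state, data = toy, FUN = mean,
                       na.rm = TRUE, na.action = na.pass)
n_kept    <- aggregate(Rating ~ state, data = toy, FUN = length,
                       na.action = na.pass)

# Keeping the NA rows without na.rm makes the affected group mean NA
na_mean <- aggregate(Rating ~ state, data = toy, FUN = mean,
                     na.action = na.pass)

print(default_mean)
print(n_default)
print(kept_mean)
print(n_kept)
print(na_mean)
```

The means agree in the first two variants, but the group sizes differ, which matters if n is later used to compute a standard error.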
Upvotes: 0
Reputation: 73397
You may want to compute all the statistics at once in aggregate; this will save some pain.
f <- function(x) c(mean=mean(x), sd=sd(x), se=sd(x)/sqrt(length(x)), n=length(x))
r <- do.call(data.frame, aggregate(Rating ~ state, data=df, FUN=f))
r
# state Rating.mean Rating.sd Rating.se Rating.n
# 1 A 4.000000 3.000000 1.7320508 3
# 2 B 3.666667 1.527525 0.8819171 3
# 3 C 6.666667 4.932883 2.8480012 3
# 4 D 5.000000 4.000000 2.3094011 3
# 5 E 7.333333 3.055050 1.7638342 3
Note: To learn why we need do.call(data.frame, .) here, see this answer.
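As a rough illustration of that note: when FUN returns a vector, aggregate stores the results in a single matrix column, and do.call(data.frame, .) spreads that matrix into ordinary columns. The toy data and the names f2, raw and flat below are hypothetical:

```r
# Toy data (hypothetical values)
toy <- data.frame(
  state  = rep(c("A", "B"), each = 3),
  Rating = c(1, 2, 3, 4, 6, 8)
)

# A summary function that returns more than one value per group
f2 <- function(x) c(mean = mean(x), n = length(x))

raw <- aggregate(Rating ~ state, data = toy, FUN = f2)

# raw has only two columns: state and a matrix column named Rating
print(ncol(raw))
print(is.matrix(raw$Rating))

# data.frame() expands the matrix column into one column per statistic
flat <- do.call(data.frame, raw)
print(names(flat))
print(flat)
```

After flattening, each statistic can be accessed as a regular column, for example flat$Rating.mean.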
Data:
set.seed(42)
df <- data.frame(Rating=sample(1:10, 15, replace=TRUE),
state=rep(LETTERS[1:5], 3))
Upvotes: 1