petergensler
petergensler

Reputation: 342

dcast summarise on single column

I would like to pivot my data so I can get the average survival rate using dcast, but it seems like it may not be possible:

Data

PassengerId Survived    Pclass  Name    Sex Age SibSp   Parch   Ticket  Fare    Cabin   Embarked
1   0   3   Braund, Mr. Owen Harris male    22  1   0   A/5 21171   7.25        S
2   1   1   Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38  1   0   PC 17599    71.2833 C85 C
3   1   3   Heikkinen, Miss. Laina  female  26  0   0   STON/O2. 3101282    7.925       S

code for sample data:

df <- structure(list(PassengerId = 1:6, Survived = structure(c(1L, 
                                                                  2L, 2L, 2L, 1L, 1L), .Label = c("0", "1"), class = "factor"), 
                        Pclass = c(3L, 1L, 3L, 1L, 3L, 3L), Name = c("Braund, Mr. Owen Harris", 
                                                                     "Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "Heikkinen, Miss. Laina", 
                                                                     "Futrelle, Mrs. Jacques Heath (Lily May Peel)", "Allen, Mr. William Henry", 
                                                                     "Moran, Mr. James"), Sex = c("male", "female", "female", 
                                                                                                  "female", "male", "male"), Age = c(22, 38, 26, 35, 35, NA
                                                                                                  ), SibSp = c(1L, 1L, 0L, 1L, 0L, 0L), Parch = c(0L, 0L, 0L, 
                                                                                                                                                  0L, 0L, 0L), Ticket = c("A/5 21171", "PC 17599", "STON/O2. 3101282", 
                                                                                                                                                                          "113803", "373450", "330877"), Fare = c(7.25, 71.2833, 7.925, 
                                                                                                                                                                                                                  53.1, 8.05, 8.4583), Cabin = c("", "C85", "", "C123", "", 
                                                                                                                                                                                                                                                 ""), Embarked = c("S", "C", "S", "S", "S", "Q")), .Names = c("PassengerId", 
                                                                                                                                                                                                                                                                                                              "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", 
                                                                                                                                                                                                                                                                                                              "Ticket", "Fare", "Cabin", "Embarked"), row.names = c(NA, 6L), class = "data.frame")

Function so far:

reshape2::dcast(titanic, Sex ~ ., mean)

Desired Output:

Row Label  Average of Survived 
Male       3.14156  
Female     3.14156

Currently, it returns this error:

     Sex  .
1 female NA
2   male NA
Warning messages:
1: In mean.default(.value[0], ...) :
  argument is not numeric or logical: returning NA

I think this was possible to do with the cast function in reshape, but is this possible to do with reshape2?

Upvotes: 0

Views: 741

Answers (3)

Uwe
Uwe

Reputation: 42544

This could be done with dcast() from the reshape2 (or the data.table) package as shown by the OP's own answer.

Without dcast(), you could do the aggregation directly with data.table as well:

library(data.table)
setDT(df)[, Survived := as.numeric(as.character(Survived))][, mean(Survived), by = Sex]
#      Sex V1
#1:   male  0
#2: female  1

df is used as given by the dput() in the Q. Chaining is used to form a "one-liner".

An even more concise version of the above would be

setDT(df)[, mean(as.numeric(as.character(Survived))), by = Sex]

Upvotes: 2

shiny
shiny

Reputation: 3502

What about trying it using dplyr?

library(dplyr)
output <-  df  %>% 
  dplyr::mutate(Survived = as.numeric(as.character(Survived))) %>%  
  dplyr::select(Sex, Survived) %>% 
  dplyr::group_by(Sex) %>% 
  dplyr::summarise(average_of_survived = mean(Survived))
output
## A tibble: 2 × 2
#     Sex average_of_survived
#   <chr>               <dbl>
#1 female                   1
#2   male                   0

Upvotes: 2

petergensler
petergensler

Reputation: 342

So, you can indeed use dcast for this, but Survived was a factor, which was throwing an error, and you need to define which column you want to use as a value to compute on. Turns out the column order does not matter either, which is surprising.

df$Survived <- as.numeric(as.character(df$Survived))
reshape2::dcast(df, Sex~., mean, value.var = "Survived")
#     Sex .
#1 female 1
#2   male 0

Upvotes: 2

Related Questions