Reputation: 257
I have the following dataframe:
IV     Device1    Device2  Device3
Color  Same       Same     Missing
Color  Different  Same     Missing
Color  Same       Unique   Missing
Shape  Same       Missing  Same
Shape  Different  Same     Different
Explanation: each IV (independent variable) is composed of several measurements (the 'Color' section is composed of 3 measurements, while 'Shape' is composed of 2).
Each data point has one of 4 possible categorical values: Same/Different/Unique/Missing. 'Missing' means that there is no value for that measurement in the case of that device, while the other 3 values represent the existing result for that measurement.
Question: for each device, I want to calculate the percentage of Same, Different and Unique values within each IV (three percentages per device and IV). The denominator is the number of values the device has for that IV, not counting 'Missing' values.
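For example, Device2 would have the following percentages: for Color, 67% Same, 0% Different and 33% Unique (out of 3 non-missing values); for Shape, 100% Same (out of 1 non-missing value).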
Thank you!
Upvotes: 0
Views: 486
Reputation: 4965
This is not a tidy solution, but you can use it until someone else posts a better one:
# Replace all "Missing" with NAs
df[df == "Missing"] <- NA

# Create factor levels
df[,-1] <- lapply(df[,-1], function(x) {
  factor(x, levels = c('Same', 'Different', 'Unique'))
})

# Custom function to calculate percent of categorical responses
custom <- function(x) {
  y <- length(na.omit(x))
  if(y > 0)
    return(round((table(x)/y)*100))
  else
    return(rep(0, 3))
}

library(purrr)

# Split the dataframe on IV, remove the IV column and apply the custom function
Final <- df %>% split(df$IV) %>%
  map(., function(x) {
    x <- x[, -1]
    t(sapply(x, custom))
  })
Output
Final is a list of two matrices, one per IV (rows are devices, columns are percentages):
$Color
        Same Different Unique
Device1   67        33      0
Device2   67         0     33
Device3    0         0      0

$Shape
        Same Different Unique
Device1   50        50      0
Device2  100         0      0
Device3   50        50      0
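The same percentages can also be obtained without purrr. The following is only a minimal base R sketch, not the approach used above: it rebuilds the data frame from the question, stacks the device columns into long format, and lets table() and prop.table() do the counting. The names lvls, long, counts and pct are made up for illustration, and a device with no values for an IV is set to 0 to match the output shown above.

```r
# Rebuild the question's data frame (values copied from the question)
df <- data.frame(
  IV      = c("Color", "Color", "Color", "Shape", "Shape"),
  Device1 = c("Same", "Different", "Same", "Same", "Different"),
  Device2 = c("Same", "Same", "Unique", "Missing", "Same"),
  Device3 = c("Missing", "Missing", "Missing", "Same", "Different"),
  stringsAsFactors = FALSE
)

# Categories to report; anything else (i.e. "Missing") becomes NA
lvls <- c("Same", "Different", "Unique")

# Long format: one row per (IV, device, value)
long <- data.frame(
  IV     = rep(df$IV, times = ncol(df) - 1),
  Device = rep(names(df)[-1], each = nrow(df)),
  Value  = factor(unlist(df[-1], use.names = FALSE), levels = lvls)
)

# Three-way counts; table() drops NA values by default
counts <- table(IV = long$IV, Device = long$Device, Value = long$Value)

# Percentages within each IV x Device combination
pct <- round(100 * prop.table(counts, margin = c(1, 2)))

# A device with no values for an IV gives 0/0 = NaN; report 0 instead
pct[is.nan(pct)] <- 0

# Device x Value percentages for one IV
pct["Color", , ]
```

Indexing with pct["Shape", , ] gives the corresponding table for the other IV.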
Data
structure(list(IV = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("Color",
"Shape"), class = "factor"), Device1 = structure(c(1L, 2L, 1L,
1L, 2L), .Label = c("Same", "Different", "Unique"), class = "factor"),
Device2 = structure(c(1L, 1L, 3L, NA, 1L), .Label = c("Same",
"Different", "Unique"), class = "factor"), Device3 = structure(c(NA,
NA, NA, 1L, 2L), .Label = c("Same", "Different", "Unique"
), class = "factor")), .Names = c("IV", "Device1", "Device2",
"Device3"), row.names = c(NA, -5L), class = "data.frame")
Upvotes: 1
Reputation: 156
Quick and dirty: first, replace every 'Missing' with NA using your preferred method (sed, Excel, etc.); then you can use table() on each of the columns to get the summary statistics:
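# Proportion of each category among the non-missing values of one column
myStats <- function(x){
  table(factor(x, levels = c('Same', 'Different', 'Unique')))/sum(table(x))
}
# Drop the IV column so that only the device columns are summarised
apply(yourData[, -1], 2, myStats)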
This returns, for each column, the proportion of each category among the non-missing values (multiply by 100 to get percentages).
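The call above pools all rows of a column together. To get the proportions separately for each IV, as the question asks, one option is to split the data by IV first. This is a minimal sketch under some assumptions: 'Missing' has already been recoded as NA, the data frame is rebuilt from the question, byIV is a hypothetical name, and myStats is a slightly modified version that returns NA when a device has no values for an IV (instead of dividing by zero).

```r
# Question's data, with "Missing" already recoded as NA
yourData <- data.frame(
  IV      = c("Color", "Color", "Color", "Shape", "Shape"),
  Device1 = c("Same", "Different", "Same", "Same", "Different"),
  Device2 = c("Same", "Same", "Unique", NA, "Same"),
  Device3 = c(NA, NA, NA, "Same", "Different"),
  stringsAsFactors = FALSE
)

# Proportion of each category among the non-missing values of one column
myStats <- function(x) {
  counts <- table(factor(x, levels = c("Same", "Different", "Unique")))
  total <- sum(counts)
  # Assumption: flag "no data at all" as NA rather than returning NaN
  if (total == 0) {
    return(setNames(rep(NA_real_, length(counts)), names(counts)))
  }
  counts / total
}

# Split the device columns by IV, then summarise each column within each group
byIV <- lapply(split(yourData[, -1], yourData$IV), function(d) sapply(d, myStats))

byIV$Color  # rows: Same/Different/Unique, columns: Device1..Device3
```

Each element of byIV is a matrix with categories in rows and devices in columns; wrap it in t() if you prefer devices in rows.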
Upvotes: 1