Thredolsen

Reputation: 257

Calculating percent of categorical responses (with grouping) in R

I have the following dataframe:

IV      Device1     Device2    Device3
Color   Same        Same       Missing
Color   Different   Same       Missing
Color   Same        Unique     Missing
Shape   Same        Missing    Same
Shape   Different   Same       Different

Explanation: each IV (independent variable) is composed of several measurements, one per row: 'Color' is composed of 3 measurements, while 'Shape' is composed of 2.

Each data point has one of 4 possible categorical values: Same/Different/Unique/Missing. 'Missing' means that there is no value for that measurement in the case of that device, while the other 3 values represent the existing result for that measurement.
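
To make the example reproducible, the table above could be rebuilt roughly as follows. This is only a sketch: the name `df` and the choice to keep 'Missing' as a plain string are assumptions, not something stated in the question. It also counts the non-missing values per IV and device, which are the denominators the percentages would be based on.

```r
# Rebuild the example table; 'Missing' is kept as a plain string here (assumption)
df <- data.frame(
  IV      = c("Color", "Color", "Color", "Shape", "Shape"),
  Device1 = c("Same", "Different", "Same", "Same", "Different"),
  Device2 = c("Same", "Same", "Unique", "Missing", "Same"),
  Device3 = c("Missing", "Missing", "Missing", "Same", "Different"),
  stringsAsFactors = FALSE
)

# Number of non-missing measurements per IV (rows) and device (columns).
# rowsum() needs numeric input, so the logical matrix is multiplied by 1.
n_valid <- rowsum((df[, -1] != "Missing") * 1, group = df$IV)
print(n_valid)
```

Device3 has no valid 'Color' measurements at all, so any solution has to decide what to report when the denominator is zero.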

Question: for each device, within each IV, I want to calculate the percentage of values that are Same, Different, or Unique (thus generating 3 different percentages), out of that device's total number of values for that IV (not including cases where there is a 'Missing' value).

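For example, Device2 would have the following percentages:

Color: Same 67%, Different 0%, Unique 33% (out of 3 non-missing values)
Shape: Same 100%, Different 0%, Unique 0% (out of 1 non-missing value)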

Thank you!

Upvotes: 0

Views: 486

Answers (2)

Sumedh

Reputation: 4965

This is not a tidy solution, but you can use it until someone else posts a better one:

# Replace all "Missing" with NAs
df[df == "Missing"] <- NA


# Create factor levels
df[,-1] <- lapply(df[,-1], function(x) {
        factor(x, levels = c('Same', 'Different', 'Unique'))
})


# Custom function to calculate percent of categorical responses
custom <- function(x) {
        y <- length(na.omit(x))
        if(y > 0) 
                return(round((table(x)/y)*100))
        else
                return(rep(0, 3))
}


library(purrr)

# Split the dataframe on IV, remove the IV column and apply the custom function
Final <- df %>% split(df$IV) %>% 
    map(., function(x) {
      x <- x[, -1]
      t(sapply(x, custom))
    })

Output

Final is a list of two matrices, one per IV, with percentages rounded to the nearest integer:

$Color
        Same Different Unique
Device1   67        33      0
Device2   67         0     33
Device3    0         0      0

$Shape
        Same Different Unique
Device1   50        50      0
Device2  100         0      0
Device3   50        50      0
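
As a quick sanity check, the Device2 row of `$Color` can be reproduced by calling `custom()` directly on that device's three 'Color' values. This is a sketch rather than part of the original answer; the vector names `device2_color` and `all_missing` are made up for illustration.

```r
# Same function as in the answer above
custom <- function(x) {
  y <- length(na.omit(x))            # number of non-missing values
  if (y > 0)
    return(round((table(x) / y) * 100))
  else
    return(rep(0, 3))                # no valid values: report zeros
}

lev <- c("Same", "Different", "Unique")

# Device2, 'Color' rows: Same, Same, Unique
device2_color <- factor(c("Same", "Same", "Unique"), levels = lev)
res <- custom(device2_color)
print(res)   # 2/3 and 1/3 round to 67 and 33

# Device3, 'Color' rows: every value is missing
all_missing <- factor(c(NA, NA, NA), levels = lev)
print(custom(all_missing))
```

Note that the all-missing case is reported as 0/0/0, which cannot be told apart from a device whose percentages really are zero; returning NA instead may be preferable depending on how the result is used.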

Data

df <- structure(list(IV = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("Color", 
"Shape"), class = "factor"), Device1 = structure(c(1L, 2L, 1L, 
1L, 2L), .Label = c("Same", "Different", "Unique"), class = "factor"), 
    Device2 = structure(c(1L, 1L, 3L, NA, 1L), .Label = c("Same", 
    "Different", "Unique"), class = "factor"), Device3 = structure(c(NA, 
    NA, NA, 1L, 2L), .Label = c("Same", "Different", "Unique"
    ), class = "factor")), .Names = c("IV", "Device1", "Device2", 
"Device3"), row.names = c(NA, -5L), class = "data.frame")

Upvotes: 1

Ricardo A.

Reputation: 156

Quick and dirty: first, replace 'Missing' with NA using your preferred method (sed, Excel, etc.); then you can use table on each of the device columns to get the summary statistics:

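# Proportion of each response among the non-NA values of one column
myStats <- function(x){
    table(factor(x, levels = c('Same', 'Different', 'Unique')))/sum(table(x))
}
# Drop the IV column (column 1) so that only the device columns are summarised
apply(yourData[, -1], 2, myStats)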

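This returns, for each device, the proportion of Same/Different/Unique among its non-missing values (multiply by 100 for percentages). Note that it pools all rows together; for per-IV figures, apply it to each IV's rows separately.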
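
One possible way to get per-IV figures with this approach is to split the data on IV first and then apply the function to each piece. The following is a minimal sketch, not the answerer's code: it assumes 'Missing' has already been replaced by NA, rebuilds the example data inline as `yourData`, and uses a slightly restated `myStats` whose denominator is the count of the three valid levels.

```r
# Example data with 'Missing' already replaced by NA (assumption)
yourData <- data.frame(
  IV      = c("Color", "Color", "Color", "Shape", "Shape"),
  Device1 = c("Same", "Different", "Same", "Same", "Different"),
  Device2 = c("Same", "Same", "Unique", NA, "Same"),
  Device3 = c(NA, NA, NA, "Same", "Different"),
  stringsAsFactors = FALSE
)

# Proportion of each valid response among the non-NA values
myStats <- function(x) {
  counts <- table(factor(x, levels = c("Same", "Different", "Unique")))
  counts / sum(counts)
}

# Split the device columns by IV, then summarise every device within each piece
by_iv <- lapply(split(yourData[, -1], yourData$IV), function(block) {
  sapply(block, myStats)   # rows: responses, columns: devices
})

print(by_iv$Color)
print(by_iv$Shape)
```

Here a device with no valid values for an IV (Device3 under 'Color') comes out as NaN because of the division by zero, so it stays visibly distinct from a true 0.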

Upvotes: 1
