Reputation: 3

Convert "select all that apply" to binary choices

I have a data frame of survey responses, and some of the columns are questions where participants can select multiple answers ("select all that apply").

> age <- c(24, 28, 44, 55, 53)
> ethnicity <- c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi")
> ethnicity_other <- c(NA, NA, "luvale", NA, NA) 
> df <- data.frame(age, ethnicity, ethnicity_other)

I would like those questions to be set up as binary-response items, so that each of the response choices (in this case ethnicity and ethnicity_other) becomes a column vector with either a 0 or a 1.

So far, I wrote a script that separates the individual unique responses into a list (z):

> x <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity_other), " ")), mode="list"))
> y <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity), " ")), mode="list"))
>
> combine <- c(x, y)
>
> z <- NULL
> for(i in combine){
>   if(!is.na(i)){
>     z <- append(z, i)
>   }
> }

I then created new columns from that list and filled them with NA values.

> for(elm in z){
>   df[paste0("ethnicity_",elm)]  <- NA
> }

So now I have 35 additional columns that I would like to fill with ones and zeros, depending on whether that column name (or part of that column name, as I prefix it with ethnicity_) can be found in the corresponding cell under ethnicity or ethnicity_other. I tried taking a stab at it a number of ways with no good solution.

Upvotes: 0

Answers (3)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Here's an approach using concat.split.expanded from my "splitstackshape" package:

## Combine your "ethnicity" and "ethnicity_other" column
df$ethnicity <- paste(df$ethnicity, 
                      ifelse(is.na(df$ethnicity_other), "", 
                             as.character(df$ethnicity_other)))
## Drop the original "ethnicity_other" column
df$ethnicity_other <- NULL

## Split up the new "ethnicity" column
library(splitstackshape)
concat.split.expanded(df, "ethnicity", sep=" ", 
                      type="character", fill=0, drop=TRUE)
#   age ethnicity_bemba ethnicity_lozi ethnicity_luvale ethnicity_ngoni
# 1  24               0              0                0               1
# 2  28               1              0                0               0
# 3  44               0              1                1               0
# 4  55               1              0                0               0
# 5  53               1              0                0               0
#   ethnicity_other ethnicity_tonga ethnicity_tongi
# 1               0               0               0
# 2               0               0               0
# 3               0               1               0
# 4               1               1               0
# 5               0               0               1

The fill argument can easily be set to anything else you want. It defaults to NA, but here, I've set it to 0 since I think that's what you're looking for.

Upvotes: 0

Jake Burkhead

Reputation: 6535

Here are a couple of ways to do this, with plyr or data.table.

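## as.character() guards against factor columns (the data.frame() default before R 4.0)
all_ethnicities <- unique(c(
    unlist(strsplit(as.character(df$ethnicity), " ")),
    unlist(strsplit(as.character(df$ethnicity_other), " "))
    ))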

df$id <- 1:nrow(df)

library(plyr)

ddply(df, .(id), function(x)
      table(factor(unlist(strsplit(paste(x$ethnicity, x$ethnicity_other), " ")),
                   levels = all_ethnicities)))

##    id ngoni bemba lozi tonga other tongi luvale
## 1  1     1     0    0     0     0     0      0
## 2  2     0     1    0     0     0     0      0
## 3  3     0     0    1     1     0     0      1
## 4  4     0     1    0     1     1     0      0
## 5  5     0     1    0     0     0     1      0

library(data.table)

DT <- data.table(df)

DT[, {
    as.list(
        table(
            factor(
                unlist(strsplit(paste(ethnicity, ethnicity_other), " ")),
                levels = all_ethnicities)
            )
        )
}, by = id]

##     id ngoni bemba lozi tonga other tongi luvale
## 1:  1     1     0    0     0     0     0      0
## 2:  2     0     1    0     0     0     0      0
## 3:  3     0     0    1     1     0     0      1
## 4:  4     0     1    0     1     1     0      0
## 5:  5     0     1    0     0     0     1      0
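
Both versions rely on the same per-row idiom: split the row's text into tokens, turn them into a factor whose levels are the full set of choices, and let table() count each level. Here is a minimal base R sketch of that idiom on row 4 of the example data. The names all_choices and row_tokens are only illustrative, and the choices are typed out by hand rather than computed.

```r
# Full set of choices, typed out here for the example
all_choices <- c("ngoni", "bemba", "lozi", "tonga", "other", "tongi", "luvale")

# Row 4 of the example: ethnicity is "bemba tonga other" and ethnicity_other
# is NA, so paste() produces the string "bemba tonga other NA"
row_tokens <- unlist(strsplit(paste("bemba tonga other", NA), " "))

# "NA" is not one of the levels, so factor() turns it into a missing value,
# which table() drops by default; levels that do not occur are counted as 0
counts <- table(factor(row_tokens, levels = all_choices))
print(counts)
```

Note that table() counts occurrences, so a choice repeated within one row would show up as 2 rather than 1.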

Upvotes: 1

Jealie

Reputation: 6267

Here is how I would do it:

First, you need something to store the ethnicities of each participant. My way to do it is to build a list of these:

ethnicities <- strsplit(as.character(df$ethnicity), " ")

For your particular example, we would have:

> ethnicities
[[1]]
[1] "ngoni"

[[2]]
[1] "bemba"

[[3]]
[1] "lozi"  "tonga"

[[4]]
[1] "bemba" "tonga" "other"

[[5]]
[1] "bemba" "tongi"

Then iterate over these to fill your data.frame df (this relies on the ethnicity_ columns you have already created):

for (i in seq_along(ethnicities)) {
  for (eth in ethnicities[[i]]) {
    df[[paste0("ethnicity_", eth)]][i] <- 1
  }
}

The resulting value for df should be:

> df
  age         ethnicity ethnicity_other ethnicity_luvale ethnicity_ngoni ethnicity_bemba
1  24             ngoni              NA               NA               1              NA
2  28             bemba              NA               NA              NA               1
3  44        lozi tonga              NA               NA              NA              NA
4  55 bemba tonga other               1               NA              NA               1
5  53       bemba tongi              NA               NA              NA               1
  ethnicity_lozi ethnicity_tonga ethnicity_tongi
1             NA              NA              NA
2             NA              NA              NA
3              1               1              NA
4             NA               1              NA
5             NA              NA               1

There are other ways to do it. You could also pack these two for loops in sapply, but I have the feeling that the resulting code would not be more efficient (but would be more complicated to read!).
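
For comparison, here is a rough sketch of what an sapply version could look like. It is only one possible arrangement, not a tested drop-in replacement. It assumes the example data from the question, also picks up the free-text ethnicity_other column, and writes 0/1 directly instead of NA/1. The eth_ prefix and the names tokens, choices and indicators are assumptions made for this example. A different prefix is used because the response "other" would otherwise produce a column called ethnicity_other, which already exists.

```r
# Example data from the question, kept as character columns
df <- data.frame(
  age = c(24, 28, 44, 55, 53),
  ethnicity = c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi"),
  ethnicity_other = c(NA, NA, "luvale", NA, NA),
  stringsAsFactors = FALSE
)

# One character vector of responses per participant, taken from both columns;
# the NA that comes from an empty ethnicity_other is filtered out
tokens <- Map(function(a, b) {
                x <- c(a, b)
                x[!is.na(x)]
              },
              strsplit(df$ethnicity, " "),
              strsplit(df$ethnicity_other, " "))

# Every distinct response, in order of first appearance
choices <- unique(unlist(tokens))

# sapply() returns one column per participant, so transpose to get
# one row per participant and one 0/1 column per choice
indicators <- t(sapply(tokens, function(tk) as.integer(choices %in% tk)))
colnames(indicators) <- paste0("eth_", choices)

result <- cbind(df["age"], indicators)
print(result)
```

Because %in% only tests membership, a response repeated within one cell still gives 1, not 2.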

Does this help?

edit:

BTW, if you really want 0 instead of NA in your data.frame, just change the code that initializes the added columns:

> for(elm in z){
>   df[paste0("ethnicity_",elm)]  <- 0 # instead of NA
> }

Upvotes: 0
