Reputation: 3
I have a data frame of survey responses, and some of the columns are questions where participants can select multiple answers ("select all that apply").
> age <- c(24, 28, 44, 55, 53)
> ethnicity <- c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi")
> ethnicity_other <- c(NA, NA, "luvale", NA, NA)
> df <- data.frame(age, ethnicity, ethnicity_other)
I would like those questions to be set up as binary-response items, so that each response choice found in ethnicity and ethnicity_other becomes its own column containing either a 0 or a 1.
So far, I have written a script that collects the individual unique responses into a list (z):
> x <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity_other), " ")), mode="list"))
> y <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity), " ")), mode="list"))
>
> combine <- c(x, y)
>
> z <- NULL
> for(i in combine){
> if(!is.na(i)){
> z <- append(z, i)
> }
> }
I then created new columns from that list and filled them with NA values.
> for(elm in z){
> df[paste0("ethnicity_",elm)] <- NA
> }
So now I have 35 additional columns that I would like to fill with ones and zeros, depending on whether that column name (or rather the part of it after the ethnicity_ prefix) can be found in the corresponding cell under ethnicity or ethnicity_other.
I have tried a number of approaches without finding a good solution.
Upvotes: 0
Views: 776
Reputation: 193517
Here's an approach using concat.split.expanded from my "splitstackshape" package:
## Combine your "ethnicity" and "ethnicity_other" columns
df$ethnicity <- paste(df$ethnicity,
                      ifelse(is.na(df$ethnicity_other), "",
                             as.character(df$ethnicity_other)))
## Drop the original "ethnicity_other" column
df$ethnicity_other <- NULL
## Split up the new "ethnicity" column
library(splitstackshape)
concat.split.expanded(df, "ethnicity", sep=" ",
                      type="character", fill=0, drop=TRUE)
#   age ethnicity_bemba ethnicity_lozi ethnicity_luvale ethnicity_ngoni
# 1  24               0              0                0               1
# 2  28               1              0                0               0
# 3  44               0              1                1               0
# 4  55               1              0                0               0
# 5  53               1              0                0               0
#   ethnicity_other ethnicity_tonga ethnicity_tongi
# 1               0               0               0
# 2               0               0               0
# 3               0               1               0
# 4               1               1               0
# 5               0               0               1
The fill argument can easily be set to anything else you want. It defaults to NA, but here I've set it to 0, since I think that's what you're looking for.
Upvotes: 0
Reputation: 6535
Here are a couple of ways to do this with plyr or data.table.
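## Collect every distinct choice across both columns
## (as.character() guards against the columns being factors)
all_ethnicities <- unique(c(
  unlist(strsplit(as.character(df$ethnicity), " ")),
  unlist(strsplit(as.character(df$ethnicity_other), " "))
))
df$id <- 1:nrow(df)
library(plyr)
ddply(df, .(id), function(x)
  table(factor(unlist(strsplit(paste(x$ethnicity, x$ethnicity_other), " ")),
               levels = all_ethnicities)))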
##   id ngoni bemba lozi tonga other tongi luvale
## 1  1     1     0    0     0     0     0      0
## 2  2     0     1    0     0     0     0      0
## 3  3     0     0    1     1     0     0      1
## 4  4     0     1    0     1     1     0      0
## 5  5     0     1    0     0     0     1      0
library(data.table)
DT <- data.table(df)
DT[, {
  as.list(
    table(
      factor(
        unlist(strsplit(paste(ethnicity, ethnicity_other), " ")),
        levels = all_ethnicities)
    )
  )
}, by = id]
##    id ngoni bemba lozi tonga other tongi luvale
## 1:  1     1     0    0     0     0     0      0
## 2:  2     0     1    0     0     0     0      0
## 3:  3     0     0    1     1     0     0      1
## 4:  4     0     1    0     1     1     0      0
## 5:  5     0     1    0     0     0     1      0
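Both versions rely on the same idiom: convert one respondent's answers to a factor whose levels are fixed in advance, then let table() count every level, including the ones that do not occur. Here is a minimal sketch of that step for a single row. The names all_levels, one_row and counts are only illustrative, and the level vector is typed out by hand rather than derived from the data:

```r
# Levels fixed in advance (hand-written here; in the code above they
# come from all_ethnicities)
all_levels <- c("ngoni", "bemba", "lozi", "tonga", "other", "tongi", "luvale")

# One respondent's answers, split on spaces
one_row <- unlist(strsplit("lozi tonga luvale", " "))

# table() reports a count for every level, so absent choices show up as 0
counts <- table(factor(one_row, levels = all_levels))

# as.list() turns the counts into one named element per choice,
# which is the shape data.table needs to build columns
as.list(counts)
```

Anything that is not among the levels (for example the literal "NA" string that paste() produces for a missing ethnicity_other) becomes NA in the factor and is dropped by table(). Note that these are counts, so a choice listed twice in the same row would come out as 2 rather than 1.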
Upvotes: 1
Reputation: 6267
Here is how I would do it:
First, you need something to store the ethnicities of each participant. My way to do it is to build a list of these:
ethnicities <- strsplit(as.character(df$ethnicity), " ")
For your particular example, we would have:
> ethnicities
[[1]]
[1] "ngoni"
[[2]]
[1] "bemba"
[[3]]
[1] "lozi" "tonga"
[[4]]
[1] "bemba" "tonga" "other"
[[5]]
[1] "bemba" "tongi"
Then iterate over this list to fill your data.frame df:
for (i in seq_along(ethnicities)) {
  for (eth in ethnicities[[i]]) {
    df[[paste0('ethnicity_', eth)]][i] <- 1
  }
}
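The resulting value for df should be as follows. Note that the choice "other" maps onto the existing ethnicity_other column, and that ethnicity_luvale stays NA because only df$ethnicity is processed here: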
> df
  age         ethnicity ethnicity_other ethnicity_luvale ethnicity_ngoni ethnicity_bemba
1  24             ngoni              NA               NA               1              NA
2  28             bemba              NA               NA              NA               1
3  44        lozi tonga              NA               NA              NA              NA
4  55 bemba tonga other               1               NA              NA               1
5  53       bemba tongi              NA               NA              NA               1
  ethnicity_lozi ethnicity_tonga ethnicity_tongi
1             NA              NA              NA
2             NA              NA              NA
3              1               1              NA
4             NA               1              NA
5             NA              NA               1
There are other ways to do it. You could also wrap these two for loops in sapply calls, but I suspect the resulting code would be no more efficient and would be harder to read.
Does this help?
edit:
BTW, if you really want 0 instead of NA in your data.frame, it is as easy as changing your code initializing the added columns:
> for(elm in z){
> df[paste0("ethnicity_",elm)] <- 0 # instead of NA
> }
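For comparison, here is a rough base-R sketch of the loop-free variant that also takes ethnicity_other into account and produces 0/1 directly. It is only one possible way to write it. The names tokens, choices, indicators and result are made up for this example, and it rebuilds the sample data with stringsAsFactors = FALSE so that it runs on its own:

```r
# Sample data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  age = c(24, 28, 44, 55, 53),
  ethnicity = c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi"),
  ethnicity_other = c(NA, NA, "luvale", NA, NA),
  stringsAsFactors = FALSE
)

# For each respondent, split both columns on spaces and drop missing values
tokens <- lapply(seq_len(nrow(df)), function(i) {
  parts <- unlist(strsplit(c(as.character(df$ethnicity[i]),
                             as.character(df$ethnicity_other[i])), " "))
  parts[!is.na(parts)]
})

# Every distinct choice, in order of first appearance
choices <- unique(unlist(tokens))

# One integer column per choice: 1 if the respondent selected it, else 0
indicators <- vapply(
  choices,
  function(ch) vapply(tokens, function(tk) as.integer(ch %in% tk), integer(1)),
  integer(nrow(df))
)
colnames(indicators) <- paste0("ethnicity_", choices)

# Keep only age from the original columns, so the indicator named
# ethnicity_other does not clash with the free-text column of the same name
result <- cbind(df["age"], indicators)
```

Dropping the two original text columns before binding is a deliberate choice in this sketch. If you need to keep them, use a different prefix for the indicator columns.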
Upvotes: 0