Reputation: 427
I need to recode a data set of test responses for use in another application (a program called BLIMP that imputes missing values). Specifically, I need to represent the test items and subscale assignments with dummy codes.
Here I create a data frame that holds the responses to a 10-item test for two persons in a nested format. These data are a simplified version of the actual input table.
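library(tidyverse)

# Two persons (101, 102), each answering the same 10 items
df <- tibble(
  person = rep(101:102, each = 10),
  item = as.factor(rep(1:10, 2)),
  # Random scores 1-4; no seed is set, so the values differ between runs
  response = sample(1:4, 20, replace = TRUE),
  scale = as.factor(rep(rep(1:2, each = 5), 2))
) %>% mutate(
  # Flag the last item of each subscale: the next row belongs to a
  # different scale, or there is no next row
  scale_last = case_when(
    as.integer(scale) != lead(as.integer(scale)) | is.na(lead(as.integer(scale))) ~ 1,
    TRUE ~ NA_real_
  )
)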
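The lead()-based flag above depends on the subscale changing between adjacent rows, so it would miss a person boundary if a test had only one subscale. A minimal sketch of a grouped alternative, assuming dplyr is available and using a hypothetical stand-in named toy (same layout as df, without the random responses):

```r
library(dplyr)

# Hypothetical stand-in with the same person/item/scale layout as `df`
toy <- data.frame(
  person = rep(101:102, each = 10),
  item   = rep(1:10, 2),
  scale  = rep(rep(1:2, each = 5), 2)
)

# Flag the final row inside each person-by-subscale block
flagged <- toy %>%
  group_by(person, scale) %>%
  mutate(scale_last = if_else(row_number() == n(), 1, NA_real_)) %>%
  ungroup()
```

This flags the same rows as the lead() approach for these data, but it does not rely on neighbouring rows differing in scale.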
The columns of df contain:
person: ID numbers for the persons (10 rows for each person)
item: test items 1-10 for each person. Note how the items are nested within each person.
response: score for each item
scale: the test has two subscales. Items 1-5 are assigned to subscale 1, and items 6-10 are assigned to subscale 2.
scale_last: a code of 1 in this column indicates that the item is the last item in its assigned subscale. This characteristic becomes important below.
I then create dummy codes for the items using the recipes package.
library(recipes)

dum <- df %>%
  recipe(~ .) %>%
  step_dummy(item, one_hot = TRUE) %>%
  prep(training = df) %>%
  bake(new_data = df)

print(dum, width = Inf)
# person response scale scale_last item_X1 item_X2 item_X3 item_X4 item_X5 item_X6 item_X7
# <int> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 101 2 1 NA 1 0 0 0 0 0 0
# 2 101 3 1 NA 0 1 0 0 0 0 0
# 3 101 3 1 NA 0 0 1 0 0 0 0
# 4 101 1 1 NA 0 0 0 1 0 0 0
# 5 101 1 1 1 0 0 0 0 1 0 0
# 6 101 1 2 NA 0 0 0 0 0 1 0
# 7 101 3 2 NA 0 0 0 0 0 0 1
# 8 101 4 2 NA 0 0 0 0 0 0 0
# 9 101 2 2 NA 0 0 0 0 0 0 0
#10 101 4 2 1 0 0 0 0 0 0 0
#11 102 2 1 NA 1 0 0 0 0 0 0
#12 102 1 1 NA 0 1 0 0 0 0 0
#13 102 2 1 NA 0 0 1 0 0 0 0
#14 102 3 1 NA 0 0 0 1 0 0 0
#15 102 2 1 1 0 0 0 0 1 0 0
#16 102 1 2 NA 0 0 0 0 0 1 0
#17 102 4 2 NA 0 0 0 0 0 0 1
#18 102 2 2 NA 0 0 0 0 0 0 0
#19 102 4 2 NA 0 0 0 0 0 0 0
#20 102 3 2 1 0 0 0 0 0 0 0
# item_X8 item_X9 item_X10
# <dbl> <dbl> <dbl>
# 1 0 0 0
# 2 0 0 0
# 3 0 0 0
# 4 0 0 0
# 5 0 0 0
# 6 0 0 0
# 7 0 0 0
# 8 1 0 0
# 9 0 1 0
#10 0 0 1
#11 0 0 0
#12 0 0 0
#13 0 0 0
#14 0 0 0
#15 0 0 0
#16 0 0 0
#17 0 0 0
#18 1 0 0
#19 0 1 0
#20 0 0 1
The output shows the item dummy codes represented in the columns with the item_ prefix. For downstream processing, I need a further level of recoding. Within each subscale, the items must be dummy-coded relative to the last item of the subscale. Here’s where the scale_last variable comes into play; this variable identifies the rows in the output that need to be recoded.
For example, the first of these rows is row 5, the row for the last item (item 5) in subscale 1 for person 101. In this row the value of column item_X5 needs to be recoded from 1 to 0. In the next row to be recoded (row 10), it is the value of item_X10 that needs to be recoded from 1 to 0. And so on.
I’m struggling to find the right combination of dplyr verbs to accomplish this. What’s tripping me up is the need to isolate specific cells within specific rows to be recoded.
Thanks in advance for any help!
Upvotes: 1
Views: 147
Reputation: 388817
We can use mutate_at with replace to set the values in the "item" columns to 0 in the rows where scale_last == 1:
library(dplyr)
dum %>% mutate_at(vars(starts_with("item")), ~replace(., scale_last == 1, 0))
# A tibble: 20 x 14
# person response scale scale_last item_X1 item_X2 item_X3 item_X4 item_X5
# <int> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 101 2 1 NA 1 0 0 0 0
# 2 101 3 1 NA 0 1 0 0 0
# 3 101 1 1 NA 0 0 1 0 0
# 4 101 1 1 NA 0 0 0 1 0
# 5 101 3 1 1 0 0 0 0 0
# 6 101 4 2 NA 0 0 0 0 0
# 7 101 4 2 NA 0 0 0 0 0
# 8 101 3 2 NA 0 0 0 0 0
# 9 101 2 2 NA 0 0 0 0 0
#10 101 4 2 1 0 0 0 0 0
#11 102 2 1 NA 1 0 0 0 0
#12 102 1 1 NA 0 1 0 0 0
#13 102 4 1 NA 0 0 1 0 0
#14 102 4 1 NA 0 0 0 1 0
#15 102 4 1 1 0 0 0 0 0
#16 102 3 2 NA 0 0 0 0 0
#17 102 4 2 NA 0 0 0 0 0
#18 102 1 2 NA 0 0 0 0 0
#19 102 4 2 NA 0 0 0 0 0
#20 102 4 2 1 0 0 0 0 0
# … with 5 more variables: item_X6 <dbl>, item_X7 <dbl>, item_X8 <dbl>,
# item_X9 <dbl>, item_X10 <dbl>
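As a side note, mutate_at() is superseded in dplyr 1.0.0 and later in favor of across(). A minimal sketch of the same recoding with across(), assuming dplyr >= 1.0.0 and using a small hypothetical stand-in named toy (two persons, four items, two subscales of two items each) in place of dum so that it runs on its own:

```r
library(dplyr)

# Hypothetical stand-in for `dum`: 2 persons x 4 one-hot item columns
toy <- data.frame(
  person     = rep(101:102, each = 4),
  scale_last = rep(c(NA, 1, NA, 1), 2),
  item_X1    = rep(c(1, 0, 0, 0), 2),
  item_X2    = rep(c(0, 1, 0, 0), 2),
  item_X3    = rep(c(0, 0, 1, 0), 2),
  item_X4    = rep(c(0, 0, 0, 1), 2)
)

# `%in% 1` gives FALSE rather than NA where scale_last is NA,
# so the row index is a plain TRUE/FALSE vector
recoded <- toy %>%
  mutate(across(starts_with("item"), ~ replace(.x, scale_last %in% 1, 0)))
```

Using scale_last %in% 1 instead of scale_last == 1 is a design choice made here, not part of the original answer; both work with a single replacement value, but the %in% form avoids NA in the index.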
In base R, we can use lapply over the "item" columns:
cols <- grep("^item", names(dum))
dum[cols] <- lapply(dum[cols], function(x) replace(x, dum$scale_last == 1, 0))
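Because every flagged row should end up with 0 in all item columns, the same result can probably be reached with a single row-and-column assignment. A rough sketch, shown on a hypothetical plain data.frame named toy rather than on dum itself:

```r
# Hypothetical stand-in for `dum`: 2 persons x 4 one-hot item columns
toy <- data.frame(
  person     = rep(101:102, each = 4),
  scale_last = rep(c(NA, 1, NA, 1), 2),
  item_X1    = rep(c(1, 0, 0, 0), 2),
  item_X2    = rep(c(0, 1, 0, 0), 2),
  item_X3    = rep(c(0, 0, 1, 0), 2),
  item_X4    = rep(c(0, 0, 0, 1), 2)
)

cols <- grep("^item", names(toy))      # positions of the item columns
last_rows <- toy$scale_last %in% 1     # TRUE only for the flagged rows

# Zero out every item column in the flagged rows in one step
toy[last_rows, cols] <- 0

# Flagged rows should now contain only zeros in the item columns
rowSums(toy[last_rows, cols])
```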
Upvotes: 1