Reputation:
Table1$subject contains the values "Biology", "Chemistry", and "Physics". For Table2, I want to recode this column, replacing every instance of Biology or Chemistry with 1 and every instance of Physics with 0.
I tried the following code, since I believe this is achievable using the recode and case_when commands:
Table2 <- recode(Table1, case_when(
.$subject <= "biology" ~ 1,
.$subject <= "chemistry" ~ 1,
.$subject <= "physics" ~ 0))
Currently, I get an error message saying "case_when must be a two-sided formula, not a logical". I'm new to R so I'm not quite sure what I'm doing wrong. Really grateful if anyone has any ideas!
Upvotes: 4
Views: 5754
Reputation: 14958
This reminded me of when I first started working with R, too, and I walked over and asked the data scientists this very same question.
They shared with me a different approach that is generally preferable in these situations. I have looked back many times and appreciated learning it early on.
The database normalization approach (unless someone out there can help us with a better name) involves mapping your code values into a separate dataframe. You then join that dataframe of mapped values to the dataframe you want to encode.
This helps keep the code more strictly responsible for manipulations, and the dataframes responsible for holding values/data. This not only can speed up much of your work, saving you from hand-coding in hard-coded lookup tables, but in the longer term it will make it much easier when someone is debugging or performing modifications and re-developments.
The normalized data management approach then would look like:
library(tibble)
library(dplyr)

# your code mapping
df_map <- tribble(
  ~subject,    ~subj_cd,
  "chemistry", 1,
  "biology",   1,
  "physics",   0
)
# a dummy raw dataframe that you might be wanting to encode
df_raw <- tibble(
  stud_id = 2678:2877,
  subject = sample(c("chemistry", "biology", "physics", "astronomy"),
                   200, replace = TRUE)
)
# encoding the data
df_coded <-
  df_raw %>%
  left_join(df_map, by = "subject")  # name the join key explicitly
df_coded
# A tibble: 200 x 3
   stud_id subject   subj_cd
     <int> <chr>       <dbl>
 1    2678 physics         0
 2    2679 physics         0
 3    2680 biology         1
 4    2681 astronomy      NA
 5    2682 chemistry       1
 6    2683 chemistry       1
 7    2684 physics         0
 8    2685 chemistry       1
 9    2686 chemistry       1
10    2687 astronomy      NA
# ... with 190 more rows
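One thing the output above shows is that a subject missing from the map (here "astronomy") comes back as NA after the left join. A minimal sketch of how you might list such unmapped values before relying on the codes, assuming the same df_map as above and a small hypothetical df_small invented for illustration:

```r
library(tibble)
library(dplyr)

# Same mapping as in the answer above
df_map <- tribble(
  ~subject,    ~subj_cd,
  "chemistry", 1,
  "biology",   1,
  "physics",   0
)

# Hypothetical fixed data, used instead of sample() so the result is reproducible
df_small <- tibble(
  stud_id = 1:4,
  subject = c("physics", "astronomy", "biology", "chemistry")
)

# Rows of df_small whose subject has no match in df_map
unmapped <- df_small %>%
  anti_join(df_map, by = "subject") %>%
  distinct(subject)

unmapped
```

If unmapped has any rows, you can either add those subjects to the map or accept the NA codes deliberately.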
If you find yourself needing a quick and easy way to build longer code maps (or, especially, to share them with other people), then you will probably find Jenny Bryan's googlesheets package very helpful (she's a member of team tidyverse). A really helpful vignette for it can be found here
Upvotes: 3
Reputation: 3007
Both recode and case_when operate on vectors, not data frames. So to create a new data frame you need to first call mutate, and then within mutate use either recode or case_when to create a new column (or overwrite an existing one).
(Also, as of the latest dplyr release you no longer need to use the .$ when using case_when)
library(tibble)
library(dplyr)
df <- tribble(
  ~subject,
  "chemistry",
  "biology",
  "physics"
)
df %>%
  mutate(subject2 = case_when(
    subject == "chemistry" ~ 1,
    subject == "biology" ~ 1,
    subject == "physics" ~ 0
  ))
#> # A tibble: 3 x 2
#>   subject   subject2
#>   <chr>        <dbl>
#> 1 chemistry        1
#> 2 biology          1
#> 3 physics          0
df %>%
  mutate(subject2 = recode(
    subject,
    "chemistry" = 1,
    "biology" = 1,
    "physics" = 0
  ))
#> # A tibble: 3 x 2
#>   subject   subject2
#>   <chr>        <dbl>
#> 1 chemistry        1
#> 2 biology          1
#> 3 physics          0
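Because case_when works on vectors, you can also try it directly on a character vector before putting it inside mutate. The sketch below uses the capitalised values and the 0/1 coding from the question; the subject vector and the Table1 stand-in are hypothetical, made up here for illustration, and %in% is just one way to collapse the two conditions that map to 1.

```r
library(tibble)
library(dplyr)

# Hypothetical vector with the capitalised values from the question,
# plus one value that no condition covers
subject <- c("Biology", "Chemistry", "Physics", "Astronomy")

# Values that match no condition become NA
coded <- case_when(
  subject %in% c("Biology", "Chemistry") ~ 1,
  subject == "Physics" ~ 0
)
coded

# The same logic inside mutate(), on a hypothetical stand-in for Table1
Table1 <- tibble(subject = subject)
Table2 <- Table1 %>%
  mutate(subject = case_when(
    subject %in% c("Biology", "Chemistry") ~ 1,
    subject == "Physics" ~ 0
  ))
Table2
```

Note that the comparison is case-sensitive, so "Biology" and "biology" are different values; the strings in the conditions need to match the data exactly.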
Upvotes: 5