Reputation: 785
I'm trying to understand why I cannot use dplyr::case_when in place of nested dplyr::if_else calls.
Probably I'm missing something. Let me explain:
I have this operation, which works fine:
df %>%
  mutate(
    keep = if_else(
      assembly_level != "Complete Genome" | genome_rep != "Full",
      FALSE,
      ifelse(
        version_status == "suppressed",
        FALSE,
        if_else(
          refseq_category %in% c("reference genome", "representative genome"),
          TRUE,
          if_else(
            rpseudo > 0.4,
            FALSE,
            TRUE
          )
        )
      )
    )
  )
But when I try using case_when this way:
df %>%
  mutate(
    keep = case_when(
      assembly_level != "Complete Genome" | genome_rep != "Full" ~ FALSE,
      version_status == "suppressed" ~ FALSE,
      refseq_category %in% c("reference genome", "representative genome") ~ TRUE,
      rpseudo > 0.4 ~ FALSE,
      TRUE ~ TRUE
    )
  )
I get different results.
I think the problem is just how I'm using the function.
If you need the data, it is publicly available and can be downloaded here: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
To read it into df:
library(readr)  # read_tsv
library(dplyr)  # %>%, select, mutate
read_tsv("ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt",
  comment = "#",
  col_names = c(
    "assembly", "bioproject", "biosample",
    "wgs_master", "refseq_category", "taxid",
    "species_taxid", "organism_name", "infraspecific_name",
    "isolate", "version_status", "assembly_level",
    "release_type", "genome_rep", "seq_rel_date",
    "asm_name", "submitter", "gbrs_paired_asm",
    "paired_asm_comp", "ftp_path", "excluded_from_refseq", "relation_to_type_material"
  )
) %>%
  select(assembly, refseq_category,
         assembly_level, genome_rep,
         version_status, release_type) %>%
  mutate(
    rpseudo = runif(nrow(.), 0, 1)
  ) -> df
# this will produce some warnings
Thanks in advance,
Upvotes: 2
Views: 4039
Reputation: 388982
There are NAs in the data. Store the output from if_else in df1 and the output from case_when in df2. The only difference between df1$keep and df2$keep is that df1$keep contains a few NAs, and in those positions case_when has produced actual values. Check:
table(df1$keep, useNA = "always")
# FALSE TRUE <NA>
#156616 10386 79
table(df2$keep, useNA = "always")
# FALSE TRUE <NA>
#156647 10434 0
The differences between the two tables add up to exactly the number of NAs:
(156647 - 156616) + (10434 - 10386)
#[1] 79
Also, if you drop the positions where df1$keep is NA and then compare the remaining values in df1 and df2, they are identical.
all(df1$keep[!is.na(df1$keep)] == df2$keep[!is.na(df1$keep)])
#[1] TRUE
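The same comparison can be tried on a small scale without downloading the data. The sketch below uses two made-up logical vectors, keep_if_else and keep_case_when (hypothetical stand-ins for df1$keep and df2$keep, not values from the real data set), to locate the positions where only one of the two results is NA and to confirm that everything else agrees.

```r
# Hypothetical stand-ins for df1$keep (if_else) and df2$keep (case_when)
keep_if_else   <- c(TRUE, NA, FALSE, NA)
keep_case_when <- c(TRUE, TRUE, FALSE, FALSE)

# Positions where one result is NA and the other is not
differs <- is.na(keep_if_else) != is.na(keep_case_when)
diff_positions <- which(differs)
diff_positions
#[1] 2 4

# Outside those positions the two results are identical
same_elsewhere <- identical(keep_if_else[!differs], keep_case_when[!differs])
same_elsewhere
#[1] TRUE
```

On the real data, filtering df on those positions would show which input columns contain the NA values responsible for the difference.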
if_else and case_when handle NA differently. Consider this simplified example for better understanding.
library(dplyr)
df <- data.frame(a = c(1:3, NA, 4:7), b = c(NA, letters[1:7]))
Now let's create some arbitrary conditions to test. Using if_else:
df %>%
  mutate(res = if_else(a > 3, "Yes",
                 if_else(b == "c", "No",
                   if_else(a > 5, "Maybe", "Done"))))
# a b res
#1 1 <NA> <NA>
#2 2 a Done
#3 3 b Done
#4 NA c <NA>
#5 4 d Yes
#6 5 e Yes
#7 6 f Yes
#8 7 g Yes
However, with case_when you get:
df %>%
  mutate(res = case_when(a > 3 ~ "Yes",
                         b == "c" ~ "No",
                         a > 5 ~ "Maybe",
                         TRUE ~ "Done"))
# a b res
#1 1 <NA> Done
#2 2 a Done
#3 3 b Done
#4 NA c No
#5 4 d Yes
#6 5 e Yes
#7 6 f Yes
#8 7 g Yes
Notice that when if_else encounters an NA condition, it returns NA immediately. case_when, by contrast, treats an NA condition as FALSE: it moves on to the next condition until one is satisfied, and otherwise returns the value of the TRUE ~ branch.
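If the two functions need to agree, either behavior can be reproduced with the other function. The following is a minimal sketch on a made-up vector x (not part of the original data): an explicit is.na() branch placed first makes case_when propagate NA the way if_else does, and the missing argument of if_else supplies a value for NA conditions, which here mimics the case_when fall-through.

```r
library(dplyr)

x <- c(1, NA, 5)  # hypothetical example vector

# if_else propagates an NA condition into the result
r_if_else <- if_else(x > 3, "Yes", "No")
#[1] "No"  NA    "Yes"

# case_when skips an NA condition and falls through to TRUE ~
r_case_when <- case_when(x > 3 ~ "Yes", TRUE ~ "No")
#[1] "No"  "No"  "Yes"

# case_when with an explicit NA branch first: matches if_else
r_case_when_na <- case_when(
  is.na(x) ~ NA_character_,
  x > 3    ~ "Yes",
  TRUE     ~ "No"
)

# if_else with `missing` set: matches the plain case_when
r_if_else_missing <- if_else(x > 3, "Yes", "No", missing = "No")
```

With nested conditions, as in the question, each level whose condition can be NA would need its own is.na() branch (or missing value) in the same order as the nesting.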
data
set.seed(1234)
read_tsv("ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt",
  comment = "#",
  col_names = c(
    "assembly", "bioproject", "biosample",
    "wgs_master", "refseq_category", "taxid",
    "species_taxid", "organism_name", "infraspecific_name",
    "isolate", "version_status", "assembly_level",
    "release_type", "genome_rep", "seq_rel_date",
    "asm_name", "submitter", "gbrs_paired_asm",
    "paired_asm_comp", "ftp_path", "excluded_from_refseq", "relation_to_type_material"
  )
) %>%
  select(assembly, refseq_category,
         assembly_level, genome_rep,
         version_status, release_type) %>%
  mutate(
    rpseudo = runif(nrow(.), 0, 1)
  ) -> df
Upvotes: 6