Reputation: 59
I want to know how I can get all the values that could not be processed by the gender package. Please have a look at the following code:
library(gender)
test = tibble::tribble(
~Name1,
"Peter",
"Susan",
"Nuernberg",
"Test",
"Heiko",
"He"
)
test$Name1 <- as.character(test$Name1)
genderpred = gender(test$Name1, method = "ssa")
Created on 2021-06-03 by the reprex package (v2.0.0)
As you can see, genderpred does not contain the names whose gender is unknown. How can I include them in the matrix, with NA as their gender?
Thanks for your help!
Upvotes: 0
Views: 359
Reputation: 3326
Given test as a "matrix" (really a tibble) of names, you can simply use dplyr::right_join() as follows:
library(gender)
library(dplyr)
# ...
# Your code to get the 'test' dataset of names.
# ...
# Consolidate any names (Name1, Name2, ...) into a single column.
consolidated <- data.frame(all_names = as.character(as.vector(as.matrix(test))))
# Get the gender predictions.
genderpred <- gender(consolidated$all_names, method = "ssa")
# Perform the join using the consolidated names.
genderpred <- genderpred %>%
right_join(consolidated, by = c("name" = "all_names"))
to get your desired result for genderpred, like:
  name      proportion_male proportion_female gender year_min year_max
  <chr>               <dbl>             <dbl> <chr>     <dbl>    <dbl>
1 Peter               0.995            0.0053 male       1985     1985
2 Susan              0.0067             0.993 female     1985     1985
3 Nuernberg              NA                NA NA           NA       NA
4 Test                   NA                NA NA           NA       NA
5 Heiko                  NA                NA NA           NA       NA
6 He                     NA                NA NA           NA       NA
By using a right_join, you include all the names from test: not just those with a matching name in genderpred. When such a name (like "Nuernberg") has no match, it populates a new row that is "blank" (filled with NAs).
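To see the difference without downloading the gender data, here is a minimal sketch that uses a hand-built data frame in place of the gender() output. The objects pred, all_names, inner and right are made up for illustration, and the two classified rows simply reuse the genders from the printed result above. It compares an inner join, which drops the unmatched names, with the right join used above.

```r
library(dplyr)

# Stand-in for the output of gender(): only the names it could classify.
# (Hypothetical object; genders copied from the printed result above.)
pred <- data.frame(
  name   = c("Peter", "Susan"),
  gender = c("male", "female"),
  stringsAsFactors = FALSE
)

# Every name that was submitted, including the ones with no match.
all_names <- data.frame(
  all_names = c("Peter", "Susan", "Nuernberg", "Test", "Heiko", "He"),
  stringsAsFactors = FALSE
)

# inner_join keeps only the names present in both tables.
inner <- inner_join(pred, all_names, by = c("name" = "all_names"))

# right_join keeps every row of the right-hand table;
# names without a match get NA in the columns coming from pred.
right <- right_join(pred, all_names, by = c("name" = "all_names"))

nrow(inner)  # 2
nrow(right)  # 6
```

If the original input order matters, starting from the full list and using left_join(all_names, pred, by = c("all_names" = "name")) should keep the rows in the order of all_names, at the cost of the key column being called all_names rather than name.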
The dplyr documentation for joins (see ?dplyr::right_join) has further details.
Per the poster's request, I have extended the code (above) to handle multiple name columns in test. As such, an initial test dataset like
test <- tibble::tribble(
~Name1, ~Name2, # ... ~Name_n
"Peter", "Gary", # ... .
"Susan", "Mary", # ... .
"Nuernberg", "Heisenberg", # ... .
"Test", "And", # ... .
"Heiko", "So", # ... .
"He", "Forth" # ... .
)
will give a result for genderpred like
   name       proportion_male proportion_female gender year_min year_max
   <chr>                <dbl>             <dbl> <chr>     <dbl>    <dbl>
 1 Gary                 0.996            0.0035 male       1932     2012
 2 Mary                0.0038             0.996 female     1932     2012
 3 Peter                0.997            0.0032 male       1932     2012
 4 Susan               0.0023             0.998 female     1932     2012
 5 Nuernberg               NA                NA NA           NA       NA
 6 Test                    NA                NA NA           NA       NA
 7 Heiko                   NA                NA NA           NA       NA
 8 He                      NA                NA NA           NA       NA
 9 Heisenberg              NA                NA NA           NA       NA
10 And                     NA                NA NA           NA       NA
11 So                      NA                NA NA           NA       NA
12 Forth                   NA                NA NA           NA       NA
which can then be filtered (dplyr::filter()) and sorted (dplyr::arrange()) as desired.
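For instance, to pull out only the names that could not be processed, which is what the question asks for, something along these lines should work. This is a sketch on a hand-built data frame: joined and unknown are hypothetical names, and joined only mirrors the shape of the joined result with the numeric columns left out.

```r
library(dplyr)

# Stand-in for the joined result (hypothetical; mirrors the shape shown above).
joined <- data.frame(
  name   = c("Peter", "Susan", "Nuernberg", "Test", "Heiko", "He"),
  gender = c("male", "female", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the names with no predicted gender, sorted alphabetically.
unknown <- joined %>%
  filter(is.na(gender)) %>%
  arrange(name)

unknown$name
# "He" "Heiko" "Nuernberg" "Test"
```

Replacing is.na(gender) with !is.na(gender) gives the complementary set, the names that were classified.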
Upvotes: 2