Max H.

Reputation: 59

Getting 'NA' for not finding gender in gender package in R

I want to know how I can get all the values that could not be processed by the gender package. Please have a look at the following code:

library(gender)
test = tibble::tribble(
              ~Name1,
             "Peter",
             "Susan",
         "Nuernberg",
              "Test",
             "Heiko",
                "He"
         )
test$Name1 <- as.character(test$Name1) 
genderpred = gender(test$Name1, method = "ssa")

Created on 2021-06-03 by the reprex package (v2.0.0)

As you can see, genderpred does not contain the names whose gender is unknown. How can I get them into the matrix with 'NA'?

Thanks for your help!

Upvotes: 0

Views: 359

Answers (1)

Greg

Reputation: 3326

Given test as a "matrix" (really a tibble) of names, you can simply use dplyr::right_join() as follows:

library(gender)
library(dplyr)

# ...
# Your code to get the 'test' dataset of names.
# ...

# Consolidate any names (Name1, Name2, ...) into a single column.
consolidated <- data.frame(all_names = as.character(as.vector(as.matrix(test))))

# Get the gender predictions.
genderpred <- gender(consolidated$all_names, method = "ssa")

# Perform the join using the consolidated names.
genderpred <- genderpred %>%
  right_join(consolidated, by = c("name" = "all_names"))

to get your desired result for genderpred, like:

  name    proportion_male proportion_female gender year_min year_max
  <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
1 Peter            0.995             0.0053 male       1985     1985
2 Susan            0.0067            0.993  female     1985     1985
3 Nuernberg       NA                NA      NA           NA       NA
4 Test            NA                NA      NA           NA       NA
5 Heiko           NA                NA      NA           NA       NA
6 He              NA                NA      NA           NA       NA

Because this is a right_join, the result keeps every name from test (via consolidated), not just those with a match in genderpred. A name with no match, like "Nuernberg", still gets its own row, with NA in every prediction column.
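To see the join behaviour in isolation, here is a minimal sketch that does not call gender() at all. The predictions table is a hypothetical stand-in for its output (made-up rows, fewer columns), and all_names plays the role of consolidated:

```r
library(dplyr)

# Hypothetical stand-in for the output of gender(): only recognized names appear.
predictions <- tibble::tibble(
  name   = c("Peter", "Susan"),
  gender = c("male", "female")
)

# Every name we asked about, including one with no prediction.
all_names <- tibble::tibble(all_names = c("Peter", "Susan", "Nuernberg"))

# right_join keeps every row of the right-hand table (all_names);
# names missing from predictions get NA in the prediction columns.
joined <- predictions %>%
  right_join(all_names, by = c("name" = "all_names"))

# The same rows, written from the names' side with left_join.
# Note the key column keeps the left table's name (all_names) here.
joined_left <- all_names %>%
  left_join(predictions, by = c("all_names" = "name"))
```

Which of the two you use is mostly a matter of taste: right_join keeps the key column called name, while left_join reads more naturally if you think of the names as the starting point.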

The dplyr documentation for joins can be opened in R with ?dplyr::right_join.

Update

Per the poster's request, I have extended the code (above) to handle multiple name columns in test. As such, an initial test dataset like

test <- tibble::tribble(
  ~Name1,      ~Name2,       # ...  ~Name_n
  "Peter",     "Gary",       # ...     .
  "Susan",     "Mary",       # ...     .
  "Nuernberg", "Heisenberg", # ...     .
  "Test",      "And",        # ...     .
  "Heiko",     "So",         # ...     .
  "He",        "Forth"       # ...     .
)

will give a result for genderpred like

   name       proportion_male proportion_female gender year_min year_max
   <chr>                <dbl>             <dbl> <chr>     <dbl>    <dbl>
 1 Gary                0.996             0.0035 male       1932     2012
 2 Mary                0.0038            0.996  female     1932     2012
 3 Peter               0.997             0.0032 male       1932     2012
 4 Susan               0.0023            0.998  female     1932     2012
 5 Nuernberg          NA                NA      NA           NA       NA
 6 Test               NA                NA      NA           NA       NA
 7 Heiko              NA                NA      NA           NA       NA
 8 He                 NA                NA      NA           NA       NA
 9 Heisenberg         NA                NA      NA           NA       NA
10 And                NA                NA      NA           NA       NA
11 So                 NA                NA      NA           NA       NA
12 Forth              NA                NA      NA           NA       NA

which can then be filtered (dplyr::filter()) and sorted (dplyr::arrange()) as desired.
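As a rough sketch of that last step, the following pulls out just the names that could not be classified, which is what the question asks for. The genderpred table here is a hypothetical, cut-down copy of the joined result (only two columns, made-up rows), so the snippet runs without the gender package:

```r
library(dplyr)

# Hypothetical stand-in for the joined genderpred (fewer columns and rows).
genderpred <- tibble::tibble(
  name   = c("Gary", "Mary", "Nuernberg", "Test"),
  gender = c("male", "female", NA, NA)
)

# Names the gender package could not classify, sorted by name.
not_found <- genderpred %>%
  filter(is.na(gender)) %>%
  arrange(name)

# Names that did get a prediction, sorted by name.
found <- genderpred %>%
  filter(!is.na(gender)) %>%
  arrange(name)
```

With the real genderpred from the join, not_found$name would give the unmatched names as a plain character vector.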

Upvotes: 2
