Dail

Reputation: 4608

Gender detection by full name

I want to create a model that detects the gender based on a full name. I have two dictionaries with male & female names. I want to develop a model to classify previously unseen names.

I need to determine the gender after the NER (named-entity recognition) step. This delivers a PERSON entity with any one of these characteristics:

I can do male vs female determination on (given) name only. The model needs to handle SURNAME only, classifying it as NO_GENDER.

I know that surnames can be noisy, but I must deal with them, because they could be a part of the input.

Upvotes: 0

Views: 1277

Answers (1)

Prune

Reputation: 77880

First, pre-process the data: in a full-name input, keep only the given name (see below). Apply the same pre-processing to unknown input as well.

I suggest that you train a multi-class SVM. You already know the three classes. Make up the following training (labeled) data:

  • NO_GENDER: names on both the girls' and boys' lists
  • FEMALE: names on only the girls' list
  • MALE: names on only the boys' list
  • NO_GENDER: known surnames
  • NO_GENDER: non-name strings

Essentially, you train this to recognize FEMALE, MALE, and everything else.
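As a rough illustration, the labeling scheme above could be assembled along these lines. This is only a sketch: the function name `build_labeled_data` and its arguments are hypothetical, and it assumes that when a string appears both as a known surname and as a gendered given name, the given-name label wins. The resulting (string, label) pairs would then be turned into features and fed to whatever multi-class classifier you choose.

```python
def build_labeled_data(girl_names, boy_names, surnames, non_names):
    """Return a dict mapping each lowercased string to one of
    FEMALE, MALE, or NO_GENDER (hypothetical helper, not a library API)."""
    girls = {n.strip().lower() for n in girl_names}
    boys = {n.strip().lower() for n in boy_names}

    labeled = {}

    # Known surnames and non-name strings are NO_GENDER.
    for s in list(surnames) + list(non_names):
        labeled[s.strip().lower()] = "NO_GENDER"

    # Assumption: a given-name label overrides a surname label
    # when the same string appears in both sources.
    for n in girls - boys:
        labeled[n] = "FEMALE"      # only on the girls' list
    for n in boys - girls:
        labeled[n] = "MALE"        # only on the boys' list
    for n in girls & boys:
        labeled[n] = "NO_GENDER"   # on both lists

    return labeled


data = build_labeled_data(
    girl_names=["Mary", "Alex"],
    boy_names=["John", "Alex"],
    surnames=["Smith"],
    non_names=["hello"],
)
```

Here "alex" ends up as NO_GENDER because it is on both lists, while "smith" and "hello" land in the same catch-all class.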

PREPROCESS

This will give you some trouble, due to varying name formats. Compound names are a particular problem, such as

Bobby Jo             male name with female modifier
van der Waal         compound surname with male-looking prefix
St. John             surname with gendered primary
Haley-Christopher    hyphenated surname, gendered

If you pre-process the inputs, you may have some trouble spotting the proper division in, say, Billy Jean St. John or Marie-Therese von Klaus.
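To make the difficulty concrete, here is a deliberately naive splitting heuristic. It is a sketch under stated assumptions, not a recommended solution: the particle list `SURNAME_PARTICLES` and the function `split_full_name` are made up for illustration, and the rule "the last token is the surname unless a particle appears earlier" is just one possible guess.

```python
# Hypothetical, incomplete list of tokens that often start a surname.
SURNAME_PARTICLES = {"van", "von", "der", "de", "la", "st.", "st"}


def split_full_name(full_name):
    """Split a name into (given_part, surname_part) using a crude heuristic."""
    tokens = full_name.split()

    # A single token cannot be split; pass it through as the candidate
    # and let the classifier decide whether it is gendered.
    if len(tokens) <= 1:
        return full_name.strip(), ""

    # The surname is assumed to start at the first particle, if any.
    for i, tok in enumerate(tokens):
        if tok.lower() in SURNAME_PARTICLES:
            return " ".join(tokens[:i]), " ".join(tokens[i:])

    # Otherwise assume the last token is the surname.
    return " ".join(tokens[:-1]), tokens[-1]


examples = [
    "Billy Jean St. John",
    "Marie-Therese von Klaus",
    "van der Waal",
    "Bobby Jo",
]
results = {name: split_full_name(name) for name in examples}
```

This handles Billy Jean St. John and Marie-Therese von Klaus, and treats van der Waal as surname-only, but it splits Bobby Jo into given name "Bobby" and surname "Jo", which is exactly the kind of wrong division described above.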

Upvotes: 1
