ZSH

Classifying Gender (and likely age range) from First Name

After trying to upload a dataset (as a CSV) to H2O, and finding that the FirstName column gets converted to null/missing, I learned that the current version of H2O doesn't support columns of class string, and factors only go up to 65k unique values. So now I'm looking for another way to solve this problem.

I want to end up with a model that, given any FirstName, returns its most likely gender and a likely age range.

Which R functions (or packages::functions) would work for this? Preferably well-documented packages/functions so I can learn more as I go.

Here's a sample of the dataset in R. The column types are: Numerical, factor, factor, numerical.

> head(TrainingNames)

  Year FirstName Gender Freq
1 1880      Mary      F 7065
2 1880      Anna      F 2604
3 1880      Emma      F 2003
4 1880 Elizabeth      F 1939
5 1880    Minnie      F 1746
6 1880  Margaret      F 1578

> summary(TrainingNames)

      Year        FirstName       Gender           Freq        
 Min.   :1880   Francis:    268   F:1062432   Min.   :    5.0  
 1st Qu.:1948   James  :    268   M: 729659   1st Qu.:    7.0  
 Median :1981   Jean   :    268               Median :   12.0  
 Mean   :1972   Jesse  :    268               Mean   :  186.1  
 3rd Qu.:2000   Jessie :    268               3rd Qu.:   32.0  
 Max.   :2013   John   :    268               Max.   :99674.0  
                (Other):1790483                                

Here's R code to pull/process the data-source.

# Create data dir, download and extract data source
dir.create('Data Files', showWarnings = FALSE)
if(!file.exists('Data Files/names.zip')) {
  # mode = 'wb' keeps the zip intact on Windows
  download.file(url = 'http://www.ssa.gov/oact/babynames/names.zip', destfile = 'Data Files/names.zip', mode = 'wb', cacheOK = TRUE)
  # Extract into the data dir without changing the working directory
  unzip(zipfile = 'Data Files/names.zip', exdir = 'Data Files')
}

FileList <- list.files(path = "Data Files/", pattern = "\\.txt$") # List of data files (names ending in .txt)

# Create data-source of names for R/Tableau

munge <- function(f) { # Return data frame of single data file
  y <- as.numeric(gsub(pattern = '[^0-9]', replacement = "", x = f))
  l <- read.csv(file = paste0("Data Files/", f), header = F, quote = "'")
  d <- cbind(y, l)
  colnames(d) <- c("Year", "FirstName", "Gender", "Freq")
  return(data.frame(d))
}

if(!file.exists('TrainingNames.csv')) {
  pb <- txtProgressBar(min = 1, max = length(FileList), style = 3) # Start progress bar

  TrainingNames <- munge(FileList[[1]]) # Munge first data file
  for(n in 2:length(FileList)) { # Munge remaining data files
    TrainingNames <- rbind(TrainingNames, munge(FileList[[n]]))
    setTxtProgressBar(pb, n)
  }

  close(pb) # Close progress bar
  rm(n, pb)

  write.table(x = TrainingNames, file = "TrainingNames.csv", sep = ";", row.names = FALSE, col.names = TRUE) # Write results to CSV file
} else {
  # Reuse the cached CSV so TrainingNames also exists on later runs
  TrainingNames <- read.csv(file = "TrainingNames.csv", sep = ";", stringsAsFactors = TRUE)
}

summary(TrainingNames)

Answers (1)

CephBirk

Here I've defined a function name_stats that does what you asked. Run the code in your question first so that TrainingNames exists; the function depends on it.

You can edit whatever you like to make it fit your specific needs.

name_stats <- function(name) {
    # All rows for this name (every year, both genders)
    df <- subset(TrainingNames, FirstName == name)
    if (nrow(df) == 0) stop('No rows found for name: ', name)
    # Births per gender; fixed levels give NA for an absent gender, which is set to 0
    gender <- tapply(df$Freq, factor(df$Gender, levels = c('F', 'M')), sum)
    gender[is.na(gender)] <- 0
    prob_male <- gender['M'] / sum(gender)
    prob_female <- gender['F'] / sum(gender)
    # Age today for each birth year, weighted by the number of births in that year
    age <- as.numeric(format(Sys.Date(), '%Y')) - df$Year
    n <- sum(as.numeric(df$Freq))
    mean_age <- sum(age * df$Freq) / n
    sd_age <- sqrt(sum(df$Freq * (age - mean_age)^2) / (n - 1))
    cat('Probability', name, 'is male is', round(prob_male, 6), '\n')
    cat('Probability', name, 'is female is', round(prob_female, 6), '\n')
    cat('Mean age of', name, 'is', round(mean_age, 6), '\n')
    cat('SD age of', name, 'is', round(sd_age, 6), '\n')
}
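
If you would rather get values back than printed text, and want an age range instead of a mean and SD, something along these lines might work. This is only a sketch: name_summary, the toy data frame, the ref_year argument and the 10%/90% cut-offs are all my own illustrative choices, not part of the SSA data or of any package. It also ignores mortality, so older ages will be over-represented.

```r
# Toy stand-in for TrainingNames (made-up counts, for illustration only)
toy <- data.frame(
  Year      = c(1950, 1950, 1990, 1990, 2010),
  FirstName = c("Jessie", "Jessie", "Jessie", "Jessie", "Emma"),
  Gender    = c("F", "M", "F", "M", "F"),
  Freq      = c(300, 100, 150, 50, 1000)
)

# Hypothetical helper: returns a list instead of printing
name_summary <- function(name, data,
                         ref_year = as.numeric(format(Sys.Date(), "%Y"))) {
  df <- data[data$FirstName == name, ]
  if (nrow(df) == 0) return(NULL)  # name not present in the data

  # Births per gender; fixed levels keep a slot for a gender with no rows
  by_gender <- tapply(df$Freq, factor(df$Gender, levels = c("F", "M")), sum)
  by_gender[is.na(by_gender)] <- 0

  # Births per age (groups come back sorted by age), then cumulative share
  age <- ref_year - df$Year
  by_age <- tapply(df$Freq, age, sum)
  ages <- as.numeric(names(by_age))
  cum_share <- cumsum(by_age) / sum(by_age)

  # Weighted quantile: youngest age at which the cumulative share reaches p
  wq <- function(p) ages[which(cum_share >= p)[1]]

  list(
    prob_female = unname(by_gender["F"] / sum(by_gender)),
    prob_male   = unname(by_gender["M"] / sum(by_gender)),
    age_range   = c(lower = wq(0.1), upper = wq(0.9))
  )
}

# ref_year is pinned here only so the result does not depend on today's date
res <- name_summary("Jessie", toy, ref_year = 2020)
str(res)
```

With the real data you would pass TrainingNames instead of toy and leave ref_year at its default. Returning NULL for an unknown name is just one option; you may prefer an error or a row of NAs.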
