Reputation: 652
After trying to upload a dataset (as a CSV) to H2O, and finding that the FirstName column gets converted to null/missing, I learned that the current version of H2O doesn't support columns of class string, and factors only go up to 65k unique values. So now I'm looking for another way to solve this problem.
I want to end with a model that, given any FirstName, will return:
Which R functions (or packages::functions) would work for this? Preferably well-documented packages/functions so I can learn more as I go.
Here's a sample of the dataset in R. The column types are numeric, factor, factor, numeric.
> head(TrainingNames)
  Year FirstName Gender Freq
1 1880      Mary      F 7065
2 1880      Anna      F 2604
3 1880      Emma      F 2003
4 1880 Elizabeth      F 1939
5 1880    Minnie      F 1746
6 1880  Margaret      F 1578
> summary(TrainingNames)
      Year        FirstName         Gender          Freq
 Min.   :1880   Francis:    268   F:1062432   Min.   :    5.0
 1st Qu.:1948   James  :    268   M: 729659   1st Qu.:    7.0
 Median :1981   Jean   :    268               Median :   12.0
 Mean   :1972   Jesse  :    268               Mean   :  186.1
 3rd Qu.:2000   Jessie :    268               3rd Qu.:   32.0
 Max.   :2013   John   :    268               Max.   :99674.0
                (Other):1790483
Here's R code to pull/process the data-source.
# Create data dir, download and extract data source
dir.create('Data Files', showWarnings = FALSE)
if (!file.exists('Data Files/names.zip')) {
  # mode = 'wb' keeps the zip archive intact on Windows
  download.file(url = 'http://www.ssa.gov/oact/babynames/names.zip', destfile = 'Data Files/names.zip', mode = 'wb', cacheOK = TRUE)
  unzip(zipfile = 'Data Files/names.zip', exdir = 'Data Files') # Extract next to the archive, no setwd() needed
}
FileList <- list.files(path = "Data Files/", pattern = "\\.txt$") # List of data files
# Create data-source of names for R/Tableau
munge <- function(f) { # Return data frame of single data file
  y <- as.numeric(gsub(pattern = '[^0-9]', replacement = "", x = f)) # Year comes from the digits in the file name
  l <- read.csv(file = paste0("Data Files/", f), header = FALSE, quote = "'", stringsAsFactors = TRUE) # Keep FirstName and Gender as factors
  d <- cbind(y, l)
  colnames(d) <- c("Year", "FirstName", "Gender", "Freq")
  return(d)
}
if (!file.exists('TrainingNames.csv')) {
  pb <- txtProgressBar(min = 0, max = length(FileList), style = 3) # Start progress bar
  TrainingNames <- munge(FileList[[1]]) # Munge first data file
  for (n in seq_along(FileList)[-1]) { # Munge remaining data files
    TrainingNames <- rbind(TrainingNames, munge(FileList[[n]]))
    setTxtProgressBar(pb, n)
  }
  close(pb) # Close progress bar
  rm(n, pb)
  write.table(x = TrainingNames, file = "TrainingNames.csv", sep = ";", row.names = FALSE, col.names = TRUE) # Write results to CSV file
} else {
  # Reload the cached file so TrainingNames also exists on later runs
  TrainingNames <- read.table(file = "TrainingNames.csv", sep = ";", header = TRUE, stringsAsFactors = TRUE)
}
summary(TrainingNames)
Upvotes: 0
Views: 1263
Reputation: 6720
Here I've defined a function, name_stats, that does as you request. You'll need to run the code in your question to create TrainingNames before the function will work.
You can edit whatever you like to make it fit your specific needs.
name_stats <- function(name) {
  df <- subset(TrainingNames, FirstName == name)
  if (nrow(df) == 0) stop("No records found for name: ", name)
  # Share of recorded births per gender; a gender the name never appears with counts as 0, not NA
  total <- sum(df$Freq)
  prob_male <- sum(df$Freq[df$Gender == 'M']) / total
  prob_female <- sum(df$Freq[df$Gender == 'F']) / total
  # Age today of someone born in each year, weighted by the number of births in that year
  age <- as.numeric(format(Sys.Date(), '%Y')) - df$Year
  mean_age <- sum(df$Freq * age) / total
  sd_age <- sqrt(sum(df$Freq * (age - mean_age)^2) / (total - 1))
  cat('Probability', name, 'is male is', round(prob_male, 6), '\n')
  cat('Probability', name, 'is female is', round(prob_female, 6), '\n')
  cat('Mean age of', name, 'is', round(mean_age, 6), '\n')
  cat('SD age of', name, 'is', round(sd_age, 6), '\n')
}
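If you need the statistics for every name at once rather than one lookup at a time, the same calculation can be run per name and stored in a table. The following is only a rough sketch under some assumptions: build_name_table is a hypothetical helper (not from any package), toy is a made-up stand-in for TrainingNames with the same four columns, and current_year is fixed to 2020 here just to keep the example reproducible. Like the function above, it treats everyone ever recorded as still alive, so the ages are crude estimates.

```r
# Toy stand-in for TrainingNames (made-up values, same column layout)
toy <- data.frame(
  Year      = c(1950, 1950, 2000, 2000, 1980),
  FirstName = c("Jessie", "Jessie", "Jessie", "Emma", "Emma"),
  Gender    = c("F", "M", "F", "F", "F"),
  Freq      = c(30, 10, 20, 50, 50),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row of statistics per distinct FirstName
build_name_table <- function(d, current_year = as.numeric(format(Sys.Date(), "%Y"))) {
  rows <- lapply(split(d, as.character(d$FirstName)), function(df) {
    total <- sum(df$Freq)                     # all recorded births for this name
    age <- current_year - df$Year             # age today for each birth year
    mean_age <- sum(df$Freq * age) / total    # frequency-weighted mean age
    data.frame(
      FirstName   = as.character(df$FirstName[1]),
      prob_male   = sum(df$Freq[df$Gender == "M"]) / total,
      prob_female = sum(df$Freq[df$Gender == "F"]) / total,
      mean_age    = mean_age,
      sd_age      = sqrt(sum(df$Freq * (age - mean_age)^2) / (total - 1)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

name_table <- build_name_table(toy, current_year = 2020)
print(name_table[name_table$FirstName == "Jessie", ])
```

On the full dataset you would call build_name_table(TrainingNames) once and then look names up in the result. Splitting by name may take a while with this many distinct names, so treat it as a starting point rather than a tuned solution.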
Upvotes: 1