PageSim

Reputation: 143

Question on filtering down a large dataset

I have a data set of popular baby names going back to 1880. I am trying to find the timelessly popular baby names, meaning the names that rank among the 30 most common for their gender in every year of my data.

I have tried using group_by, top_n, and filter, but I am not very well versed in the program yet, so I am unsure of the proper order and approach here.

library(babynames)

timeless <- babynames %>% group_by(name, sex, year) %>% top_n(30) %>% filter()

I am getting a large data table back with the 30 most common names for each year of data, but I want to narrow that down to the names that are in the top 30 in every year. My professor hinted that there should be four timeless boy names and one timeless girl name. Any help is appreciated!

Upvotes: 2

Views: 153

Answers (1)

www

Reputation: 39174

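Group by sex and year only, keep the top 30 names in each group, then count how many years each name appears in and keep the names that appear in every year of the data.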

library(babynames)
library(dplyr)

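timeless <- babynames %>% 
  group_by(sex, year) %>%   # one group per sex and year
  top_n(30) %>%             # top 30 rows per group; with no wt, ranks by the last column (prop)
  ungroup() %>%
  count(sex, name) %>%      # number of years in which each name made the top 30
  filter(n == max(babynames$year) - min(babynames$year) + 1)  # keep names present in every year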

timeless
# # A tibble: 5 x 3
#   sex   name          n
#   <chr> <chr>     <int>
# 1 F     Elizabeth   138
# 2 M     James       138
# 3 M     John        138
# 4 M     Joseph      138
# 5 M     William     138
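As a variation on the code above, here is a sketch that uses slice_max(), which the dplyr documentation recommends over top_n() from version 1.0.0 onward. It assumes dplyr 1.0.0 or later. The names n_years, timeless2, and years_in_top30 are my own and only for illustration. Ranking by n instead of prop should select the same rows, since both order names identically within a sex and year.

```r
library(babynames)
library(dplyr)

# Number of distinct years in the data (hypothetical helper name)
n_years <- n_distinct(babynames$year)

timeless2 <- babynames %>%
  group_by(sex, year) %>%
  slice_max(n, n = 30) %>%                          # top 30 by count; ties are kept by default
  ungroup() %>%
  count(sex, name, name = "years_in_top30") %>%     # years each name made the top 30
  filter(years_in_top30 == n_years)                 # keep names present in every year

timeless2
```

Naming the count column explicitly avoids confusion with the n column that babynames already has.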

Regarding your original code, group_by(name, sex, year) %>% top_n(30) does not do what you want because every combination of name, sex, and year is unique. Each group contains a single row, so there is nothing for top_n(30) to filter down to a "top 30".
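To see the effect of the grouping on a small scale, here is a minimal sketch on a made-up data frame with the same column layout as babynames. The data, the object names (toy, group_sizes, top2_wrong, top2_right), and the choice of a "top 2" are all hypothetical, and .groups = "drop" assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data shaped like babynames (year, sex, name, n)
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  sex  = "F",
  name = rep(c("Ann", "Beth", "Cara"), 2),
  n    = c(50, 30, 10, 40, 45, 20),
  stringsAsFactors = FALSE
)

# Grouping by name, sex, and year puts every row in its own group
group_sizes <- toy %>%
  group_by(name, sex, year) %>%
  summarise(rows = n(), .groups = "drop")

# top_n() on one-row groups returns every row unchanged
top2_wrong <- toy %>%
  group_by(name, sex, year) %>%
  top_n(2, n) %>%
  ungroup()

# Grouping by sex and year only lets top_n() rank the names within each year
top2_right <- toy %>%
  group_by(sex, year) %>%
  top_n(2, n) %>%
  ungroup()

nrow(top2_wrong)  # all 6 rows survive
nrow(top2_right)  # 4 rows: the top 2 names in each of the 2 years
```

Leaving name out of group_by() is what gives top_n() more than one row per group to compare.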

Upvotes: 1
