Open Data Buffalo is a great resource and initiative to make datasets open and available to the public.
My partner works at a Children’s hospital and is convinced of trending baby names. Well, I said to her let’s see what the data says!
So I ventured out into the world wide web and found a dataset called:
Baby Names: Beginning 2007
New York State (NYS) Baby Names are aggregated and displayed by the year, county, or borough where the mother resided as stated on a New York State or New York City (NYC) birth certificate. The frequency of the baby name is listed if there are 5 or more of the same baby name in a county outside of NYC or 10 or more of the same baby name in a NYC borough.
This dataset is already tidy: One row per observation (first_name or baby name) and one column per variable (e.g. the number of observed names in a county with the given gender on the birth certificate).
I first check a few data quality characteristics such as missingness and number of unique things in each column:
purrr::map_dfr(baby_names,~{sum(is.na(.x))})
# A tibble: 1 × 5
year first_name county sex name_count
<int> <int> <int> <int> <int>
1 0 0 0 0 0
purrr::map_dfr(baby_names,~{n_distinct(.x)})
# A tibble: 1 × 5
year first_name county sex name_count
<int> <int> <int> <int> <int>
1 14 2320 123 2 244
The baby_names dataset requires initial preprocessing
Now we need to transform our dataset by first converting columns to the appropriate data types:
year first_name county sex name_count
Min. :2007 EMMA : 528 Kings : 5715 Male :46737 Min. : 5.00
1st Qu.:2010 OLIVIA : 503 KINGS : 4957 Female:41162 1st Qu.: 6.00
Median :2014 LOGAN : 488 Suffolk: 4159 Median : 11.00
Mean :2013 MASON : 475 SUFFOLK: 4005 Mean : 17.77
3rd Qu.:2017 LIAM : 471 Queens : 3868 3rd Qu.: 19.00
Max. :2020 JACOB : 465 Nassau : 3804 Max. :297.00
(Other):84969 (Other):61391
After the data transformation, we see that counties are specified in different cases. We should revise this so county (and also first_name) are in one type of case such as title case:
year first_name county sex
Min. :2007 Emma : 528 Kings :10672 Male :46737
1st Qu.:2010 Olivia : 503 Suffolk : 8164 Female:41162
Median :2014 Logan : 488 Nassau : 7352
Mean :2013 Mason : 475 Queens : 7244
3rd Qu.:2017 Liam : 471 Westchester: 5844
Max. :2020 Jacob : 465 Erie : 5382
(Other):84969 (Other) :43241
name_count
Min. : 5.00
1st Qu.: 6.00
Median : 11.00
Mean : 17.77
3rd Qu.: 19.00
Max. :297.00
Many Baby Names Don’t Have Counts In A Year
My next data quality question is how many names have data each year?
theme_set(theme_bw(base_size =16))baby_names_transformed %>%summarize(n_years =n_distinct(year),.by=c(first_name)) %>%summarize(n_names =n_distinct(first_name),.by=n_years) %>%ggplot(aes(n_years,n_names)) +geom_bar(color="black",fill ="gray80",stat ="identity") +scale_x_continuous(breaks = scales::pretty_breaks(14)) +scale_y_continuous(expand =c(0,0.1),trans="sqrt",breaks = scales::pretty_breaks(10)) +labs(x="Year Data: Number of Years With Name Count Data",y="Number of Names With Year Data",title="Many Baby Names Don't Have Counts In A Year",subtitle="Every Name has Count Data For Atleast One Year") +theme(panel.grid.major.y =element_line(color="gray75"),panel.grid.minor.y =element_blank(),panel.grid.major.x =element_blank(),panel.grid.minor.x =element_blank() )
Most Baby Names Are Gender-Specific
Another question is are there many names that are unisex i.e. male and female names?
tmp <- baby_names_transformed %>%summarize(n_sex =n_distinct(sex),unisex = n_sex==2,onesex = n_sex==1,.by=c(first_name)) %>%summarize(`Unisex`=sum(unisex),`One Sex`=sum(onesex))tmp %>%pivot_longer(cols =everything()) %>%mutate(label = glue::glue("{name} (N={scales::comma(value)})")) %>%ggplot(aes(factor(1),value,fill=label)) +geom_bar(stat="identity",position ="fill") +scale_fill_brewer(palette ="Dark2") +scale_y_continuous(labels = scales::percent) +guides(fill=guide_legend(title=NULL)) +labs(x=NULL,y="Percent of Names",title="Most Baby Names Are Gender-Specific",subtitle ="There Are A Few Names That Are Unisex, However") +theme(axis.ticks.x =element_blank(),axis.text.x =element_blank(),legend.position ="top" )
Most Baby Names are Counted in a Few NYS Counties
My last data quality question is how many baby names have data across counties?
baby_names_transformed %>%summarise(n_counties =n_distinct(county),.by=first_name) %>%bind_cols(summarise(baby_names,total_counties =n_distinct(county)) ) %>%mutate(freq_counties = n_counties / total_counties ) %>%ggplot(aes(freq_counties,y=after_stat(count))) +geom_density(bw="nrd",color="blue",fill="cornflowerblue") +scale_x_continuous(labels = scales::percent) +scale_y_sqrt(breaks = scales::pretty_breaks(15)) +labs(x="Percent of Counties",y="Number of Names With County Data",title="Most Names are Counted in a Few NYS Counties") +theme(panel.grid.major.y =element_line(color="gray75"),panel.grid.minor.y =element_blank(),panel.grid.major.x =element_blank(),panel.grid.minor.x =element_blank() )
After transforming the dataset and checking data quality, we see that:
Baby names are sparsely annotated across counties
Baby names are, generally, specific to a year or are observed across all years
There are a few baby names that are not gender-specific.
What are the trending baby names in NYS?
I want to ask the question what are the trending baby names in NYS? My question is not specific to the gender or county. Also, my question wants to consider all the baby names in the dataset as possible influencing factors in which baby names are trending. Because of the missing count data, we should use a model to estimate the trend based on the available data. The model we will use is a generalized additive model (GAM) to predict the Poisson distribution count of the baby name. *
We can estimate baby name trends using generalized additive models
Let’s do a small example first. My subquestion is, is the baby name Charlotte trending over time irrespective of other factors?
library(mgcv)
Loading required package: nlme
Attaching package: 'nlme'
The following object is masked from 'package:dplyr':
collapse
This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.
tmp <- baby_names_transformed %>%filter(first_name=="Charlotte")form <-as.formula(glue::glue("name_count ~ s(year) + county:year"))fit <- mgcv::gam(form,family="poisson",data=tmp,method ="GACV.Cp")tmp %>%bind_cols(.pred = fit$fitted.values) %>%ggplot(aes(name_count,.pred)) +geom_point(shape=21) +geom_smooth(formula ='y ~ x',method="lm") +labs(x="Annotated Counts",y="Predicted Counts",title="Model Predictions Look Pretty Accurate With Not Much Bias")
coefs_ <- fit$coefficients[str_detect(names(fit$coefficients),"s\\(year\\)")] %>%unname()tibble(year =seq_along(coefs_),coef = coefs_) %>%ggplot(aes(year,coef)) +geom_line(linewidth=2) +labs(x="Time",y="Weight",title="Charlotte is predicted as falling in and out of fashion")
GAMs allow for estimating trends given “random” naming patterns across NYS counties over time
If we want to consider all names, we need to specify the name as a random effect and the trend line as a random slope between the name and the year:
We can use this formula in our model to get the trends of the baby names over time (Note: we switch from gam to bam so that we can fit the model faster using method = ‘fREML’):