What Baby Names are In and Out of Fashion?

Using open data to summarize baby names over time

Published

September 2, 2023

Open Data of Baby Names

Open Data Buffalo is a great resource and initiative to make datasets open and available to the public.

My partner works at a children’s hospital and is convinced that baby names follow trends. So I said to her: let’s see what the data says!

So I ventured out into the world wide web and found a dataset called:

Baby Names: Beginning 2007

New York State (NYS) Baby Names are aggregated and displayed by the year, county, or borough where the mother resided as stated on a New York State or New York City (NYC) birth certificate. The frequency of the baby name is listed if there are 5 or more of the same baby name in a county outside of NYC or 10 or more of the same baby name in a NYC borough.
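To make the suppression rule concrete, here is a minimal sketch with made-up counts. The `raw_counts` table, its numbers, and the `threshold` column are my own illustration and not part of the published dataset. I am also assuming the five NYC boroughs appear under their county names (Bronx, Kings, New York, Queens, Richmond).

```r
library(dplyr)

# Hypothetical counts before suppression (made-up numbers for illustration)
raw_counts <- tibble::tibble(
    first_name = c("Emma", "Emma", "Zoe", "Zoe"),
    county     = c("Erie", "Kings", "Erie", "Kings"),
    name_count = c(7L, 9L, 4L, 12L)
)

# Assumption: NYC boroughs are identified by these county names
nyc_boroughs <- c("Bronx", "Kings", "New York", "Queens", "Richmond")

published <-
    raw_counts %>%
    # a count is listed at 10+ in an NYC borough and at 5+ elsewhere
    mutate(threshold = if_else(county %in% nyc_boroughs, 10L, 5L)) %>%
    filter(name_count >= threshold)

published
```

Only Emma in Erie and Zoe in Kings survive in this toy example, so a name missing from a county means "below the threshold", not necessarily zero babies.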

library(jsonlite)
library(RSocrata)
suppressMessages(library(tidyverse))

# read the app token from a local JSON file and pull the dataset via the Socrata API
baby_names <- RSocrata::read.socrata(
    "https://health.data.ny.gov/resource/jxy9-yhdk.json",
    app_token = read_json('.apptoken')[['token']]
)

baby_names %>% glimpse()
Rows: 87,899
Columns: 5
$ year       <chr> "2007", "2007", "2007", "2007", "2007", "2007", "2007", "20…
$ first_name <chr> "ZOEY", "ZOEY", "ZOEY", "ZOEY", "ZOE", "ZOE", "ZOE", "ZOE",…
$ county     <chr> "KINGS", "SUFFOLK", "MONROE", "ERIE", "ULSTER", "WESTCHESTE…
$ sex        <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
$ name_count <chr> "11", "6", "6", "9", "5", "24", "13", "55", "15", "6", "14"…

This dataset is already tidy: one row per observation (the count of one first_name in one county, for one sex and one year) and one column per variable (e.g. name_count, the number of babies given that name).
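One way to check the "one row per observation" claim is to count rows per combination of year, first_name, county and sex and look for combinations that occur more than once. Below is a minimal sketch on a toy table. The `toy` data and the `duplicated_keys` name are hypothetical, and I have not run this check against the full dataset.

```r
library(dplyr)

# Toy data with one deliberately repeated key (2008 / Zoe / Erie / F)
toy <- tibble::tibble(
    year       = c(2007L, 2007L, 2008L, 2008L),
    first_name = c("Zoe", "Zoe", "Zoe", "Zoe"),
    county     = c("Erie", "Kings", "Erie", "Erie"),
    sex        = c("F", "F", "F", "F"),
    name_count = c(9L, 11L, 5L, 6L)
)

# Count rows per key; a tidy table has exactly one row for each key
duplicated_keys <-
    toy %>%
    count(year, first_name, county, sex, name = "n_rows") %>%
    filter(n_rows > 1)

duplicated_keys
```

An empty result would support the tidy claim. Any rows returned point to keys that need to be aggregated or investigated.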

I first check a few data quality characteristics, such as missingness and the number of distinct values in each column:

purrr::map_dfr(baby_names,~{sum(is.na(.x))})
# A tibble: 1 × 5
   year first_name county   sex name_count
  <int>      <int>  <int> <int>      <int>
1     0          0      0     0          0
purrr::map_dfr(baby_names,~{n_distinct(.x)})
# A tibble: 1 × 5
   year first_name county   sex name_count
  <int>      <int>  <int> <int>      <int>
1    14       2320    123     2        244

The baby_names dataset requires initial preprocessing

Now we need to transform our dataset by first converting columns to the appropriate data types:

baby_names_transformed <- 
    baby_names %>% 
    mutate(
        year = as.integer(year),
        first_name = factor(first_name),
        county = factor(county),
        sex = factor(sex,levels=c("M","F"),labels=c("Male","Female")),
        name_count = as.integer(name_count)
    )

baby_names_transformed %>% summary()
      year        first_name        county          sex          name_count    
 Min.   :2007   EMMA   :  528   Kings  : 5715   Male  :46737   Min.   :  5.00  
 1st Qu.:2010   OLIVIA :  503   KINGS  : 4957   Female:41162   1st Qu.:  6.00  
 Median :2014   LOGAN  :  488   Suffolk: 4159                  Median : 11.00  
 Mean   :2013   MASON  :  475   SUFFOLK: 4005                  Mean   : 17.77  
 3rd Qu.:2017   LIAM   :  471   Queens : 3868                  3rd Qu.: 19.00  
 Max.   :2020   JACOB  :  465   Nassau : 3804                  Max.   :297.00  
                (Other):84969   (Other):61391                                  

After the data transformation, we see that counties appear in different cases (for example, both Kings and KINGS). We should standardize county (and also first_name) to a single case, such as title case:

baby_names_transformed <- 
    baby_names_transformed %>% 
    mutate(
        first_name = as.character(first_name) %>% stringr::str_to_title() %>% factor(),
        county = as.character(county) %>% stringr::str_to_title() %>% factor()
    )

baby_names_transformed %>% summary()
      year        first_name            county          sex       
 Min.   :2007   Emma   :  528   Kings      :10672   Male  :46737  
 1st Qu.:2010   Olivia :  503   Suffolk    : 8164   Female:41162  
 Median :2014   Logan  :  488   Nassau     : 7352                 
 Mean   :2013   Mason  :  475   Queens     : 7244                 
 3rd Qu.:2017   Liam   :  471   Westchester: 5844                 
 Max.   :2020   Jacob  :  465   Erie       : 5382                 
                (Other):84969   (Other)    :43241                 
   name_count    
 Min.   :  5.00  
 1st Qu.:  6.00  
 Median : 11.00  
 Mean   : 17.77  
 3rd Qu.: 19.00  
 Max.   :297.00  
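To see which values were affected by inconsistent casing, one option is to group the distinct spellings by their title-case form and keep the groups with more than one spelling. This is a sketch on a toy vector of counties. The `toy_counties` and `case_variants` names are hypothetical, and the `.by` argument assumes dplyr 1.1.0 or later, as in the rest of this post.

```r
library(dplyr)
library(stringr)

# Toy county column with mixed casing
toy_counties <- tibble::tibble(
    county = c("KINGS", "Kings", "SUFFOLK", "Suffolk", "Erie")
)

case_variants <-
    toy_counties %>%
    distinct(county) %>%
    mutate(county_title = str_to_title(county)) %>%
    # collect every raw spelling that maps to the same title-case value
    summarise(n_spellings = n(),
              spellings = paste(county, collapse = ", "),
              .by = county_title) %>%
    filter(n_spellings > 1)

case_variants
```

Here Kings and Suffolk each have two spellings, while Erie has one and is filtered out.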
                 

Many Baby Names Don’t Have Counts In A Year

My next data quality question is: for how many years does each name have count data?

theme_set(theme_bw(base_size = 16))

baby_names_transformed %>% 
    summarize(n_years = n_distinct(year),.by=c(first_name)) %>% 
    summarize(n_names = n_distinct(first_name),.by=n_years) %>% 
    ggplot(aes(n_years,n_names)) +
    geom_bar(color="black",fill = "gray80",stat = "identity") +
    scale_x_continuous(breaks = scales::pretty_breaks(14)) +
    scale_y_continuous(expand = c(0,0.1),
                       trans="sqrt",breaks = scales::pretty_breaks(10)) +
    labs(x="Year Data: Number of Years With Name Count Data",
         y="Number of Names With Year Data",
         title="Many Baby Names Don't Have Counts In A Year",
         subtitle="Every Name has Count Data For At Least One Year") +
    theme(
        panel.grid.major.y = element_line(color="gray75"),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()
    )
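The missing years are implicit: a name simply has no row for a year in which it fell below the reporting threshold. If later analysis needs those gaps to be explicit, `tidyr::complete()` can add them. Here is a small sketch on toy data. The `toy` table and its numbers are made up, and I fill the gaps with `NA` rather than 0 because a suppressed count is unknown, not zero.

```r
library(dplyr)
library(tidyr)

# Toy yearly counts: Zoey only has data for 2007
toy <- tibble::tibble(
    year       = c(2007L, 2008L, 2009L, 2007L),
    first_name = c("Emma", "Emma", "Emma", "Zoey"),
    name_count = c(20L, 25L, 30L, 6L)
)

# Add a row for every year/name combination; new rows get name_count = NA
toy_complete <-
    toy %>%
    complete(year, first_name)

# Number of years without count data for each name
missing_by_name <-
    toy_complete %>%
    summarise(n_years_missing = sum(is.na(name_count)), .by = first_name)

missing_by_name
```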

Most Baby Names Are Gender-Specific

Another question is whether many names are unisex, i.e. recorded for both male and female babies.

tmp <- 
    baby_names_transformed %>% 
    summarize(n_sex = n_distinct(sex),
              unisex = n_sex==2,onesex = n_sex==1,
              .by=c(first_name)) %>% 
    summarize(`Unisex` = sum(unisex),`One Sex`=sum(onesex))

tmp %>% 
    pivot_longer(cols = everything()) %>% 
    mutate(label = glue::glue("{name} (N={scales::comma(value)})")) %>% 
    ggplot(aes(factor(1),value,fill=label)) +
    geom_bar(stat="identity",position = "fill") +
    scale_fill_brewer(palette = "Dark2") +
    scale_y_continuous(labels = scales::percent) +
    guides(fill=guide_legend(title=NULL)) +
    labs(x=NULL,y="Percent of Names",title="Most Baby Names Are Gender-Specific",subtitle = "There Are A Few Names That Are Unisex, However") +
    theme(
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        legend.position = "top"
    )
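The chart only shows how many names are unisex, not which ones. A short sketch for listing them, again on toy data: the `toy` table, its counts, and the `unisex_names` name are hypothetical, and the same pipeline should work on a table with `first_name`, `sex` and `name_count` columns.

```r
library(dplyr)

# Toy data: Riley is recorded for both sexes
toy <- tibble::tibble(
    first_name = c("Riley", "Riley", "Emma", "Liam", "Riley"),
    sex        = c("Male", "Female", "Female", "Male", "Female"),
    name_count = c(6L, 8L, 20L, 18L, 5L)
)

unisex_names <-
    toy %>%
    summarise(n_sex = n_distinct(sex),
              total_count = sum(name_count),
              .by = first_name) %>%
    # keep names seen with both sexes, most frequent first
    filter(n_sex == 2) %>%
    arrange(desc(total_count))

unisex_names
```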

Most Baby Names are Counted in a Few NYS Counties

My last data quality question is: in how many counties does each baby name have data?

baby_names_transformed %>% 
    summarise(n_counties = n_distinct(county),.by=first_name) %>% 
    bind_cols(
        # use the case-normalized data so that e.g. KINGS and Kings count as one county
        summarise(baby_names_transformed,total_counties = n_distinct(county))
    ) %>% 
    mutate(
        freq_counties = n_counties / total_counties
    ) %>% 
    ggplot(aes(freq_counties,y=after_stat(count))) +
    geom_density(bw="nrd",color="blue",fill="cornflowerblue") +
    scale_x_continuous(labels = scales::percent) +
    scale_y_sqrt(breaks = scales::pretty_breaks(15)) +
    labs(x="Percent of Counties",y="Number of Names With County Data",
         title="Most Names are Counted in a Few NYS Counties") +
    theme(
        panel.grid.major.y = element_line(color="gray75"),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()
    )

After transforming the dataset and checking data quality, we see that:

  1. Most baby names have counts in only a few counties
  2. Baby names generally either have counts in only one year or have counts in every year
  3. Only a few baby names are not gender-specific