What Baby Names are In and Out of Fashion?

Using open data to summarize baby names over time

Published

October 19, 2025

[October 18th, 2025 update: replaced the archived {RSocrata} package with {socratadata}]

Open Data of Baby Names

Open Data Buffalo is a great resource and initiative to make datasets open and available to the public.

My partner works at a children’s hospital and is convinced that certain baby names are trending. So I said to her: let’s see what the data says!

So I ventured out into the world wide web and found a dataset called:

Baby Names: Beginning 2007

New York State (NYS) Baby Names are aggregated and displayed by the year, county, or borough where the mother resided as stated on a New York State or New York City (NYC) birth certificate. The frequency of the baby name is listed if there are 5 or more of the same baby name in a county outside of NYC or 10 or more of the same baby name in a NYC borough.

library(jsonlite)
suppressMessages(library(tidyverse))

baby_names <- 
    socratadata::soc_read(
      url = "https://health.data.ny.gov/resource/jxy9-yhdk.json",
      page_size = 1e7
    )
ℹ Utilizing v2.1 API. `include_synthetic_cols` will be ignored. Provide an `api_key_id` and `api_key_secret` to perform a v3 request.
baby_names %>% glimpse()
Rows: 99,116
Columns: 5
$ year       <chr> "2022", "2022", "2022", "2022", "2022", "2022", "2022", "20…
$ first_name <chr> "OLIVIA", "AMELIA", "AVERY", "EMMA", "CHARLOTTE", "CHLOE", …
$ county     <chr> "Albany", "Albany", "Albany", "Albany", "Albany", "Albany",…
$ sex        <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
$ name_count <dbl> 16, 15, 12, 11, 11, 11, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,…

This dataset is already tidy: one row per observation (a first name within a given year, county, and sex as recorded on the birth certificate) and one column per variable (e.g. name_count, the number of babies recorded with that name).
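
One way to probe the "one row per observation" claim is to count rows per candidate key and look for duplicates. The sketch below is only an illustration: it assumes the key is the combination of year, county, sex, and first_name, and it runs on a small made-up tibble (`toy_names`, a hypothetical stand-in for `baby_names`) so that it works without the download.

```r
library(dplyr)

# Toy stand-in for `baby_names`; the values are invented for illustration
toy_names <- tibble::tibble(
  year       = c("2022", "2022", "2022", "2021"),
  first_name = c("OLIVIA", "OLIVIA", "AVERY", "OLIVIA"),
  county     = c("Albany", "Albany", "Albany", "Albany"),
  sex        = c("F", "M", "F", "F"),
  name_count = c(16, 5, 12, 14)
)

# Count rows per candidate key (assumed: year, county, sex, first_name);
# any combination with n > 1 would mean the key is not unique
duplicated_keys <-
  toy_names %>%
  count(year, county, sex, first_name) %>%
  filter(n > 1)

# Zero rows here means every key combination appears exactly once
nrow(duplicated_keys)
```

Swapping `toy_names` for `baby_names` would run the same check on the real data. If duplicates show up there, inconsistent county spellings would be one thing to look at.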

I first check a few data quality characteristics, such as missingness and the number of distinct values in each column:

purrr::map_dfr(baby_names,~{sum(is.na(.x))})
# A tibble: 1 × 5
   year first_name county   sex name_count
  <int>      <int>  <int> <int>      <int>
1     0          0      0     0          0
purrr::map_dfr(baby_names,~{n_distinct(.x)})
# A tibble: 1 × 5
   year first_name county   sex name_count
  <int>      <int>  <int> <int>      <int>
1    16       2393     61     2        244
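
The dataset description quoted above says counts are only listed at 5 or more outside NYC and 10 or more within an NYC borough. A rough way to check this is to compare the smallest name_count inside and outside NYC. This is a sketch on made-up data (`toy_names` is hypothetical), and it assumes the boroughs appear under their county names (Bronx, Kings, New York, Queens, Richmond), which I have not verified for every borough in this dataset.

```r
library(dplyr)

# Toy stand-in for `baby_names`; counts are invented for illustration
toy_names <- tibble::tibble(
  county     = c("Kings", "Queens", "Albany", "Erie", "Bronx"),
  name_count = c(10, 25, 5, 7, 12)
)

# Assumption: the five NYC boroughs are stored under their county names
nyc_counties <- c("Bronx", "Kings", "New York", "Queens", "Richmond")

# Smallest reported count within NYC and in the rest of the state
min_by_region <-
  toy_names %>%
  mutate(
    # normalize case before matching, in case spellings vary
    region = if_else(
      stringr::str_to_title(county) %in% nyc_counties,
      "NYC",
      "Rest of NYS"
    )
  ) %>%
  summarise(min_count = min(name_count), .by = region)

min_by_region
```

On the real data, a minimum below 10 for NYC or below 5 elsewhere would suggest the thresholds are not applied the way the description states.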

The baby_names dataset requires initial preprocessing

Now we need to transform our dataset by first converting columns to the appropriate data types:

baby_names_transformed <- 
    baby_names %>% 
    mutate(
        year = as.integer(year),
        first_name = factor(first_name),
        county = factor(county),
        sex = factor(sex,levels=c("M","F"),labels=c("Male","Female")),
        name_count = as.integer(name_count)
    )

baby_names_transformed %>% summary()
      year        first_name            county          sex       
 Min.   :2007   EMMA   :  582   Kings      :11932   Male  :52728  
 1st Qu.:2010   OLIVIA :  562   Suffolk    : 9439   Female:46388  
 Median :2014   LOGAN  :  536   Nassau     : 8488                 
 Mean   :2014   LIAM   :  535   Queens     : 8038                 
 3rd Qu.:2018   MASON  :  522   Westchester: 6691                 
 Max.   :2022   NOAH   :  520   Erie       : 6160                 
                (Other):95859   (Other)    :48368                 
   name_count    
 Min.   :  5.00  
 1st Qu.:  6.00  
 Median : 11.00  
 Mean   : 17.57  
 3rd Qu.: 19.00  
 Max.   :297.00  
                 

After the data transformation, we see that counties are specified in different cases. To fix this, we convert county (and also first_name) to a single case, here title case:

baby_names_transformed <- 
    baby_names_transformed %>% 
    mutate(
        first_name = as.character(first_name) %>% stringr::str_to_title() %>% factor(),
        county = as.character(county) %>% stringr::str_to_title() %>% factor()
    )

baby_names_transformed %>% summary()
      year        first_name            county          sex       
 Min.   :2007   Emma   :  582   Kings      :11932   Male  :52728  
 1st Qu.:2010   Olivia :  562   Suffolk    : 9439   Female:46388  
 Median :2014   Logan  :  536   Nassau     : 8488                 
 Mean   :2014   Liam   :  535   Queens     : 8038                 
 3rd Qu.:2018   Mason  :  522   Westchester: 6691                 
 Max.   :2022   Noah   :  520   Erie       : 6160                 
                (Other):95859   (Other)    :48368                 
   name_count    
 Min.   :  5.00  
 1st Qu.:  6.00  
 Median : 11.00  
 Mean   : 17.57  
 3rd Qu.: 19.00  
 Max.   :297.00  
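
To see which values differ only by case before normalizing them, one option is to group the distinct spellings by a lower-cased key. This is a minimal sketch on made-up county values (`toy_counties` is hypothetical), not output from the real dataset.

```r
library(dplyr)

# Toy county column with mixed-case spellings (invented for illustration)
toy_counties <- tibble::tibble(
  county = c("Albany", "ALBANY", "Kings", "KINGS", "Erie")
)

# Group distinct spellings by a lower-cased key and keep keys
# that have more than one spelling
case_variants <-
  toy_counties %>%
  distinct(county) %>%
  mutate(county_key = stringr::str_to_lower(county)) %>%
  summarise(
    n_spellings = n(),
    spellings   = paste(county, collapse = ", "),
    .by = county_key
  ) %>%
  filter(n_spellings > 1)

case_variants
```

Running the same pipeline on `baby_names` (before the title-case step) would list any counties affected by inconsistent casing.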
                 

Many Baby Names Don’t Have Counts In A Year

My next data quality question: how many years of data does each name have?

theme_set(theme_bw(base_size = 16))

baby_names_transformed %>% 
    summarize(n_years = n_distinct(year),.by=c(first_name)) %>% 
    summarize(n_names = n_distinct(first_name),.by=n_years) %>% 
    ggplot(aes(n_years,n_names)) +
    geom_bar(color="black",fill = "gray80",stat = "identity") +
    scale_x_continuous(breaks = scales::pretty_breaks(14)) +
    scale_y_continuous(expand = c(0,0.1),
                       trans="sqrt",breaks = scales::pretty_breaks(10)) +
    labs(x="Year Data: Number of Years With Name Count Data",
         y="Number of Names With Year Data",
         title="Many Baby Names Don't Have Counts In A Year",
         subtitle="Every Name has Count Data For At Least One Year") +
    theme(
        panel.grid.major.y = element_line(color="gray75"),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()
    )

Most Baby Names Are Gender-Specific

Another question: are there many unisex names, i.e. names given to both male and female babies?

tmp <- 
    baby_names_transformed %>% 
    summarize(n_sex = n_distinct(sex),
              unisex = n_sex==2,onesex = n_sex==1,
              .by=c(first_name)) %>% 
    summarize(`Unisex` = sum(unisex),`One Sex`=sum(onesex))

tmp %>% 
    pivot_longer(cols = everything()) %>% 
    mutate(label = glue::glue("{name} (N={scales::comma(value)})")) %>% 
    ggplot(aes(factor(1),value,fill=label)) +
    geom_bar(stat="identity",position = "fill") +
    scale_fill_brewer(palette = "Dark2") +
    scale_y_continuous(labels = scales::percent) +
    guides(fill=guide_legend(title=NULL)) +
    labs(x=NULL,y="Percent of Names",title="Most Baby Names Are Gender-Specific",subtitle = "There Are A Few Names That Are Unisex, However") +
    theme(
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        legend.position = "top"
    )
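
The plot only shows how many names are unisex, not which ones. A short sketch for listing them, using the same n_distinct(sex) idea as above, is below. It runs on a made-up tibble (`toy_names` is hypothetical, and so are its counts), and the `total_count` column is my own addition for sorting.

```r
library(dplyr)

# Toy stand-in for `baby_names_transformed`; values are invented
toy_names <- tibble::tibble(
  first_name = c("Avery", "Avery", "Emma", "Liam", "Riley", "Riley"),
  sex        = c("Female", "Male", "Female", "Male", "Male", "Female"),
  name_count = c(12, 6, 30, 28, 9, 11)
)

# Keep names recorded under both sexes, most frequent first
unisex_names <-
  toy_names %>%
  summarise(
    n_sex       = n_distinct(sex),
    total_count = sum(name_count),
    .by = first_name
  ) %>%
  filter(n_sex == 2) %>%
  arrange(desc(total_count))

unisex_names
```

Because counts below the reporting threshold are not published, a name that is rare for one sex can look gender-specific here even if it is not.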

Most Baby Names are Counted in a Few NYS Counties

My last data quality question: in how many counties does each baby name have data?

baby_names_transformed %>% 
    summarise(n_counties = n_distinct(county),.by=first_name) %>% 
    bind_cols(
        # use the case-normalized data so county spellings are not double counted
        summarise(baby_names_transformed,total_counties = n_distinct(county))
    ) %>% 
    mutate(
        freq_counties = n_counties / total_counties
    ) %>% 
    ggplot(aes(freq_counties,y=after_stat(count))) +
    geom_density(bw="nrd",color="blue",fill="cornflowerblue") +
    scale_x_continuous(labels = scales::percent) +
    scale_y_sqrt(breaks = scales::pretty_breaks(15)) +
    labs(x="Percent of Counties",y="Number of Names With County Data",
         title="Most Names are Counted in a Few NYS Counties") +
    theme(
        panel.grid.major.y = element_line(color="gray75"),
        panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()
    )

After transforming the dataset and checking data quality, we see that:

  1. Baby names are sparsely annotated across counties.
  2. Baby names are, generally, specific to a year or are observed across all years.
  3. There are a few baby names that are not gender-specific.