Identifying Open Data about the City of Buffalo and Western New York

Starting the journey to make some interesting and fun insights about my hometown by identifying open datasets

Published

July 31, 2023

Open Data Buffalo is a great resource and initiative to make datasets open and available to the public. My goal looking here is to identify and download available datasets about my hometown. This is the first leg of the journey in order to perform analyses or create web applications that do something interesting, fun, and possibly useful with data from my hometown. Along thhe way, I hope to learn more about APIs, Shiny development, and knowledge about the place I call home.

Let’s get started!

I first made an account with Tyler Data and Insights. Not sure why, but I may learn why later on.

Second, I searched for datasets and sorted by most accessed datasets.

Screenshot of search filter and sorted site

Screenshot of search filter and sorted site

And after scrolling through a few pages, I’m curious to look at the Active Corporations dataset.

Screenshot of Active Corporations dataset webpage

Screenshot of Active Corporations dataset webpage

Luckily, this dataset has a few options to access the data:

Screenshot of API options

Screenshot of API options

And below is some information about the dataset on the webpage. It has 3.6 million rows and 30 columns.

Screenshot of Active Corporations dataset description

Screenshot of Active Corporations dataset description

Let’s see what happens when I try to view this dataset.

library(jsonlite)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
jobj <- jsonlite::read_json("https://data.ny.gov/resource/n9v6-gdp6.json",simplifyVector = TRUE)

jobj %>% dplyr::glimpse()
Rows: 1,000
Columns: 30
$ dos_id                     <chr> "4072354", "6221345", "4265810", "1330171",…
$ current_entity_name        <chr> "GRACE SEAFOOD CORP.", "KV CATALYST IM, LLC…
$ initial_dos_filing_date    <chr> "2011-03-24T00:00:00.000", "2021-07-15T00:0…
$ county                     <chr> "Bronx", "New York", "Queens", "Cayuga", "B…
$ jurisdiction               <chr> "New York", "Delaware", "New York", "New Yo…
$ entity_type                <chr> "DOMESTIC BUSINESS CORPORATION", "FOREIGN L…
$ dos_process_name           <chr> "GRACE SEAFOOD CORP.", "KV CATALYST IM, LLC…
$ dos_process_address_1      <chr> "1226 EAST 183RD STREET", "2373 BROADWAY, A…
$ dos_process_city           <chr> "BRONX", "NEW YORK", "MIDDLE VILLAGE", "AUB…
$ dos_process_state          <chr> "NY", "NY", "NY", "NY", "NY", "NY", "NY", "…
$ dos_process_zip            <chr> "10453", "10024", "11379", "13021", "10467"…
$ chairman_name              <chr> NA, NA, NA, "GARY J. CUNNINGHAM, JR.", NA, …
$ chairman_address_1         <chr> NA, NA, NA, "GARY J. CUNNINGHAM, JR.", NA, …
$ chairman_city              <chr> NA, NA, NA, "AUBURN", NA, "BRONX", "ATLANTI…
$ chairman_state             <chr> NA, NA, NA, "NY", NA, "NY", "NY", "NY", NA,…
$ chairman_zip               <chr> NA, NA, NA, "130210251", NA, "10469", "1150…
$ location_name              <chr> NA, NA, NA, "YAWGER BROOK BAKES, INC.", NA,…
$ location_address_1         <chr> NA, NA, NA, "C/O GARY J. CUNNINGHAM, JR.", …
$ location_address_2         <chr> NA, NA, NA, "323 CLARK ST", NA, "1ST FL", N…
$ location_city              <chr> NA, NA, NA, "AUBURN", NA, "BRONX", "ATLANTI…
$ location_state             <chr> NA, NA, NA, "NY", NA, "NY", "NY", "NY", NA,…
$ location_zip               <chr> NA, NA, NA, "13021", NA, "10469", "11509", …
$ dos_process_address_2      <chr> NA, NA, NA, NA, NA, "1ST FL", NA, NA, NA, "…
$ chairman_address_2         <chr> NA, NA, NA, NA, NA, "1ST FL", NA, NA, NA, N…
$ registered_agent_name      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ registered_agent_address_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ registered_agent_city      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ registered_agent_state     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ registered_agent_zip       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ registered_agent_address_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Interesting! I may come back to this but let’s see what another dataset looks like. The dataset description is on the site below.

Screenshot of Inpatient Discharge dataset

Screenshot of Inpatient Discharge dataset.
jobj <- jsonlite::read_json("https://health.data.ny.gov/resource/u4ud-w55t.json",simplifyVector = TRUE)

jobj %>% dplyr::glimpse()
Rows: 1,000
Columns: 34
$ hospital_service_area               <chr> "Capital/Adirond", "Capital/Adiron…
$ hospital_county                     <chr> "Albany", "Albany", "Albany", "Alb…
$ operating_certificate_number        <chr> "0101000", "0101000", "0101000", "…
$ facility_id                         <chr> "1", "1", "1", "1", "1", "1", "1",…
$ facility_name                       <chr> "Albany Medical Center Hospital", …
$ age_group                           <chr> "0 to 17", "0 to 17", "18 to 29", …
$ zip_code_3_digits                   <chr> "121", "121", "120", "121", "124",…
$ gender                              <chr> "F", "F", "F", "F", "M", "F", "M",…
$ race                                <chr> "Other Race", "White", "White", "W…
$ ethnicity                           <chr> "Not Span/Hispanic", "Not Span/His…
$ length_of_stay                      <chr> "1", "1", "3", "2", "3", "3", "3",…
$ type_of_admission                   <chr> "Urgent", "Emergency", "Urgent", "…
$ patient_disposition                 <chr> "Home or Self Care", "Home or Self…
$ discharge_year                      <chr> "2012", "2012", "2012", "2012", "2…
$ ccs_diagnosis_code                  <chr> "128", "251", "184", "189", "042",…
$ ccs_diagnosis_description           <chr> "Asthma", "Abdominal pain", "Early…
$ ccs_procedure_code                  <chr> "000", "000", "000", "134", "090",…
$ ccs_procedure_description           <chr> "NO PROC", "NO PROC", "NO PROC", "…
$ apr_drg_code                        <chr> "141", "251", "563", "540", "224",…
$ apr_drg_description                 <chr> "Asthma", "Abdominal pain", "Prete…
$ apr_mdc_code                        <chr> "04", "06", "14", "14", "06", "14"…
$ apr_mdc_description                 <chr> "Diseases and Disorders of the Res…
$ apr_severity_of_illness_code        <chr> "2", "1", "2", "1", "2", "1", "1",…
$ apr_severity_of_illness_description <chr> "Moderate", "Minor", "Moderate", "…
$ apr_risk_of_mortality               <chr> "Minor", "Minor", "Minor", "Minor"…
$ apr_medical_surgical_description    <chr> "Medical", "Medical", "Medical", "…
$ source_of_payment_1                 <chr> "Medicaid", "Medicaid", "Medicaid"…
$ source_of_payment_2                 <chr> "Medicaid", "Self-Pay", "Medicaid"…
$ source_of_payment_3                 <chr> "Self-Pay", NA, "Self-Pay", NA, "S…
$ birth_weight                        <chr> "0000", "0000", "0000", "0000", "0…
$ abortion_edit_indicator             <chr> "N", "N", "N", "N", "N", "N", "N",…
$ emergency_department_indicator      <chr> "N", "Y", "N", "N", "N", "N", "N",…
$ total_charges                       <chr> "9965.25", "6685.95", "12885.00", …
$ total_costs                         <chr> "3269.66", "2148.08", "4132.75", "…

It looks like I’m only downloading 1000 rows from each dataset. Hmmm. It looks like I can export the button by pressing the ‘Export’ button, but I don’t want to have datasets on my laptop. I rather load them into my environment as needed by whatever I would end up developing. Maybe I need to read these SODA docs that are linked on the site.

Screenshot of SODA Getting Started docs

Screenshot of SODA Getting Started docs

So it seems like I need to use some extra parameter filtering? Let’s try it.

jobj <- jsonlite::read_json("https://health.data.ny.gov/resource/u4ud-w55t.json?apr_drg_description=Asthma",simplifyVector = TRUE)

jobj %>% dplyr::glimpse()
Rows: 1,000
Columns: 34
$ hospital_service_area               <chr> "Capital/Adirond", "Capital/Adiron…
$ hospital_county                     <chr> "Albany", "Albany", "Albany", "Alb…
$ operating_certificate_number        <chr> "0101000", "0101000", "0101000", "…
$ facility_id                         <chr> "1", "1", "1", "1", "1", "1", "1",…
$ facility_name                       <chr> "Albany Medical Center Hospital", …
$ age_group                           <chr> "0 to 17", "0 to 17", "0 to 17", "…
$ zip_code_3_digits                   <chr> "121", "124", "128", "OOS", "124",…
$ gender                              <chr> "F", "M", "M", "M", "M", "F", "M",…
$ race                                <chr> "Other Race", "White", "Other Race…
$ ethnicity                           <chr> "Not Span/Hispanic", "Not Span/His…
$ length_of_stay                      <chr> "1", "1", "1", "1", "1", "1", "1",…
$ type_of_admission                   <chr> "Urgent", "Urgent", "Urgent", "Urg…
$ patient_disposition                 <chr> "Home or Self Care", "Home or Self…
$ discharge_year                      <chr> "2012", "2012", "2012", "2012", "2…
$ ccs_diagnosis_code                  <chr> "128", "128", "128", "128", "128",…
$ ccs_diagnosis_description           <chr> "Asthma", "Asthma", "Asthma", "Ast…
$ ccs_procedure_code                  <chr> "000", "000", "000", "000", "000",…
$ ccs_procedure_description           <chr> "NO PROC", "NO PROC", "NO PROC", "…
$ apr_drg_code                        <chr> "141", "141", "141", "141", "141",…
$ apr_drg_description                 <chr> "Asthma", "Asthma", "Asthma", "Ast…
$ apr_mdc_code                        <chr> "04", "04", "04", "04", "04", "04"…
$ apr_mdc_description                 <chr> "Diseases and Disorders of the Res…
$ apr_severity_of_illness_code        <chr> "2", "1", "2", "2", "1", "1", "1",…
$ apr_severity_of_illness_description <chr> "Moderate", "Minor", "Moderate", "…
$ apr_risk_of_mortality               <chr> "Minor", "Minor", "Minor", "Minor"…
$ apr_medical_surgical_description    <chr> "Medical", "Medical", "Medical", "…
$ source_of_payment_1                 <chr> "Medicaid", "Managed Care, Unspeci…
$ source_of_payment_2                 <chr> "Medicaid", "Self-Pay", "Self-Pay"…
$ source_of_payment_3                 <chr> "Self-Pay", NA, NA, NA, NA, NA, NA…
$ birth_weight                        <chr> "0000", "0000", "0000", "0000", "0…
$ abortion_edit_indicator             <chr> "N", "N", "N", "N", "N", "N", "N",…
$ emergency_department_indicator      <chr> "N", "N", "N", "N", "N", "N", "N",…
$ total_charges                       <chr> "9965.25", "7003.18", "3458.39", "…
$ total_costs                         <chr> "3269.66", "2313.70", "1175.09", "…

I think that worked - I only downloaded Asthma items! But still, a limited dataset of 1000 rows was downloaded. What do I need to download all items? Ah, looks like I need to read about the application token usage.

Screenshot of throttling and application tokens

Screenshot of throttling and application tokens

I must use an App token. I will create one in my profile settings - this must be why my previous self made the account earlier on in this post. Huzzah! So now that I created an app token, let’s try using it. I’ll use the glue package to read it in from a hidden file so I don’t have it sown here 😁

library(glue)

jobj <- jsonlite::read_json(glue::glue("https://health.data.ny.gov/resource/u4ud-w55t.json?$$app_token={read_json('.apptoken')[['token']]}"),simplifyVector = T)

jobj %>% dplyr::glimpse()
Rows: 1,000
Columns: 34
$ hospital_service_area               <chr> "Capital/Adirond", "Capital/Adiron…
$ hospital_county                     <chr> "Albany", "Albany", "Albany", "Alb…
$ operating_certificate_number        <chr> "0101000", "0101000", "0101000", "…
$ facility_id                         <chr> "1", "1", "1", "1", "1", "1", "1",…
$ facility_name                       <chr> "Albany Medical Center Hospital", …
$ age_group                           <chr> "0 to 17", "0 to 17", "18 to 29", …
$ zip_code_3_digits                   <chr> "121", "121", "120", "121", "124",…
$ gender                              <chr> "F", "F", "F", "F", "M", "F", "M",…
$ race                                <chr> "Other Race", "White", "White", "W…
$ ethnicity                           <chr> "Not Span/Hispanic", "Not Span/His…
$ length_of_stay                      <chr> "1", "1", "3", "2", "3", "3", "3",…
$ type_of_admission                   <chr> "Urgent", "Emergency", "Urgent", "…
$ patient_disposition                 <chr> "Home or Self Care", "Home or Self…
$ discharge_year                      <chr> "2012", "2012", "2012", "2012", "2…
$ ccs_diagnosis_code                  <chr> "128", "251", "184", "189", "042",…
$ ccs_diagnosis_description           <chr> "Asthma", "Abdominal pain", "Early…
$ ccs_procedure_code                  <chr> "000", "000", "000", "134", "090",…
$ ccs_procedure_description           <chr> "NO PROC", "NO PROC", "NO PROC", "…
$ apr_drg_code                        <chr> "141", "251", "563", "540", "224",…
$ apr_drg_description                 <chr> "Asthma", "Abdominal pain", "Prete…
$ apr_mdc_code                        <chr> "04", "06", "14", "14", "06", "14"…
$ apr_mdc_description                 <chr> "Diseases and Disorders of the Res…
$ apr_severity_of_illness_code        <chr> "2", "1", "2", "1", "2", "1", "1",…
$ apr_severity_of_illness_description <chr> "Moderate", "Minor", "Moderate", "…
$ apr_risk_of_mortality               <chr> "Minor", "Minor", "Minor", "Minor"…
$ apr_medical_surgical_description    <chr> "Medical", "Medical", "Medical", "…
$ source_of_payment_1                 <chr> "Medicaid", "Medicaid", "Medicaid"…
$ source_of_payment_2                 <chr> "Medicaid", "Self-Pay", "Medicaid"…
$ source_of_payment_3                 <chr> "Self-Pay", NA, "Self-Pay", NA, "S…
$ birth_weight                        <chr> "0000", "0000", "0000", "0000", "0…
$ abortion_edit_indicator             <chr> "N", "N", "N", "N", "N", "N", "N",…
$ emergency_department_indicator      <chr> "N", "Y", "N", "N", "N", "N", "N",…
$ total_charges                       <chr> "9965.25", "6685.95", "12885.00", …
$ total_costs                         <chr> "3269.66", "2148.08", "4132.75", "…

Hmmm still a limited dataset downloaded even with the app token. Let’s see if there is an http message.

library(httr)
res <- httr::GET(glue::glue("https://health.data.ny.gov/resource/u4ud-w55t.json?$$app_token={read_json('.apptoken')[['token']]}"))

names(res)
 [1] "url"         "status_code" "headers"     "all_headers" "cookies"    
 [6] "content"     "date"        "times"       "request"     "handle"     
res$status_code
[1] 200
rlist <- httr::content(res)

do.call(dplyr::bind_rows, rlist) %>% dplyr::glimpse()
Rows: 1,000
Columns: 34
$ hospital_service_area               <chr> "Capital/Adirond", "Capital/Adiron…
$ hospital_county                     <chr> "Albany", "Albany", "Albany", "Alb…
$ operating_certificate_number        <chr> "0101000", "0101000", "0101000", "…
$ facility_id                         <chr> "1", "1", "1", "1", "1", "1", "1",…
$ facility_name                       <chr> "Albany Medical Center Hospital", …
$ age_group                           <chr> "0 to 17", "0 to 17", "18 to 29", …
$ zip_code_3_digits                   <chr> "121", "121", "120", "121", "124",…
$ gender                              <chr> "F", "F", "F", "F", "M", "F", "M",…
$ race                                <chr> "Other Race", "White", "White", "W…
$ ethnicity                           <chr> "Not Span/Hispanic", "Not Span/His…
$ length_of_stay                      <chr> "1", "1", "3", "2", "3", "3", "3",…
$ type_of_admission                   <chr> "Urgent", "Emergency", "Urgent", "…
$ patient_disposition                 <chr> "Home or Self Care", "Home or Self…
$ discharge_year                      <chr> "2012", "2012", "2012", "2012", "2…
$ ccs_diagnosis_code                  <chr> "128", "251", "184", "189", "042",…
$ ccs_diagnosis_description           <chr> "Asthma", "Abdominal pain", "Early…
$ ccs_procedure_code                  <chr> "000", "000", "000", "134", "090",…
$ ccs_procedure_description           <chr> "NO PROC", "NO PROC", "NO PROC", "…
$ apr_drg_code                        <chr> "141", "251", "563", "540", "224",…
$ apr_drg_description                 <chr> "Asthma", "Abdominal pain", "Prete…
$ apr_mdc_code                        <chr> "04", "06", "14", "14", "06", "14"…
$ apr_mdc_description                 <chr> "Diseases and Disorders of the Res…
$ apr_severity_of_illness_code        <chr> "2", "1", "2", "1", "2", "1", "1",…
$ apr_severity_of_illness_description <chr> "Moderate", "Minor", "Moderate", "…
$ apr_risk_of_mortality               <chr> "Minor", "Minor", "Minor", "Minor"…
$ apr_medical_surgical_description    <chr> "Medical", "Medical", "Medical", "…
$ source_of_payment_1                 <chr> "Medicaid", "Medicaid", "Medicaid"…
$ source_of_payment_2                 <chr> "Medicaid", "Self-Pay", "Medicaid"…
$ source_of_payment_3                 <chr> "Self-Pay", NA, "Self-Pay", NA, "S…
$ birth_weight                        <chr> "0000", "0000", "0000", "0000", "0…
$ abortion_edit_indicator             <chr> "N", "N", "N", "N", "N", "N", "N",…
$ emergency_department_indicator      <chr> "N", "Y", "N", "N", "N", "N", "N",…
$ total_charges                       <chr> "9965.25", "6685.95", "12885.00", …
$ total_costs                         <chr> "3269.66", "2148.08", "4132.75", "…

Not that I can tell, and still a limited dataset is downloaded. After some google searches, I found the {RSocrata} that may be helpful.

library(RSocrata)
dat <- RSocrata::read.socrata("https://health.data.ny.gov/resource/u4ud-w55t.json",app_token = read_json('.apptoken')[['token']])

dat %>% dplyr::glimpse()
Rows: 2,544,543
Columns: 34
$ hospital_service_area               <chr> "Western NY", "Western NY", "Weste…
$ hospital_county                     <chr> "Allegany", "Allegany", "Allegany"…
$ operating_certificate_number        <chr> "0226700", "0226700", "0226700", "…
$ facility_id                         <chr> "37", "37", "37", "37", "37", "37"…
$ facility_name                       <chr> "Cuba Memorial Hospital Inc", "Cub…
$ age_group                           <chr> "30 to 49", "70 or Older", "30 to …
$ zip_code_3_digits                   <chr> "147", "147", "147", "147", "147",…
$ gender                              <chr> "F", "F", "F", "F", "F", "F", "M",…
$ race                                <chr> "White", "White", "White", "White"…
$ ethnicity                           <chr> "Not Span/Hispanic", "Not Span/His…
$ length_of_stay                      <chr> "4", "4", "3", "1", "3", "1", "3",…
$ type_of_admission                   <chr> "Elective", "Urgent", "Urgent", "U…
$ patient_disposition                 <chr> "Home or Self Care", "Short-term H…
$ discharge_year                      <chr> "2012", "2012", "2012", "2012", "2…
$ ccs_diagnosis_code                  <chr> "122", "197", "122", "122", "122",…
$ ccs_diagnosis_description           <chr> "Pneumonia (except that caused by …
$ ccs_procedure_code                  <chr> "000", "000", "000", "000", "000",…
$ ccs_procedure_description           <chr> "NO PROC", "NO PROC", "NO PROC", "…
$ apr_drg_code                        <chr> "139", "383", "139", "139", "139",…
$ apr_drg_description                 <chr> "Other pneumonia", "Cellulitis & o…
$ apr_mdc_code                        <chr> "04", "09", "04", "04", "04", "06"…
$ apr_mdc_description                 <chr> "Diseases and Disorders of the Res…
$ apr_severity_of_illness_code        <chr> "1", "3", "1", "1", "2", "1", "2",…
$ apr_severity_of_illness_description <chr> "Minor", "Major", "Minor", "Minor"…
$ apr_risk_of_mortality               <chr> "Minor", "Major", "Minor", "Minor"…
$ apr_medical_surgical_description    <chr> "Medical", "Medical", "Medical", "…
$ source_of_payment_1                 <chr> "Blue Cross/Blue Shield", "Medicar…
$ birth_weight                        <chr> "0000", "0000", "0000", "0000", "0…
$ abortion_edit_indicator             <chr> "N", "N", "N", "N", "N", "N", "N",…
$ emergency_department_indicator      <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
$ total_charges                       <chr> "5511.95", "4783.20", "3829.15", "…
$ total_costs                         <chr> "5582.49", "5162.82", "4056.52", "…
$ source_of_payment_2                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ source_of_payment_3                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA…

Eureka! This downloaded a full dataset. So seems like I can use this package to start reading in and playing with full datasets. I can also use the API to make smaller requests based on filters, too. Tis will be better in the long run since I don’t want to have to download entire datasets. But maybe I will once and store using {arrow} so a web app or notebook doesn’t make a lot of API calls. We shall see how this journey unfolds!

Hope this train of thought is useful to you readers! I’ll definitely come back to this.