Open Case Studies

The Open Case Studies project showcases what can be achieved when working with real-world data.

Housed in a freely accessible GitHub repository, the project’s self-contained and experiential guides demonstrate the data analysis process and the use of various data science methods, tools, and software in the context of messy, real-world data.

These case studies will empower current and future data scientists to leverage real-world data to solve leading public health challenges.

The Open Case Studies project provides insights about gathering and working with data for students, instructors, and those with experience in data science or statistical methods at nonprofit organizations and public sector agencies.

Each case study in the project focuses on an important public health topic and introduces methods to provide users with the skills and knowledge for greater legibility, reproducibility, rigor, and flexibility in their own data analyses.

The following in-depth case studies use real data and focus on five areas of public health that are particularly pressing in the United States.

Addiction & Overdose

Vaping Behaviors in American Youth

This case study explores trends in tobacco product usage among American youths surveyed in the National Youth Tobacco Survey (NYTS) from 2015 to 2019. It demonstrates how to use survey data and code books and provides an introduction to writing functions to repeatedly wrangle similar but slightly different data. The case study introduces packages that account for survey weighting and survey design when comparing vaping product usage among different groups, and covers how to use a logistic regression to compare groups for a binary variable (such as true or false; in this case, whether or not a respondent used vaping products). This case study also covers how to make visualizations of multiple groups over time with confidence interval error bars.
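To illustrate why survey weights matter, here is a minimal sketch of a weighted proportion. The case study itself works in R with survey packages; this Python version, the function name `weighted_proportion`, and all of the numbers are hypothetical and only show the point estimate. Standard errors additionally require the survey design, which this sketch ignores.

```python
def weighted_proportion(responses, weights):
    """Estimate a population proportion from binary responses and survey weights."""
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Sum the weights of respondents who answered "yes" (True).
    weighted_yes = sum(w for r, w in zip(responses, weights) if r)
    return weighted_yes / total_weight


# Hypothetical respondents: True means the respondent reported vaping.
responses = [True, False, False, True]
# Hypothetical weights: how many students each respondent represents.
weights = [100.0, 300.0, 400.0, 200.0]

unweighted = sum(responses) / len(responses)
weighted = weighted_proportion(responses, weights)
print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

In this made-up example the two estimates differ because the respondents who reported vaping represent fewer students than those who did not.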

View Case Study

Addiction & Overdose

Opioids in the United States

This case study examines the number of opioid pills (specifically oxycodone and hydrocodone, as they are the top two misused opioids) shipped to pharmacies and practitioners at the county level around the United States from 2006 to 2014 using data from the Drug Enforcement Administration (DEA). This case study demonstrates how to get data from an application programming interface (API). It explores why and how to normalize data, as well as why and how to potentially stratify or redefine groups. It also shows how to compare two independent groups when the data are not normally distributed using the Wilcoxon rank sum test (also called the Mann-Whitney U test) and how to add confidence intervals to plots using a method called bootstrapping.
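As a rough illustration of bootstrapping, the sketch below computes a percentile bootstrap confidence interval for a mean. It is written in Python rather than the R used by the case study, and the function name `bootstrap_ci`, its parameters, and the data values are all hypothetical. The percentile method is one of several bootstrap variants and may not be the one the case study uses.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = int((1 - alpha / 2) * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical pills-per-person values for ten counties.
pills_per_person = [12.0, 15.5, 9.8, 30.2, 22.1, 18.4, 11.3, 25.7, 14.9, 19.6]
low, high = bootstrap_ci(pills_per_person)
print(f"mean: {statistics.fmean(pills_per_person):.2f}")
print(f"95% interval: {low:.2f} to {high:.2f}")
```

The two returned values can be drawn as the lower and upper ends of an error bar around the plotted mean.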

View Case Study

Adolescent Health

Disparities in Youth Disconnection

This case study focuses on rates of disconnection among youth (people aged 16 to 24 who are neither working nor in school) across different racial, ethnic, and gender subgroups to identify subgroups that may be particularly vulnerable. It demonstrates that deeper inspection of subgroups yields some differences that are not otherwise discernible, how to import data from a PDF using screenshots of sections of the PDF, and how to use the Mann-Kendall trend test to test for a consistent direction in the relationship between disconnection rates and time. This case study also shows how to make a visualization that stylistically matches that of an existing report, how to add images to plots, and how to create effective bar plots for multiple comparisons across several groups.
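The idea behind the Mann-Kendall trend test can be sketched in a few lines: count how often later values are larger versus smaller than earlier ones, then compare that count with what chance alone would produce. This is a simplified Python sketch, not the R code from the case study. It uses the normal approximation without a correction for tied values, and the function name `mann_kendall` and the rates below are hypothetical.

```python
import math


def mann_kendall(series):
    """Return (S, z, two-sided p-value) for the Mann-Kendall trend test.

    Uses the normal approximation and assumes there are no tied values.
    """
    n = len(series)
    if n < 3:
        raise ValueError("need at least three observations")
    s = 0
    # Compare every earlier value with every later value.
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # +1 if increasing, -1 if decreasing
    variance = n * (n - 1) * (2 * n + 5) / 18  # variance of S without ties
    if s > 0:
        z = (s - 1) / math.sqrt(variance)  # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(variance)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return s, z, p_value


# Hypothetical yearly disconnection rates (percent) that decline steadily.
rates = [14.7, 14.1, 13.8, 13.2, 12.3, 11.7, 11.5]
s, z, p_value = mann_kendall(rates)
print(f"S = {s}, z = {z:.2f}, p = {p_value:.4f}")
```

A negative S indicates a downward trend and a positive S an upward one; the p-value indicates how consistent that direction is.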

View Case Study

Adolescent Health

Mental Health of American Youth

This case study investigates how the rate of self-reported symptoms of major depressive episodes (MDE) changed among American youth (ages 12 to 17) from 2004 to 2018. It describes the impact of self-reporting bias in surveys, how to get data directly from a website, and how to use a chi-squared test to determine if two variables are independent (in this case, whether the sex of the students influenced the frequency of reported MDE symptoms in 2004 and 2018). This case study also demonstrates how to create direct labels on visualizations with many groups across time, as well as how to create an animated GIF.
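For readers new to the chi-squared test of independence, here is a minimal sketch for a 2x2 table of counts. It is written in Python rather than the R used in the case study, the function name `chi_squared_2x2` and the counts are hypothetical, and no continuity correction is applied (R's `chisq.test` applies one to 2x2 tables by default, so its result would differ slightly).

```python
import math


def chi_squared_2x2(table):
    """Chi-squared statistic and p-value (1 degree of freedom) for a 2x2 table."""
    if len(table) != 2 or any(len(row) != 2 for row in table):
        raise ValueError("expected a 2x2 table of counts")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows are two groups, columns are symptoms yes / no.
counts = [[10, 20], [20, 10]]
statistic, p_value = chi_squared_2x2(counts)
print(f"chi-squared = {statistic:.3f}, p = {p_value:.4f}")
```

A small p-value suggests that the two variables are not independent in the sampled data.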

View Case Study

Environmental Challenges

Exploring CO2 Emissions Across Time

This case study investigates how CO2 emissions have changed since the 1700s and how the level of emissions compares across countries around the world. It explores how yearly average temperature and the number of natural disasters in the United States have changed over time and provides an introduction to examining whether two sets of data are correlated with one another. This case study also goes into great detail about how to make heatmaps and other plots to visualize multiple groups over time, including how to add labels directly to lines on plots with multiple lines.

View Case Study

Environmental Challenges

Predicting Annual Air Pollution

This case study uses machine learning methods to predict annual air pollution levels spatially within the United States based on data about population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data, among other predictors. Machine learning methods are used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when current monitoring systems do not provide enough spatial granularity. The case study also demonstrates how to visualize data using maps.

View Case Study

Obesity & The Food System

Exploring Global Patterns of Obesity Across Rural and Urban Regions

This case study compares average Body Mass Index measurements for males and females from rural and urban regions in over 200 countries around the world, with a particular emphasis on the United States. It provides a thorough introduction to wrangling data from a PDF, how to compare two paired groups in R using the t-test and the nonparametric Wilcoxon signed-rank test, and how to make visualizations of group comparisons that emphasize a particular subset of the data.

View Case Study

Obesity & The Food System

Exploring Global Patterns of Dietary Behaviors Associated with Health Risk

This case study investigates the consumption of dietary factors associated with health risk among males and females from over 200 countries around the world, with a particular emphasis on the United States. It demonstrates how to wrangle data from a PDF; how to combine data from two different sources; how to compare two paired groups and multiple paired groups using t-tests, ANOVA, and linear regression; how to create visualizations of several groups; and how to combine plots with very different scales.

View Case Study

Violence

Influence of Multicollinearity on Measured Impact of Right-To-Carry Gun Laws

This case study focuses on two well-known studies that evaluated the influence of right-to-carry gun laws on violent crime rates. It demonstrates a phenomenon called multicollinearity, where explanatory variables that can predict one another can lead to aberrant and unstable findings; how to make visualizations with labels, such as arrows or equations; and how to combine multiple plots together.
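One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below covers only the simplest situation, a model with exactly two predictors, where the VIF reduces to 1 / (1 - r^2) for the correlation r between them. It is written in Python rather than the R used by the case study, and the function names and data are hypothetical. Whether the case study itself uses VIF is not stated here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """Variance inflation factor when a model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors: x2 is almost exactly twice x1, x3 is less related.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]

vif_collinear = vif_two_predictors(x1, x2)
vif_weaker = vif_two_predictors(x1, x3)
print(f"VIF for x1 with x2: {vif_collinear:.1f}")
print(f"VIF for x1 with x3: {vif_weaker:.1f}")
```

A large VIF means the coefficient estimate for that predictor is unstable, because the model cannot separate its effect from that of the predictor it tracks.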

View Case Study

Violence

School Shootings in the United States

This case study illustrates ways to communicate trends in a dataset about the number and characteristics of school shooting events for students in grades K-12 in the United States since 1970. It demonstrates how to create a dashboard, which is a website that shows patterns in a dataset in a concise manner; how to import data from a Google Sheets document; how to create interactive tables and maps; and how to properly calculate percentages for data when there are missing values.
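To see why missing values matter when calculating percentages, consider the minimal sketch below. It is written in Python rather than the R used by the case study, and the function name `percent_true`, its parameter, and the data are hypothetical. It only shows that the choice of denominator changes the answer; which choice is appropriate depends on the question being asked.

```python
def percent_true(values, include_missing_in_denominator=False):
    """Percentage of True values, where None marks a missing value."""
    known = [v for v in values if v is not None]
    if include_missing_in_denominator:
        denominator = len(values)
    else:
        denominator = len(known)
    if denominator == 0:
        raise ValueError("no values to summarize")
    return 100.0 * sum(1 for v in known if v) / denominator


# Hypothetical records: was the event during school hours? None = not recorded.
during_school_hours = [True, False, None, True, None]

of_known = percent_true(during_school_hours)
of_all = percent_true(during_school_hours, include_missing_in_denominator=True)
print(f"of records with a known value: {of_known:.1f}%")
print(f"of all records: {of_all:.1f}%")
```

Reporting the number of missing values alongside either percentage helps readers interpret it correctly.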

View Case Study

The Open Case Studies project approaches data in many different ways. The guide below will help connect you with a case study:

Questions

Data science projects often start with a question. Here, you may look for case studies that explore a question that is similar to one you are interested in investigating with your data.

How does something change over time?

Investigating how a variable has changed over time can help identify consistent trends.
Case Study: Disparities in Youth Disconnection
How do survey responses compare for different groups over time?

Survey data requires special care and attention to the survey design.
Case Study: Vaping Behaviors in American Youth
How do groups compare?

Public health researchers are often interested to know if one group is more vulnerable than another or if two or more groups are actually different from one another.
Case Study: Exploring Global Patterns of Dietary Behaviors Associated with Health Risk
How do groups compare over time?

Comparing several groups over time can provide insight into whether the change over time differs between groups.
Case Study: Mental Health of American Youth
How do paired groups compare?

Paired groups are those that are not independent in some way. Perhaps you want to know how data from the same person over time compares with that of another person over time, or perhaps you are interested in how something changed in a city before and after an intervention, or perhaps you want to compare groups using data that has structure where there is coupling or matching of data values across samples.
Case Study 1: Exploring Global Patterns of Obesity Across Rural and Urban Regions
Case Study 2: Exploring Global Patterns of Dietary Behaviors Associated with Health Risk
Are certain groups or possibly subgroups more vulnerable?

Understand how to compare subpopulations at a deeper level.
Case Study 1: Opioids in the United States
Case Study 2: Disparities in Youth Disconnection
How does something compare across regions?

Often it is useful to investigate if data differs by region, as many environmental, cultural, and political differences can influence public health outcomes.
Case Study 1: Opioids in the United States
Case Study 2: Predicting Annual Air Pollution
How can I predict outcomes for new data?

Learn how the data might look next year or for locations that you don’t have data about.
Case Study 1: Predicting Annual Air Pollution
Does this influence my data?

Analyze how a variable influences another variable.
Case Study 1: Influence of Multicollinearity on Measured Impact of Right-to-Carry Gun Laws
Are these two variables related to one another?

Understand how two variables are related and how strongly they are related to one another.
Case Study 1: Exploring CO2 Emissions Across Time
How can I display this data for others to find and interpret and use easily?

Make it easy for others to find your data, see the major trends in your data, or search for specific values in your data.
Case Study 1: School Shootings in the United States

Data Type

Data can come from many different sources, from the more obvious, like an Excel file, to the less obvious, like an image or a website. These case studies demonstrate how to use data from a variety of possible sources.

PDF

Using data from a PDF or just parts of a PDF can be challenging. You could type the data into a new Excel file, but this can introduce mistakes and is difficult to reproduce.
Case Study 1: Exploring Global Patterns of Dietary Behaviors Associated with Health Risk
Case Study 2: Exploring Global Patterns of Obesity Across Rural and Urban Regions
Case Study 3: Disparities in Youth Disconnection
CSV

Data are often in CSV files and it is typically easy to import data and work with data in this form. However, sometimes it can be difficult if, for example, the first few lines are structured differently or if you have unusual missing value indicators.
Case Study: Exploring CO2 Emissions Across Time
Website

If you find data on a website that doesn’t allow you to download it in a convenient way, you can import the data directly into the R programming language.
Case Study: Mental Health of American Youth
Excel

This is one of the most common data forms, and it is typically easy to import data and work with data in this form. However, sometimes it can be challenging, especially if you have many files.
Case Study: Vaping Behaviors in American Youth
Image text

You can extract text from image files. This can be useful if, for example, you want to only use certain parts of a PDF.
Case Study: Disparities in Youth Disconnection
API

It is possible to find the data that you need to use from an application programming interface (API).
Case Study: Opioids in the United States
Google Sheet

You can download data from a Google Sheet, copy and paste it into Excel, or import the data directly into the R programming language.
Case Study: School Shootings in the United States
Survey data/Code books

Working with survey data requires special care and attention, and you can do this directly in the R programming language.
Case Study: Vaping Behaviors in American Youth
Multiple files

If you find that you need to import data from multiple files, there is a more efficient way to do so without importing each one by one.
Case Study: Vaping Behaviors in American Youth

Data Wrangling

Data wrangling is the process of organizing your data in a more useful format. These case studies explore how to clean, rearrange, reshape, modify, filter, combine, or join your data.

Extracting data from a PDF

Extracting and organizing data from a PDF will make it easier to use.
Case Study 1: Disparities in Youth Disconnection
Case Study 2: Exploring Global Patterns of Dietary Behaviors Associated with Health Risk
Geocoding data

The process of assigning relevant latitude and longitude coordinates to data values is called geocoding. This can be helpful (although not always necessary) to create a map of your data.
Case Study 1: School Shootings in the United States
Recoding data

If you have data values that are confusing and could be changed to something better, or if you want to convert your data to true or false, you might want to consider recoding these values.
Case Study 1: Vaping Behaviors in American Youth
Methods of joining data

Sometimes, you obtain data from multiple sources that need to be combined together.
Case Study 1: Exploring Global Patterns of Dietary Behaviors Associated with Health Risk
Filtering data

Perhaps you need to filter your data for only specific values of given variables. For example, you might want to filter census employment data to only the values for females who are also Black and live in Connecticut.
Case Study 1: Disparities in Youth Disconnection
Modifying data (normalizing, transforming, scaling etc.)

Sometimes it is difficult to know when or how to normalize data.
Case Study 1: Opioids in the United States
Working with text

You can work with, remove, replace, or change words, phrases, letters, numbers, or punctuation marks in your data.
Case Study 1: Exploring Global Patterns of Dietary Behaviors Associated with Health Risk
Case Study 2: Disparities in Youth Disconnection
Case Study 3: Exploring Global Patterns of Obesity Across Rural and Urban Regions
Reshaping data

Sometimes it is useful to shape your data so that you have many columns (for example, when performing certain analyses). At other times (for example, when creating plots), it can be useful to collapse multiple columns into fewer columns with more rows.
Case Study: Exploring CO2 Emissions Across Time
Repetitive process

Sometimes you need to wrangle multiple datasets from different sources in a similar manner.
Case Study: Vaping Behaviors in American Youth

Data Visualization

A picture is worth a thousand words, particularly when it comes to interpreting data. These case studies demonstrate how to make effective visualizations in various contexts. The first ten represent basic visualizations while 11-22 are more advanced.

A table that is easy to interpret

Adding colors or simple graphics can make tables easier to interpret.
Case Study: Opioids in the United States
Scatter plot

Scatter plots can be a strong option for evaluating the relationship between variables, and especially for evaluating changes in a variable over time.
Case Study: Exploring CO2 Emissions Across Time
Line plot

Line plots are often useful for evaluating changes over time.
Case Study 1: Vaping Behaviors in American Youth
Case Study 2: Mental Health of American Youth
Bar plot

Bar plots are a good choice if you want to compare data to a threshold.
Case Study: Disparities in Youth Disconnection
Box plots

Box plots are particularly useful for comparing groups with many data values. They provide information about the spread of the data.
Case Study: Exploring Global Patterns of Dietary Behaviors Associated with Health Risk
Pie chart/waffle plot

Pie charts or waffle plots can be a strong option when comparing relative percentages.
Case Study: School Shootings in the United States
Heat map

It can be difficult to visualize multiple groups simultaneously. In these situations, heat maps can be a great option.
Case Study: Exploring CO2 Emissions Across Time
Correlation plots

If you have many variables and need to know if they are correlated to one another, there are methods to efficiently check this.
Case Study: Predicting Annual Air Pollution
Visualize missing data

It can be helpful to quickly identify how much of your data is missing (has NA values).
Case Study: Opioids in the United States
Create a map of your data

Often the best way to interpret regional differences in data is to make a map.
Case Study: Predicting Annual Air Pollution
Advanced Visualizations
Matching a style

If you are working with collaborators, you can make your visualizations match the style of their figures.
Case Study: Disparities in Youth Disconnection
Faceted plots allow you to quickly create multiple plots at once

It can be difficult to visualize multiple groups at the same time, so faceted plots are a great option in this situation.
Case Study: Vaping Behaviors in American Youth
Adding labels directly to plots with many different groups

If you compare many groups over time, for example, it can be difficult to see which line corresponds to which group. Adding labels directly to these lines can be very helpful and negates the need for an overcomplicated legend.
Case Study 1: Exploring CO2 Emissions Across Time
Case Study 2: Mental Health of American Youth
Emphasize a particular group

Sometimes you will have several different groups and you want to highlight a specific group.
Case Study: Exploring Global Patterns of Obesity Across Rural and Urban Regions
Adding annotations to plots

Adding labels, such as thresholds, arrows, or equations, can make it easier for people to interpret your plot.
Case Study 1: Disparities in Youth Disconnection
Case Study 2: Exploring Global Patterns of Obesity Across Rural and Urban Regions
Case Study 3: Influence of Multicollinearity on Measured Impact of Right-to-Carry Gun Laws
Add error bars to your plot

Adding error bars can help convey information about the confidence of the estimates in your plots.
Case Study 1: Opioids in the United States
Combine multiple plots together

Sometimes it is useful to put a variety of plots together and add text to explain what the plot shows.
Case Study 1: Influence of Multicollinearity on Measured Impact of Right-to-Carry Gun Laws
Case Study 2: Mental Health of American Youth
Case Study 3: Opioids in the United States
Create an interactive plot when you have too many groups to label

If you compare a very large number of groups, it can be difficult to tell what is happening. Often it can help to make the plot interactive so that the user can hover over points or lines to see what they indicate.
Case Study 1: Exploring Global Patterns of Dietary Behaviors Associated with Health Risk
Case Study 2: Opioids in the United States
Create an interactive map of your data

Sometimes it is easiest to see regional differences by interacting with and exploring an interactive map.
Case Study: School Shootings in the United States
Create an interactive table of your data

Sometimes you might want to be able to search through your data or allow others to easily do so.
Case Study: School Shootings in the United States
Add images to your figures

Adding an image, such as a logo, to a plot can be helpful.
Case Study: Disparities in Youth Disconnection
Create an interactive dashboard/website for your data

Dashboards can quickly convey major trends in a dataset, and they can also allow users to interact with the data to choose what aspects about the data they wish to explore.
Case Study: School Shootings in the United States

Data Analysis

To better understand data, it is helpful to use statistical tests. These case studies demonstrate a variety of statistical tests and concepts.

t-tests

Are two groups different?
Case Study: Exploring Global Patterns of Obesity Across Rural and Urban Regions
Correlation

Are two variables related to one another?
Case Study: Exploring CO2 Emissions Across Time
ANOVA

Are multiple groups different?
Case Study: Exploring Global Patterns of Dietary Behaviors Associated with Health Risk
Linear regression

Would you like to compare groups?
Case Study 1: Exploring Global Patterns of Dietary Behaviors Associated with Health Risk
Case Study 2: Influence of Multicollinearity on Measured Impact of Right-to-Carry Gun Laws
Chi-squared test of independence

Do the frequencies of two groups suggest that they are independent?
Case Study: Mental Health of American Youth
Mann-Kendall Trend test

Is there a consistent change over time?
Case Study: Disparities in Youth Disconnection
Machine learning

Would you like to predict data?
Case Study: Predicting Annual Air Pollution
Calculating percentages with missing data

Would you like to calculate percentages, but you are missing some data?
Case Study: School Shootings in the United States

As part of the larger Open Case Studies project (OCS) at opencasestudies.org, these case studies were developed for and funded by the Bloomberg American Health Initiative. The OCS project is made up of a team of researchers at the Johns Hopkins Bloomberg School of Public Health (JHSPH).

Let us know how the Open Case Studies project has enhanced your educational curriculum or ability to tackle tough data-rich research projects.

Stephanie Hicks, PhD, MA – Assistant Professor, Principal Investigator

Carrie Wright, PhD – Research Associate

Leah Jager, PhD – Assistant Scientist

Margaret Taub, PhD – Associate Scientist

Michael Ontiveros, MHS – Research Assistant

Kexin (Sheena) Wang, MSE – Research Assistant

John Muschelli, ScM, PhD – Associate Scientist

JHSPH Faculty Contributors

Jessica Fanzo, PhD

Brendan Saloner, PhD

Megan Latshaw, PhD, MHS

Renee M. Johnson, PhD, MPH

Daniel Webster, ScD, MPH

Elizabeth Stuart, PhD

Bloomberg American Health Initiative

Joshua M. Sharfstein, MD – Director, Bloomberg American Health Initiative

Michelle Spencer, MS – Associate Director, Bloomberg American Health Initiative

Paulani Mui, MPH – Special Projects Officer, Bloomberg American Health Initiative

Other Contributors

Aboozar Hadavand, PhD, MA, MS, Minerva University

Roger Peng, PhD, MS, Johns Hopkins Bloomberg School of Public Health

Kirsten Koehler, PhD, MS, Johns Hopkins Bloomberg School of Public Health

Alex McCourt, PhD, JD, MPH, Johns Hopkins Bloomberg School of Public Health

Ashkan Afshin, MD, ScD, MPH, MSc, University of Washington and Institute for Health Metrics and Evaluation (IHME)

Erin Mullany, BA, Institute for Health Metrics and Evaluation (IHME)

External Review Panel

Leslie Myint, PhD, Macalester College

Shannon E. Ellis, PhD, University of California – San Diego

Christina Knudson, PhD, University of St. Thomas

Michael Love, PhD, University of North Carolina

Nicholas Horton, ScD, Amherst College

Mine Çetinkaya-Rundel, PhD, University of Edinburgh, Duke University, RStudio
