01

Welcome To Open Case Studies

Connecting you with real-world public health data.

The Open Case Studies project showcases what can be achieved when working with real-world data.

Housed in a freely accessible GitHub repository, the project’s self-contained and experiential guides demonstrate the data analysis process and the use of various data science methods, tools, and software in the context of messy, real-world data.

These case studies will empower current and future data scientists to leverage real-world data to solve leading public health challenges.

02

Who Are Open Case Studies For?

Your experiential guide to the power of data analysis.

The Open Case Studies project provides insights about gathering and working with data for students, instructors, and those with experience in data science or statistical methods at nonprofit organizations and public sector agencies.

Each case study in the project focuses on an important public health topic and introduces methods to provide users with the skills and knowledge for greater legibility, reproducibility, rigor, and flexibility in their own data analyses.

03

Case Study Bank Overview

Real data on ten public health challenges in the U.S.

The following in-depth case studies use real data and focus on five areas of public health that are particularly pressing in the United States.

Addiction & Overdose

Vaping Behaviors in American Youth

This case study explores trends in tobacco product usage among American youths surveyed in the National Youth Tobacco Survey (NYTS) from 2015 to 2019. It demonstrates how to use survey data and code books and introduces writing functions to repeatedly wrangle datasets that are similar but slightly different. The case study introduces packages that account for survey weighting and survey design when comparing vaping product usage among different groups, and covers how to use logistic regression to compare groups on a binary variable (in this case, whether or not a respondent used vaping products). This case study also covers how to make visualizations of multiple groups over time with confidence interval error bars.

Addiction & Overdose

Opioids in the United States

This case study examines the number of opioid pills (specifically oxycodone and hydrocodone, the top two misused opioids) shipped to pharmacies and practitioners at the county level across the United States from 2006 to 2014, using data from the Drug Enforcement Administration (DEA). It demonstrates how to get data from an application programming interface (API). It explores why and how to normalize data, as well as why and how to stratify or redefine groups. It also shows how to compare two independent groups when the data is not normally distributed, using the Wilcoxon rank sum test (also called the Mann-Whitney U test), and how to add confidence intervals to plots using a method called bootstrapping.

Adolescent Health

Disparities in Youth Disconnection

This case study focuses on rates of disconnection among youth (people aged 16 to 24 who are neither working nor in school) across different racial, ethnic, and gender subgroups, to identify subgroups that may be particularly vulnerable. It demonstrates that deeper inspection of subgroups reveals differences that are not otherwise discernible, how to import data from a PDF using screenshots of sections of the PDF, and how to use the Mann-Kendall trend test to check whether disconnection rates move in a consistent direction over time. This case study also shows how to make a visualization that stylistically matches an existing report, how to add images to plots, and how to create effective bar plots for multiple comparisons across several groups.

Adolescent Health

Mental Health of American Youth

This case study investigates how the rate of self-reported symptoms of major depressive episodes (MDE) changed among American youth (ages 12 to 17) from 2004 to 2018. It describes the impact of self-reporting bias in surveys, how to get data directly from a website, and how to use a chi-squared test to determine whether two variables are independent (in this case, whether the sex of the students influenced the frequency of reported MDE symptoms in 2004 and 2018). This case study also demonstrates how to add direct labels to visualizations with many groups across time, as well as how to create an animated GIF.

Environmental Challenges

Exploring CO2 Emissions Across Time

This case study investigates how CO2 emissions have changed since the 1700s and how emission levels compare among countries around the world. It explores how yearly average temperature and the number of natural disasters in the United States have changed over time, and introduces how to examine whether two sets of data are correlated. This case study also covers in detail how to make heatmaps and other plots that visualize multiple groups over time, including how to add labels directly to lines on plots with multiple lines.

Environmental Challenges

Predicting Annual Air Pollution

This case study uses machine learning methods to predict annual air pollution levels spatially within the United States based on population density, urbanization, road density, satellite pollution data, and chemical modeling data, among other predictors. Machine learning methods are used to predict air pollution levels where traditional monitoring systems are not available or where current monitoring systems lack sufficient spatial granularity. The case study also demonstrates how to visualize data using maps.

Obesity & The Food System

Exploring Global Patterns of Obesity Across Rural and Urban Regions

This case study compares average Body Mass Index measurements for males and females in rural and urban regions of over 200 countries around the world, with a particular emphasis on the United States. It provides a thorough introduction to wrangling data from a PDF, shows how to compare two paired groups in R using the t-test and the nonparametric Wilcoxon signed-rank test, and demonstrates how to make visualizations of group comparisons that emphasize a particular subset of the data.

Obesity & The Food System

Exploring Global Patterns of Dietary Behaviors Associated with Health Risk

This case study investigates the consumption of dietary factors associated with health risk among males and females from over 200 countries around the world, with a particular emphasis on the United States. It demonstrates how to wrangle data from a PDF; how to combine data from two different sources; how to compare two paired groups and multiple paired groups using t-tests, ANOVA, and linear regression; how to create visualizations of several groups; and how to combine plots that have very different scales.

Violence

Influence of Multicollinearity on Measured Impact of Right-To-Carry Gun Laws

This case study focuses on two well-known studies that evaluated the influence of right-to-carry gun laws on violent crime rates. It demonstrates a phenomenon called multicollinearity, where explanatory variables that can predict one another lead to aberrant and unstable findings; how to make visualizations with annotations, such as arrows or equations; and how to combine multiple plots.

Violence

School Shootings in the United States

This case study illustrates ways to communicate trends in a dataset about the number and characteristics of school shooting events for students in grades K-12 in the United States since 1970. It demonstrates how to create a dashboard, which is a website that shows patterns in a dataset in a concise manner; how to import data from a Google Sheets document; how to create interactive tables and maps; and how to properly calculate percentages for data when there are missing values.

04

Which Case Study Is Right For Me?

Connecting with the public health data you need.

The Open Case Studies project approaches data in many different ways. The guide below will help connect you with a case study:

Questions

Data science projects often start with a question. Here, you may look for case studies that explore a question that is similar to one you are interested in investigating with your data.

Data Type

Data can come from many different sources, from the more obvious, like an Excel file, to the less obvious, like an image or a website. These case studies demonstrate how to use data from a variety of possible sources.

Data Wrangling

Data wrangling is the process of organizing your data in a more useful format. These case studies explore how to clean, rearrange, reshape, modify, filter, combine, or join your data.

Data Visualization

A picture is worth a thousand words, particularly when it comes to interpreting data. These case studies demonstrate how to make effective visualizations in various contexts. The first ten represent basic visualizations while 11-22 are more advanced.

Data Analysis

To better understand data, it is helpful to use statistical tests. These case studies demonstrate a variety of statistical tests and concepts.

05

About The Project

Learn about the team behind the Open Case Studies project.

As part of the larger Open Case Studies project (OCS) at opencasestudies.org, these case studies were developed for and funded by the Bloomberg American Health Initiative. The OCS project is made up of a team of researchers at the Johns Hopkins Bloomberg School of Public Health (JHSPH).

Let us know how the Open Case Studies project has enhanced your educational curriculum or ability to tackle tough data-rich research projects.

Stephanie Hicks, PhD, MA – Assistant Professor, Principal Investigator

Carrie Wright, PhD – Research Associate

Leah Jager, PhD – Assistant Scientist

Margaret Taub, PhD – Associate Scientist

Michael Ontiveros, MHS – Research Assistant

Kexin (Sheena) Wang, MSE – Research Assistant

John Muschelli, ScM, PhD – Associate Scientist

JHSPH Faculty Contributors

Jessica Fanzo, PhD

Brendan Saloner, PhD

Megan Latshaw, PhD, MHS

Renee M. Johnson, PhD, MPH

Daniel Webster, ScD, MPH

Elizabeth Stuart, PhD

Bloomberg American Health Initiative

Joshua M. Sharfstein, MD – Director, Bloomberg American Health Initiative

Michelle Spencer, MS – Associate Director, Bloomberg American Health Initiative

Paulani Mui, MPH – Special Projects Officer, Bloomberg American Health Initiative

Other Contributors

Aboozar Hadavand, PhD, MA, MS, Minerva University

Roger Peng, PhD, MS, Johns Hopkins Bloomberg School of Public Health

Kirsten Koehler, PhD, MS, Johns Hopkins Bloomberg School of Public Health

Alex McCourt, PhD, JD, MPH, Johns Hopkins Bloomberg School of Public Health

Ashkan Afshin, MD, ScD, MPH, MSc, University of Washington and Institute for Health Metrics and Evaluation (IHME)

Erin Mullany, BA, Institute for Health Metrics and Evaluation (IHME)

External Review Panel

Leslie Myint, PhD, Macalester College

Shannon E. Ellis, PhD, University of California – San Diego

Christina Knudson, PhD, University of St. Thomas

Michael Love, PhD, University of North Carolina

Nicholas Horton, ScD, Amherst College

Mine Çetinkaya-Rundel, PhD, University of Edinburgh, Duke University, RStudio