Stata 101: The Epidemiologist's Interface.
A practical introduction to Stata for epidemiologists: the interface, the do-file editor and the first commands needed to understand a dataset.
Stata is one of the primary statistical environments used in epidemiology, public health and health services research. It has a consistent command structure and a do-file system that makes reproducibility straightforward rather than optional.
1. The anatomy of Stata
Stata's layout is deliberately constrained. The interface is built around four main windows, each serving a fixed purpose. Learn what each one does and you understand how Stata expects you to work.
- The Results window: a scrolling log where Stata prints models, tables and red error messages.
- The Command window: a place for isolated commands and quick checks, but not the home of publishable analysis.
- The Variables window: a live list of the variables currently in memory, with their storage types and labels.
- The History window (called Review in older versions): a running ledger of every command executed in the session.
2. The golden rule: the do-file editor
For an epidemiologist handling sensitive health records, reproducibility is the foundation of the work. Publishable analysis is not built by clicking through dropdown menus.
Every action, from loading the raw dataset to recoding categorical variables to running the final regression model, should be written sequentially in a Stata script. This is called a do-file.
Open the do-file editor by typing doedit in the Command window. Write code there, highlight a block and click the Execute (do) button to run it. Saving the do-file preserves the whole pipeline so colleagues can review it and peer reviewers can audit it.
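As a minimal sketch of what such a do-file might look like, the skeleton below opens a log, loads data and closes the log again. The filenames and the version number are assumptions for illustration only; set version to the Stata release you are actually running.

```stata
* stata101_session.do (hypothetical filename)
version 16                      // assumption: replace with your own Stata release
clear all                       // start from an empty memory
capture log close               // close any log left open by an earlier run
log using "stata101_session.log", replace text   // hypothetical log name

sysuse bplong                   // built-in example dataset
describe                        // confirm the data loaded as expected

log close                       // finish the log so the run is fully recorded
```

Running the file from top to bottom each time, rather than only the most recent block, is one way to check that the pipeline still works from a clean start. The // comment style works inside do-files but not in the Command window.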
3. Getting to know your data
Before estimating a hazard ratio or fitting a logistic regression, you need to understand the shape of the cohort. Stata ships with built-in example datasets for learning. This guide uses bplong, a fictional patient cohort with repeated blood pressure measurements.
Download the companion do-file
I have prepared an annotated Stata script that mirrors the workflow below. Download it, open it in Stata and run the blocks step by step.
Type this into the do-file and execute it block by block:
* 1. Clear memory before starting
clear all
* 2. Load the built-in blood pressure dataset
sysuse bplong
* 3. Describe the schema of the dataset
describe
The describe command reports how many observations exist, which variables are available and how each one is stored. It does not tell you whether the data is messy. When loading real clinical data, the first check should usually be codebook.
* Get a granular breakdown of every variable, including missing values
codebook
codebook prints a detailed report for every variable, including unique values, missing observations and any labels attached to numeric categories.
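On a wide dataset the full codebook output can be long. The sketch below shows a few narrower checks that may be useful alongside it. It uses bplong only to demonstrate syntax, not because that dataset is known to have missing values.

```stata
sysuse bplong, clear

* One line per variable: observations, unique values, mean, min, max, label
codebook, compact

* Full codebook report for a single variable
codebook bp

* Report variables that contain missing values
misstable summarize

* Count observations with a missing blood pressure reading
count if missing(bp)
```

On real clinical extracts, misstable summarize is a quick way to see which variables need attention before any modelling starts.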
Once you are satisfied the data is clean, generate summary statistics for continuous variables:
* Generate full summary statistics for the continuous blood pressure variable
summarize bp, detail
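After summarize runs, Stata keeps the results in memory as r() values, which can be reused instead of copying numbers by hand. A minimal sketch, assuming bplong is the dataset in use:

```stata
sysuse bplong, clear

* Run the summary quietly, then inspect what Stata stored
quietly summarize bp, detail
return list

* Reuse the stored percentiles to report a median and interquartile range
display "Median bp: " r(p50) "  IQR: " r(p25) " to " r(p75)
```

The r() values are overwritten by the next command that stores results, so use them immediately or save them in a local macro or scalar.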
For categorical epidemiological risk factors, such as smoking status or deprivation quintiles, use frequency tables and cross-tabulations:
* Generate a frequency table for patient sex
tabulate sex
* Generate a cross-tabulation of sex against age group, including row percentages
tabulate sex agegrp, row
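A few other tabulate options are worth knowing early. The sketch below is illustrative only: bplong holds more than one row per patient, so treat the chi-squared output here as a syntax demonstration rather than a valid test.

```stata
sysuse bplong, clear

* Show missing values as their own category instead of dropping them silently
tabulate agegrp, missing

* Column percentages plus a Pearson chi-squared test of association
tabulate sex agegrp, column chi2

* Suppress raw frequencies and show only row percentages
tabulate sex agegrp, row nofreq
```

The missing option matters most with real data, where an unnoticed missing category can change every percentage in the table.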
Syntax logic: Stata commands follow a consistent structure, [command] [variables], [options]. The comma separates what you want Stata to do from the options that control how it does it, such as , detail or , row.
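The same structure extends to the if and in qualifiers and to the by prefix, which restrict or repeat a command without changing its basic shape. A short sketch using bplong, where the threshold of 140 is an arbitrary value chosen for illustration:

```stata
sysuse bplong, clear

* command + variable + option
summarize bp, detail

* command + variable + if qualifier + option
summarize bp if bp >= 140, detail

* command + variables + in range (first five observations)
list patient sex agegrp bp in 1/5

* by prefix: repeat the command within each level of sex
bysort sex: summarize bp
```

Qualifiers always sit before the comma and options after it, so a command can usually be read left to right as what, on which rows, and how.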
Further resources
- Official Stata cheat sheets by Dr Tim Essam and Dr Laura Hughes.
- UCLA OARC Stata web books for applied examples and statistical workflows.