Stata 101: The Epidemiologist's Interface

A practical introduction to Stata for epidemiologists: the interface, the do-file editor and the first commands needed to understand a dataset.

Stata is one of the primary statistical environments used in epidemiology, public health and health services research. It has a consistent command structure and a do-file system that makes reproducibility straightforward rather than optional.

1. The anatomy of Stata

Stata's layout is deliberately constrained. The interface uses four windows, each serving a fixed purpose. Learn what each one does and you understand how Stata expects you to work.

[Figure: the main Stata interface (Stata 18 on macOS), showing the results, command, variables and review windows.]

  • The Results window: a scrolling log where Stata prints models, tables and red error messages.
  • The Command window: a place for isolated commands and quick checks, but not the home of publishable analysis.
  • The Variables window: a live list of the variables currently in memory, with their names, storage types and labels.
  • The Review or History window: a running ledger of every command executed in the session.

2. The golden rule: the do-file editor

For an epidemiologist handling sensitive health records, reproducibility is the foundation of the work. Publishable analysis is not built by clicking through dropdown menus.

Every action, from loading the raw dataset to recoding categorical variables to running the final regression model, should be written sequentially in a Stata script. This is called a do-file.

[Figure: the do-file editor (Stata 18 on macOS), where the reproducible pipeline lives.]

Open the do-file editor by typing doedit in the Command window. Write code there, highlight a block and click the Execute button to run it. Saving the do-file preserves the whole pipeline, so colleagues can review it and peer reviewers can audit it.
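As a minimal sketch of what such a do-file might look like, the skeleton below pins the Stata version, opens a log and keeps every step in order. The log filename bp_first_look.log is a hypothetical placeholder, and version 18 assumes the release shown in the screenshots; adjust both to your own setup. The dataset is the built-in example used later in this guide.

```stata
* Hypothetical do-file skeleton: the filename and version number are assumptions

* Interpret the commands below as Stata 18 would
version 18

* Start from an empty session and close any log left open by an earlier run
clear all
capture log close

* Record everything printed to the Results window in a plain-text log
log using "bp_first_look.log", replace text

* Analysis steps go here, in the order they should run
sysuse bplong, clear
describe

* Close the log so the file is complete on disk
log close
```

Running the saved file from top to bottom with do followed by its filename is a useful final check that the pipeline works without any manual steps.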

3. Getting to know your data

Before estimating a hazard ratio or fitting a logistic regression, you need to understand the shape of the cohort. Stata ships with example datasets for learning. This guide uses bplong, a fictional patient cohort with blood pressure measured at more than one time point.

Download the companion do-file

I have prepared an annotated Stata script that mirrors the workflow below. Download it, open it in Stata and run the blocks step by step.


Type this into the do-file and execute it block by block:

* 1. Clear memory before starting
clear all

* 2. Load the built-in blood pressure dataset
sysuse bplong

* 3. Describe the schema of the dataset
describe

The describe command reports how many observations and variables the dataset contains, along with each variable's name, storage type and label. It does not reveal missing values, implausible ranges or inconsistent coding. When loading real clinical data, the first check should usually be codebook.

* Get a granular breakdown of every variable, including missing values
codebook

codebook prints a detailed report for every variable, including the number of unique values, the number of missing observations and any value labels attached to numeric categories.
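On wide datasets the full codebook output can run to many pages. The sketch below shows a few ways to narrow it down, using options and commands that ship with Stata. The choice of bp as the variable to inspect is only for illustration.

```stata
* Load the example data so this block runs on its own
sysuse bplong, clear

* One line per variable: observations, unique values, mean, min, max, label
codebook, compact

* Full report for a single variable
codebook bp

* Count missing values; only variables that have any are listed
misstable summarize
```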

Once you are satisfied the data is clean, generate summary statistics for continuous variables:

* Generate full summary statistics for the continuous blood pressure variable
summarize bp, detail
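
Because bplong records blood pressure on more than one occasion per patient, a pooled summary can hide differences between occasions. As a sketch, assuming the when variable that bplong uses to mark the measurement occasion, the block below summarizes bp within each level of when and shows how to reuse the results that summarize stores.

```stata
* Load the example data so this block runs on its own
sysuse bplong, clear

* Detailed summary of bp within each measurement occasion
bysort when: summarize bp, detail

* Compact table of selected statistics by occasion
tabstat bp, by(when) statistics(n mean sd p50 min max)

* summarize leaves its results in r(); list them and reuse one
summarize bp, detail
return list
display "Median blood pressure: " r(p50)
```

Stored results such as r(p50) are overwritten by the next command that stores results, so copy any value you need into a local macro or scalar straight away.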

For categorical epidemiological risk factors, such as smoking status or deprivation quintiles, use frequency tables and cross-tabulations:

* Generate a frequency table for patient sex
tabulate sex

* Generate a cross-tabulation of sex against age group, including row percentages
tabulate sex agegrp, row
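
tabulate accepts further options that are often useful at this stage. The following sketch adds column percentages and a chi-squared test to the cross-tabulation, and uses the missing option so that any missing values appear as their own category. Because each patient appears more than once in bplong, the chi-squared result here only illustrates the syntax and should not be read as a valid test.

```stata
* Load the example data so this block runs on its own
sysuse bplong, clear

* One-way table that shows missing values as a separate category
tabulate agegrp, missing

* Cross-tabulation with row and column percentages and Pearson's chi-squared
tabulate sex agegrp, row column chi2

* One-way tables for several variables in a single command
tab1 sex agegrp when
```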

Syntax logic: Stata commands follow a consistent structure, [command] [variables], [options]. Everything before the comma states what you want Stata to do and to which variables; everything after it controls how the command does it, as in , detail or , row.
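
The same pattern extends to the if and in qualifiers, which restrict a command to a subset of observations and sit between the variable list and the comma. A brief sketch, in which the cut-off of 150 is an arbitrary illustrative value rather than a clinical threshold:

```stata
* Load the example data so this block runs on its own
sysuse bplong, clear

* [command] [variables] [if], [options]
summarize bp if bp >= 150, detail

* [command] [variables] [in]: first five observations only
list patient when bp in 1/5

* [command] [if]: count the observations that meet a condition
count if bp >= 150
```

Typing help followed by a command name, for example help summarize, shows that command's full syntax diagram, including which qualifiers and options it accepts.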

Further resources