Calculate Median Of Variable In Stata

This guide shows how to find the median of a variable, how Stata reports it, and the exact command syntax you can use in your workflow.

How to calculate median of a variable in Stata

Calculating the median of a variable in Stata is straightforward once you know which command matches your goal. In applied statistics, the median is one of the most useful measures of central tendency because it identifies the middle value in an ordered distribution and is far less sensitive to outliers than the mean. In real-world datasets such as wages, home prices, medical costs, or household wealth, skewness is common. That makes the median especially valuable for policy analysis, academic research, business intelligence, and data quality review.

In Stata, users often calculate medians for one of four reasons: to summarize a single variable, to compare groups, to generate a new variable containing the median by category, or to produce publication-ready output. Although many analysts begin with the plain summarize command, that command does not report the median unless you add the detail option. Stata users therefore typically rely on summarize with detail, centile, tabstat, or egen, and on the by prefix or collapse for grouped work. Choosing the right approach depends on whether you want a descriptive result in the Results window or a transformed dataset that stores medians for later use.

Quickest Stata commands for the median

The fastest way to view the median of a variable is usually:

summarize income, detail

In the detailed output, Stata reports the 50th percentile, which is the median. If you want a direct percentile command, this is also common:

centile income, centile(50)

If you want a compact table of summary statistics including the median, use:

tabstat income, statistics(n mean p50 sd min max)

Here, p50 is Stata’s name for the 50th percentile, which is the median. tabstat also accepts median as a synonym for p50.
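The three commands above can be checked against each other on a tiny dataset. The sketch below is only an illustration: the four values are made up, and the scalar names med_summarize and med_centile are hypothetical. It relies on the stored results r(p50) (left by summarize, detail) and r(c_1) (left by centile).

```stata
* Hypothetical toy data: four made-up values
clear
input income
10
14
18
22
end

* summarize, detail leaves the median in r(p50)
summarize income, detail
scalar med_summarize = r(p50)

* centile leaves the first requested centile in r(c_1)
centile income, centile(50)
scalar med_centile = r(c_1)

* tabstat shows the same statistic in the p50 column
tabstat income, statistics(n mean p50 sd min max)

display "summarize: " med_summarize "   centile: " med_centile
```

With an even number of observations, the median is the average of the two middle values, so both scalars should equal 16 here.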

Why the median matters in skewed data

The median is often a better summary than the mean when data are not symmetric. For example, income distributions are usually right-skewed because a small number of very high earners pull the mean upward. The median resists this distortion. According to U.S. Census Bureau reporting, median household income is a standard benchmark for describing the economic center of households because it is easier to interpret under skewed distributions than the arithmetic mean. For researchers working with administrative records, survey microdata, health expenditures, and education outcomes, the median is often the preferred measure when the goal is to describe a typical unit rather than the total mass of the distribution.

| Dataset example | Mean | Median | Interpretation |
|---|---|---|---|
| Household income sample | $78,400 | $61,200 | High earners increase the mean more than the median |
| Emergency room charges | $3,950 | $1,780 | Extreme claims create a large right tail |
| Home sale prices | $412,000 | $356,000 | Luxury sales push the mean above the middle case |

Notice that in each example the median is lower than the mean. That pattern is typical of right-skewed data. If your Stata variable looks similar, reporting the median can produce a more representative summary of the typical observation.
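To see the same pattern in Stata, here is a minimal sketch with one extreme value. The five incomes are invented for illustration and the scalar names are hypothetical. The single large value pulls the mean far above the median.

```stata
* Hypothetical right-skewed sample: four ordinary values and one outlier
clear
input income
30
35
40
45
500
end

summarize income, detail
scalar mean_inc = r(mean)
scalar med_inc  = r(p50)

display "mean   = " mean_inc
display "median = " med_inc
display "mean minus median = " mean_inc - med_inc
```

Here the mean is 130 while the median stays at 40, the middle of the five sorted values.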

Step by step: calculate a median for one variable

  1. Open your dataset in Stata.
  2. Identify the variable of interest, such as income, age, or cost.
  3. Run summarize variable, detail.
  4. Look for the 50% line in the percentile output.
  5. Interpret that value as the median.

Example:

use mydata.dta, clear
summarize income, detail

In the detailed summary, Stata lists percentiles like 1%, 5%, 10%, 25%, 50%, 75%, 90%, 95%, and 99%. The 50% figure is your median. This method is ideal when you simply want to inspect the variable and understand its distribution at the same time.
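If you want to reuse the median rather than read it off the screen, one option is to pick it up from the stored results. The sketch below assumes the auto dataset that ships with Stata in place of your own file. The local macro med and the variable above_median are hypothetical names.

```stata
* Example dataset shipped with Stata, standing in for your own data
sysuse auto, clear

summarize price, detail
return list                  // lists every stored result, including r(p50)

* Keep the median in a local macro for later commands
local med = r(p50)

* Flag observations strictly above the median
generate byte above_median = price > `med' if !missing(price)
tabulate above_median
```

Stored results are overwritten by the next r-class command, so copy r(p50) into a local or scalar right away.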

Using centile for exact percentile reporting

The centile command is useful when your analysis emphasizes percentiles. It is often used in distributional analysis, inequality research, and clinical outcome reporting, and it reports a confidence interval alongside each estimate. To obtain only the median:

centile income, centile(50)

You can also request several percentile points in one line:

centile income, centile(25 50 75)

This gives you the first quartile, median, and third quartile. For many reports, this is more informative than a single central tendency measure because it shows the middle spread of the data.
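As a rough sketch, the quartiles can also be read back from centile's stored results, where r(c_1), r(c_2), and r(c_3) follow the order of the requested list. The seven values and the scalar names q1, med, and q3 are made up for illustration.

```stata
* Hypothetical toy data: seven evenly spaced values
clear
input income
10
20
30
40
50
60
70
end

centile income, centile(25 50 75)

* r(c_#) holds the #th requested centile
scalar q1  = r(c_1)
scalar med = r(c_2)
scalar q3  = r(c_3)

display "Q1 = " q1 "   median = " med "   Q3 = " q3 "   IQR = " q3 - q1
```

For these values the quartiles fall on the second, fourth, and sixth observations: 20, 40, and 60.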

How to calculate median by group in Stata

A very common task is to calculate the median of a variable within categories such as sex, region, treatment status, or year. There are multiple ways to do that. If you want a display table:

tabstat income, by(region) statistics(n p50 mean sd)

If you want to create a variable containing the group-specific median for each observation, use egen:

bysort region: egen median_income = median(income)

This command is powerful because it writes the median back into the dataset. Every observation in the same region will receive that region’s median income. That makes later modeling, deviation checks, and chart preparation much easier.
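A small illustration of both grouped approaches, using invented data with two regions. The variable dev_from_median is a hypothetical name for a deviation-from-median check.

```stata
* Hypothetical toy data: region 1 has three observations, region 2 has two
clear
input region income
1 10
1 20
1 30
2 5
2 15
end

* Display-only table of group medians
tabstat income, by(region) statistics(n p50 mean sd)

* Write each region's median back onto every row of that region
bysort region: egen median_income = median(income)

* Example follow-up: distance of each observation from its group median
generate dev_from_median = income - median_income
list, sepby(region)
```

Every row in region 1 receives 20 and every row in region 2 receives 10, so the new variable is constant within each group.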

| Goal | Recommended Stata command | Main advantage |
|---|---|---|
| See one variable’s median | summarize var, detail | Quick and familiar |
| Report percentiles precisely | centile var, centile(50) | Focused percentile output |
| Compare medians across groups | tabstat var, by(group) statistics(p50) | Clean grouped table |
| Create a new grouped median variable | bysort group: egen newvar = median(var) | Stores values for later analysis |
| Collapse data to medians | collapse (median) var, by(group) | Produces a smaller summary dataset |

Median versus mean in Stata output

Analysts sometimes confuse the purpose of the mean and the median because both are measures of center. In Stata, summarize var without options returns the mean but not the median. That is adequate for normal or approximately symmetric data, but insufficient for skewed variables. If your data include large outliers, top-coded values, or long tails, median reporting is usually the better descriptive choice. In economics, epidemiology, public policy, and social science, it is common to present both numbers together so readers can assess skewness quickly.

A practical rule is this: if the mean and median are close, the distribution may be fairly symmetric. If they differ greatly, check the distribution more carefully with histograms, box plots, and percentile summaries. Stata makes this easy by combining descriptive statistics with graphics.

How missing values affect the median

Stata generally excludes missing numeric values from statistical calculations, including median estimates. That is usually what you want, but you should still inspect the amount and pattern of missingness. If a substantial share of observations is missing, the reported median may not represent the underlying population well. For that reason, many analysts first run commands such as:

misstable summarize income
count if missing(income)

Then they calculate the median only after understanding whether the available data are complete enough for sound interpretation.
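The sketch below shows the effect on a made-up variable with two missing values. The scalar names are hypothetical. The median is computed from the three nonmissing observations only.

```stata
* Hypothetical toy data: five rows, two of them missing
clear
input income
10
.
20
30
.
end

* How much is missing?
misstable summarize income
count if missing(income)
scalar n_missing = r(N)

* summarize drops the missing rows before computing percentiles
summarize income, detail
scalar n_used = r(N)
scalar med    = r(p50)

display "missing: " n_missing "   used: " n_used "   median: " med
```

Comparing r(N) from summarize with the total number of rows is a quick way to see how many observations the median actually rests on.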

Creating publication-ready median results

If your goal is reporting rather than exploration, consider storing and exporting the results. A simple workflow is to use tabstat or table for display and then export with Stata’s reporting tools, such as putexcel or putdocx. Another strategy is to calculate medians into a reduced file with collapse. Example:

collapse (median) income age, by(region year)

This command transforms the dataset into one row per region-year combination with medians for income and age. It is extremely useful for dashboards, panel summaries, and trend charts.
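To make the reshaping concrete, here is a minimal sketch on invented data with three region-year cells. All values are hypothetical.

```stata
* Hypothetical toy data: seven rows across three region-year cells
clear
input region year income age
1 2020 10 30
1 2020 20 40
1 2020 30 50
1 2021 40 35
1 2021 60 45
2 2020 5  25
2 2020 15 55
end

* One row per region-year, holding the median of each listed variable
collapse (median) income age, by(region year)
list, sepby(region)
```

The seven original rows become three. The variables keep their names, but each now holds a cell median instead of an individual value.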

Common mistakes when calculating medians in Stata

  • Using summarize without detail and assuming the mean is the median.
  • Using the by prefix on unsorted data (bysort sorts for you) or grouping on the wrong variable.
  • Ignoring missing values and data entry errors before computing the statistic.
  • Using a grouped median variable in regression without understanding what it represents.
  • Collapsing the data without saving a backup of the original dataset first.

Tip: When you use collapse, Stata replaces the dataset in memory with the summarized version. Save first, or wrap the step in preserve and restore, if you need the original observations later.
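One way to follow that tip is sketched below, assuming the auto dataset shipped with Stata as a stand-in for your own data. The tempfile name medians and the local n_before are hypothetical.

```stata
sysuse auto, clear
local n_before = _N

* Temporary file for the summary; it is removed when the do-file ends
tempfile medians

preserve
    * Inside preserve, collapse can safely replace the data in memory
    collapse (median) price mpg, by(foreign)
    list
    save `medians'
restore

* The original observations are back after restore
display "observations before: `n_before'   after restore: " _N
```

Replace the tempfile with a named file if the summary needs to outlive the session.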

Recommended workflow for robust median analysis

  1. Inspect variable type and labels using describe.
  2. Check missingness and invalid values.
  3. Run summarize, detail to inspect percentiles and spread.
  4. Compare mean and median to assess skewness.
  5. Use tabstat or egen if grouped summaries are needed.
  6. Document the command syntax in your do-file for reproducibility.
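The workflow above could be written as a short do-file along these lines. This is a sketch only: it uses the auto dataset shipped with Stata, and the check for nonpositive prices is just one hypothetical example of an invalid-value rule.

```stata
sysuse auto, clear

* 1. Variable type and labels
describe price foreign

* 2. Missingness and implausible values
misstable summarize price
count if price <= 0

* 3 and 4. Percentiles, spread, and mean versus median
summarize price, detail
display "mean minus median = " r(mean) - r(p50)

* 5. Grouped summaries: a table, then a stored group median
tabstat price, by(foreign) statistics(n p50 mean sd)
bysort foreign: egen median_price = median(price)
```

Keeping these steps in one do-file covers step 6, because the file itself records the exact syntax used.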

Example Stata code patterns

* Single variable median
summarize income, detail

* Direct percentile approach
centile income, centile(50)

* Grouped medians in a table
tabstat income, by(sex) statistics(n p50 mean sd)

* Create a median within each state
bysort state: egen state_median_income = median(income)

* Reduce data to median by year
collapse (median) income, by(year)

Authoritative references and statistical context

For broader statistical context, official and academic sources explain why medians are so widely used in public data reporting. The U.S. Census Bureau regularly publishes median-based income summaries because they are more stable under skewed distributions. The U.S. Bureau of Labor Statistics provides labor market and earnings statistics where medians are often more interpretable than averages. For foundational statistical learning, the Penn State Department of Statistics offers clear instruction on medians, percentiles, and distribution shape.

Final takeaway

To calculate the median of a variable in Stata, the best command depends on your task. If you just want the number, use summarize var, detail or centile var, centile(50). If you need a grouped display, use tabstat. If you want to store medians in your dataset, use egen. If you want a collapsed summary dataset, use collapse. The median is especially useful for skewed or outlier-prone data, making it a core descriptive statistic in serious quantitative analysis.
