Estimating Life Expectancy at Birth


To estimate life expectancy at birth, by race/ethnicity, gender, and geography, we used information on mortality and mid-year population estimates from the Centers for Disease Control and Prevention’s (CDC’s) Wide-ranging OnLine Data for Epidemiologic Research (WONDER) databases (the Compressed Mortality Data) and constructed abridged life tables. A life table is a table that includes the number of deaths, total population, probability of dying, and remaining life expectancy by single year of age for a given year or time period. Abridged life tables are similar, but present the information for age groups rather than by single year of age. Remaining life expectancy for each age group is largely a function of the probability of dying for people in their own age group and in older age groups.

To prepare the data, we made a series of extractions at the county, state, census region (Northeast, Midwest, South, and West), and national levels to derive data on the number of deaths and mid-year population counts by race/ethnicity or gender, and age group. This was done for both single years (e.g. 2005 and 2015) and for pooled years (e.g. 2001 through 2005 and 2011 through 2015). Multiple years of data were pooled together to improve the accuracy of our estimates at the county level (and the same pooling was applied to the state and national extractions for reasons of comparability). We then used the data to construct abridged life tables following the methodology described in an article by Chin Long Chiang.

Before constructing the abridged life tables, we adjusted the death counts to fill in undisclosed values, to account for deaths in which the age of the deceased was unknown, and to account for racial/ethnic misclassification. Our final set of adjusted death counts by age group was used to derive adjusted death probabilities (dividing by the corresponding mid-year population counts), which were used to construct abridged life tables and estimate life expectancy at birth (and by age group). Separate estimates were generated for non-Hispanic Whites, non-Hispanic African Americans, non-Hispanic Asians or Pacific Islanders, non-Hispanic Native Americans, males, females, and the total population for counties, metropolitan areas (as well as other regions defined by county groupings), states, and the nation as a whole. This document describes in more detail the adjustments made to the underlying data, the methods used to estimate life expectancy, and the measures taken to ensure data quality.

Estimating undisclosed death counts

In the publicly available WONDER Compressed Mortality Data, death counts are not disclosed if there are nine or fewer deaths in a given age group. The age groups come in four-year increments for ages under 25 and nine-year increments for ages 25 or older. Data for the following age groups were extracted:

Additional categories with death counts for persons of unknown age (see below for how we deal with this information) and for all ages combined were also included. For larger counties, states, and the nation as a whole, all of the death counts for each group by age and race/ethnicity or gender were disclosed. For smaller counties and states, however, some death counts were undisclosed – particularly for the younger age groups and for smaller racial/ethnic groups.

The construction of an abridged life table for a given population requires death counts and probabilities for all age groups. A series of substitutions was made to estimate death counts and probabilities for undisclosed age groups. First, we substituted the probability of dying from the next highest level of geography with disclosed data (for the given population and age group). For example, if the death rate for non-Hispanic Whites younger than one year old was undisclosed for a county, we applied the rate for that population and age group from the corresponding state. If the state-level death rate was also undisclosed, the rate from the corresponding census region was applied. Similarly, if the census-region death rate was undisclosed (which was rarely the case), we applied the probability of dying for the nation overall.
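The substitution cascade described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the level labels, and the use of None to mark an undisclosed rate are hypothetical and do not come from the original analysis.

```python
# Hypothetical sketch: pick the death rate from the lowest level of
# geography at which it is disclosed (None marks an undisclosed rate).
LEVELS = ("county", "state", "region", "nation")


def impute_death_rate(rates_by_level, levels=LEVELS):
    """Return (rate, level) for the first level with a disclosed rate."""
    for level in levels:
        rate = rates_by_level.get(level)
        if rate is not None:
            return rate, level
    raise ValueError("no disclosed death rate at any level of geography")


# Illustrative (made-up) rates for one population and age group.
example = {"county": None, "state": None, "region": 0.006, "nation": 0.0058}
rate, source_level = impute_death_rate(example)
```

In this made-up example the county and state rates are undisclosed, so the census-region rate is used.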

Once the probability of dying was imputed for age groups with undisclosed death counts, it was applied to the mid-year population estimate to derive an initial estimate of the number of deaths. The initial estimates were then adjusted to agree with the total number of deaths across all undisclosed age groups for the population under consideration. To understand the reasoning behind this adjustment, consider a population in which the death counts for the age groups less than one year and one to four years are undisclosed, and the initial estimated death counts based on applying death rates from a higher level of geography are eight and 13, respectively, for a total of 21. This initial total of 21 is unlikely to match the actual number of deaths in the undisclosed age groups, which can be derived from the original data by subtracting the total across disclosed age groups from the total across all age groups.

Now, suppose that the actual total for undisclosed age groups in this hypothetical example is 18. To ensure that our estimated death counts agree with the known total, we multiply each initial estimate by an adjustment factor equal to the ratio of the actual total number of deaths across all undisclosed age groups to the total of our initial estimates. This scales the estimates down (or up) so that their sum agrees with the known total while the distribution of deaths across the undisclosed age groups remains unchanged. The formula used to achieve this is shown below.

Final estimate for age group = initial estimate for age group × (actual total deaths across undisclosed age groups / sum of initial estimates across undisclosed age groups)

In the example shown above, the final estimated death count for the less than one year age group is 6.86 (=8*[18/21]) and the final estimate for the one to four year age group is 11.14 (=13*[18/21]).
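The proportional adjustment in this example can be reproduced with a short Python sketch. The function name and the dictionary layout are illustrative assumptions; only the numbers (8, 13, and 18) come from the example in the text.

```python
def rescale_to_known_total(initial_estimates, known_total):
    """Scale initial estimates so that they sum to the known total.

    The adjustment factor is the known total divided by the sum of the
    initial estimates, so the relative distribution is preserved.
    """
    estimated_total = sum(initial_estimates.values())
    factor = known_total / estimated_total
    return {age: est * factor for age, est in initial_estimates.items()}


# Worked example from the text: initial estimates of 8 and 13, known total 18.
adjusted = rescale_to_known_total({"<1": 8, "1-4": 13}, known_total=18)
rounded = {age: round(value, 2) for age, value in adjusted.items()}
# rounded is {"<1": 6.86, "1-4": 11.14}
```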

After the above adjustments are made to the estimated death counts for undisclosed age groups, it is possible (though unlikely) for some estimated death counts to be greater than nine. However, we know this cannot be true because death counts are only suppressed if the number of deaths is less than or equal to nine. Thus, for any undisclosed age group in which our estimated number of deaths was greater than nine, we set the estimate to nine and distributed the remainder across all other undisclosed age groups for which the estimated number of deaths was less than nine, in proportion to the estimated number of deaths in those age groups. Since one iteration of this procedure can result in additional undisclosed age groups with an estimated death count of more than nine, we repeated the procedure until all of the undisclosed age groups had an estimated number of deaths of nine or fewer. The maximum number of iterations required to achieve this result for any population/geography included in our analysis was three.
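The capping procedure described above can be sketched as a small iterative routine. This is a minimal illustration, not the code used for the analysis: the function name, the iteration limit, and the example numbers are assumptions, and the sketch presumes the total is small enough to fit under the cap (at most nine times the number of undisclosed age groups).

```python
def cap_and_redistribute(estimates, cap=9.0, max_iter=100):
    """Cap each estimate at `cap` and spread the excess proportionally.

    Returns (capped_estimates, number_of_iterations_used).
    """
    est = dict(estimates)
    for iteration in range(max_iter):
        over = [age for age, value in est.items() if value > cap]
        if not over:
            return est, iteration
        # Total amount by which the over-cap groups exceed the cap.
        excess = sum(est[age] - cap for age in over)
        for age in over:
            est[age] = cap
        # Spread the excess across groups still below the cap, in
        # proportion to their current estimated deaths.
        under = [age for age, value in est.items() if value < cap]
        base = sum(est[age] for age in under)
        if base == 0:
            raise ValueError("no age group below the cap can absorb the excess")
        for age in under:
            est[age] += excess * est[age] / base
    raise RuntimeError("did not converge within max_iter iterations")


# Made-up example that needs two passes before every group is at or below nine.
capped, passes = cap_and_redistribute({"<1": 12.0, "1-4": 8.0, "5-9": 2.0})
```

In this made-up example the first pass pushes the one to four year group above nine, so a second pass is needed, which mirrors the reason the procedure is repeated.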

Adjusting for deaths of unknown age

The CDC Compressed Mortality Data include a small number of deaths for which the age of the deceased is unknown. To avoid understating death counts by age group, we included these deaths in our estimates. Because we had no reason to believe they were concentrated in any particular age group, we distributed them across the age groups in proportion to the existing distribution of estimated deaths by age group.
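A minimal Python sketch of this proportional allocation follows. The function name and the example counts are illustrative assumptions, not values from the underlying data.

```python
def distribute_unknown_age(deaths_by_age, unknown_deaths):
    """Allocate deaths of unknown age in proportion to known-age deaths."""
    total_known = sum(deaths_by_age.values())
    return {
        age: deaths + unknown_deaths * deaths / total_known
        for age, deaths in deaths_by_age.items()
    }


# Made-up example: 5 deaths of unknown age spread over 100 deaths of known age.
with_unknown = distribute_unknown_age({"<1": 10, "1-4": 30, "5+": 60}, 5)
```

Each age group receives a share of the unknown-age deaths equal to its share of known-age deaths, so the age distribution is unchanged and the total rises by exactly the number of unknown-age deaths.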

Adjusting for racial/ethnic misclassification

Further adjustments were made to account for racial/ethnic misclassification. Researchers at the CDC's Division of Vital Statistics suggest that misclassification of race and Hispanic origin on U.S. death certificates results in a net underestimate of about three percent of total Hispanic deaths and less than one-half of one percent of non-Hispanic Black deaths. The CDC analyzed data from the National Longitudinal Mortality Study, which links individuals in the Current Population Survey (CPS), where race and Hispanic origin are self-reported, with the race and Hispanic origin reported on their death certificates. From these linked data they estimated sex/age-specific ratios of CPS race and Hispanic origin counts to death certificate counts, which they refer to simply as “classification ratios.” They report classification ratios by age group and sex for the non-Hispanic White, non-Hispanic Black, and Hispanic/Latino populations. We adjusted the death counts resulting from the procedures described above to account for racial/ethnic misclassification by multiplying them by the corresponding classification ratios.
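The multiplication step can be illustrated with a brief Python sketch. The ratio values below are invented purely for illustration and are not the published classification ratios; the function name and the default of 1.0 (no adjustment) for age groups without a ratio are also assumptions.

```python
def adjust_for_misclassification(deaths_by_age, classification_ratios):
    """Multiply each death count by its classification ratio.

    Assumption: age groups without a ratio are left unchanged (ratio 1.0).
    """
    return {
        age: deaths * classification_ratios.get(age, 1.0)
        for age, deaths in deaths_by_age.items()
    }


# Illustrative numbers only; these are NOT the CDC's published ratios.
example_deaths = {"25-34": 40.0, "35-44": 60.0, "45-54": 100.0}
example_ratios = {"25-34": 1.05, "35-44": 1.02}
adjusted_deaths = adjust_for_misclassification(example_deaths, example_ratios)
```

A ratio above one raises the death count (the group is under-reported on death certificates), and a ratio below one lowers it.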

Deriving final estimates of life expectancy at birth

The procedures described above resulted in completed abridged life tables for counties, states, and the nation as a whole. To generate abridged life tables for metropolitan areas and custom regions (defined by county groupings), mid-year population and death counts (including all imputations and adjustments described above) were aggregated to the metropolitan area and regional levels, and the formulas to complete the life tables were then applied. To calculate estimated life expectancy at birth, we added 0.5 years to the life expectancy estimate for the less than one year old age group (since the midpoint of that estimate reflects the population age 0.5 years).
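For readers unfamiliar with abridged life tables, the sketch below shows a generic textbook-style calculation in the spirit of Chiang's method, taking (aggregated) death and population counts by age group as input. It is not the authors' exact set of formulas: the function name, the radix of 100,000, and the default assumption that people who die in an interval live half of it on average are all illustrative choices, and the sketch does not include the 0.5-year addition described above.

```python
def abridged_life_expectancy(widths, deaths, population,
                             frac_lived=None, radix=100_000.0):
    """Remaining life expectancy at the start of each age group.

    widths: interval lengths in years; the last must be None (open-ended).
    frac_lived: assumed average fraction of the interval lived by those
        who die in it (defaults to 0.5 for every group, an assumption).
    """
    if widths[-1] is not None or any(w is None for w in widths[:-1]):
        raise ValueError("only the last age group may be open-ended")
    if frac_lived is None:
        frac_lived = [0.5] * len(widths)
    rates = [d / p for d, p in zip(deaths, population)]
    if rates[-1] <= 0:
        raise ValueError("the open-ended age group needs a positive death rate")

    # Probability of dying within each interval (1.0 for the open interval).
    probs = []
    for n, m, a in zip(widths, rates, frac_lived):
        probs.append(1.0 if n is None else n * m / (1.0 + (1.0 - a) * n * m))

    # Survivors at the start of each interval.
    survivors = [radix]
    for q in probs[:-1]:
        survivors.append(survivors[-1] * (1.0 - q))

    # Person-years lived within each interval.
    person_years = []
    for n, m, a, l, q in zip(widths, rates, frac_lived, survivors, probs):
        d = l * q
        person_years.append(l / m if n is None else n * (l - d) + a * n * d)

    # Person-years remaining from each age onward, then life expectancy.
    remaining, running = [], 0.0
    for years in reversed(person_years):
        running += years
        remaining.append(running)
    remaining.reverse()
    return [t / l for t, l in zip(remaining, survivors)]


# Toy check: no deaths in the first year, then a constant rate of 0.01.
toy = abridged_life_expectancy([1, None], [0, 10], [1000, 1000])
```

In the toy check, the open-ended group has a remaining life expectancy of 1/0.01 = 100 years, and the first group adds the one year lived without deaths, giving 101 years.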

Measures taken for quality assurance

Applying death probabilities from higher levels of geography when they are missing for a local geography does amount to an ecological fallacy. The approach finds some justification in the fact that estimated life expectancy exhibits a high degree of spatial autocorrelation. It is also important to point out that remaining life expectancy for any particular age group is a function not only of the probability of dying for that age group but also of the probabilities for all of the older age groups in the distribution. Because younger age groups are far more likely to be undisclosed than older age groups, life expectancy estimates for younger age groups still tend to be based largely upon original, geographically specific information, even when their own death rates are drawn from higher levels of geography.

Notwithstanding the above justifications for applying death rates from higher levels of geography to lower, constituent geographies, we did take measures to avoid reporting highly unreliable estimates – that is, estimates with too many substitutions. Specifically, we only report estimates for which at least 90 percent of the total number of deaths for a population are from age groups that had disclosed death counts in the underlying data and did not require substitution of death probabilities from higher levels of geography. We also only report estimates based on at least 100 total deaths (for all age groups combined).
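The two reporting thresholds can be expressed as a simple check. The function and parameter names are hypothetical; the 90 percent and 100-death thresholds are the ones stated above.

```python
def is_reportable(disclosed_deaths, total_deaths,
                  min_disclosed_share=0.90, min_total_deaths=100):
    """Return True if an estimate meets both reporting thresholds.

    disclosed_deaths: deaths in age groups whose counts were disclosed
        in the underlying data (no substitution required).
    total_deaths: deaths across all age groups combined.
    """
    if total_deaths < min_total_deaths:
        return False
    return disclosed_deaths / total_deaths >= min_disclosed_share


# Made-up examples.
ok = is_reportable(disclosed_deaths=950, total_deaths=1000)           # True
too_imputed = is_reportable(disclosed_deaths=850, total_deaths=1000)  # False
too_small = is_reportable(disclosed_deaths=90, total_deaths=95)       # False
```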