Chapter 8: Index Development
The approach taken here for development of an index for assessment is called the multimetric approach. Biological attributes, or metrics, are calculated from the measurements. A score is assigned to each metric corresponding to its deviation from the expected value in reference sites. The multimetric index is the sum of all the metric scores. A separate index is developed for each assemblage sampled (e.g., macrophytes, benthic invertebrates, fish).
The multimetric approach has been successfully applied to assessment of stream fish assemblages (Karr 1981, Karr 1991, Karr et al. 1986) and stream invertebrate assemblages (Ohio EPA 1987, USEPA 1989b, Barbour et al. 1995, Yoder and Rankin 1995). The approach appears to be statistically robust (Fore et al. 1994) and is straightforward to apply. Alternative methods of analysis and assessment are discussed in Appendix E.
Development of a multimetric index is the final step toward operational bioassessment. Three steps are necessary for development of an index: characterization of reference conditions, evaluation and final selection of metrics, and multimetric index building.
The basis of the multimetric approach is comparison of a metric to an expected (reference) distribution of values and a judgement of whether the value is within the expected range. Each metric is given an ordinal score of 5, 3, or 1, depending on whether it is similar to reference values (within the expected range), is somewhat different, or is very different, respectively (Figure 8-1).
Figure 8-1. Basis of bioassessment scores. a. Unimpaired reference sites. b. Population distribution.
The expected range is usually expressed as a percentile of the reference distribution. Two methods of scoring are commonly used. The first is based on a lower percentile of a representative sample of reference sites (Figure 8-1a). The second method is used if predetermined reference conditions are not definable or if there are too few reference sites, and it is preferred for defining reference conditions for reservoirs. In the first method (Figure 8-1a), the 25th percentile of the reference site distribution is often used as the dividing line between optimal (similar to reference) and less than optimal. In the second method (Figure 8-1b), the 95th percentile of the entire population distribution is often used as the reference mark for trisecting metric values (e.g., Karr et al. 1986).
The index consists of the sum of all metric scores, and the total index value of a site is compared to the distribution of index values in reference conditions. Development of an index thus requires characterization of reference conditions to obtain the distributions of metric values, final selection of metrics based on metric response to stressors, and, finally, characterization of the index distribution in reference conditions.
Selection of metrics and development of a multimetric index require a test data set composed of reference sites and nonreference (test) sites. The test sites may be impaired or may simply not meet the criteria for reference sites. Ideally, the test sites should include at least some lakes that are severely impaired by different stressors. If, for example, all test sites are eutrophic lakes, then the response of metrics to other stressors cannot be determined. Reference condition characterization uses only the reference site data; metric evaluation and index development use both reference and test site data.
8.2 Characterization of Reference Condition
The objective of reference characterization is to finalize the classification of the reference sites and to describe (characterize) each of the lake categories in terms of metrics and other descriptive variables.
Several statistical tools can assist in the classification of sites, but there is no one set procedure. If the preliminary classification is relatively certain (based on well-developed prior knowledge), then professional judgment and graphical analysis of metrics, followed by necessary modifications and tests of the resultant classification, are usually sufficient to finalize the classification. If the preliminary classification is less certain, it might be necessary to develop a classification from the data, using one of several classification methods. These methods include cluster analysis and several ordination methods (e.g., principal components analysis, correspondence analysis, multidimensional scaling; Appendix E). Ordination is also useful for visualizing alternative a priori classification schemes.
8.2.1 Graphical Analysis
A key analysis method for biological metrics is graphical display using box-and-whisker plots (e.g., Figure 8-1). In the form used here, the central point is the median value of the variable; the box shows the 25th and 75th percentiles (interquartile range); and the whiskers show the minimum to the maximum values (range). A common alternative is for the whiskers to extend to values within the “inner fence” (see Tukey 1977 for explanation); this method also plots outliers. Box-and-whisker plots are simple, straightforward, and powerful, and the interquartile ranges are used to evaluate whether there is a real difference between two areas and whether a metric is a good candidate for use in assessment. Graphing the data should always be a first step in data analysis.
Statistical methods used by biologists are frequently tests of whether two or more populations have different means using t-tests, analysis of variance, or various nonparametric methods. However, the fundamental problem of biological assessment is not to determine whether two populations (or samples) have a different mean, but to determine whether an individual site (lake) is a member of the least-impaired reference population. If it is not, then a second question is how far it has deviated from that reference. Therefore, biological assessment requires the entire distribution of a metric, which is effectively displayed with a box-and-whisker plot.
In operational bioassessment, metric values below the lower quartile of reference conditions are typically judged impaired (e.g., Ohio EPA 1990). The actual percentile chosen (25, 10, or 5) is arbitrary and reflects the amount of uncertainty a monitoring program can tolerate.
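The effect of the percentile choice can be illustrated with a minimal sketch. The reference values, the site value, and the function names below are hypothetical, and linear interpolation between ranked values is only one of several conventions for computing a percentile.

```python
def percentile(values, p):
    """Linearly interpolated p-th percentile (0-100) of a list of numbers."""
    data = sorted(values)
    pos = (len(data) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def below_reference(site_value, reference_values, p=25):
    """True if the site value falls below the chosen reference percentile."""
    return site_value < percentile(reference_values, p)


# Hypothetical taxa counts at nine reference lakes (illustration only).
reference_taxa = [18, 20, 21, 22, 23, 24, 25, 27, 30]
site_taxa = 20

for p in (25, 10, 5):
    flagged = below_reference(site_taxa, reference_taxa, p)
    print(f"{p}th percentile = {percentile(reference_taxa, p):.1f}, flagged = {flagged}")
```

In this example the same site is flagged under the 25th percentile but not under the 10th or 5th, which is the sense in which the chosen percentile reflects the uncertainty a program is willing to tolerate.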
The preliminary classification is refined through inspection of plotted data (graphical analysis), professional judgment, and statistical tests of final classification hypotheses. First, the values and distributions of metrics are compared among ecoregions or lake types. Regions that appear to be similar to each other can be lumped together for final classification. For two regions to be lumped, most of the metric distributions must be similar. In addition to box plots of metrics, it is also useful to examine scatter plots of selected metrics and habitat variables such as lake size, salinity, or alkalinity. The number of taxa in a waterbody is often dependent on its size; for example, large lakes have more zooplankton species than small lakes (Dodson 1992). Salinity also influences the number of species found in aquatic systems, as do pH and alkalinity.
Refining the Classification
In sampling fish from reservoirs of the Tennessee Valley Authority, the number of fish species was found to vary by reservoir class and ecoregion (Hickman and McDonough 1996). Figure 8-2 (after Hickman and McDonough 1996) shows the number of fish species in different parts of four groups of TVA reservoirs. First, the number of fish taxa is relatively homogenous between forebay, transition, and inflow zones (Figure 8-2). The reservoir types differ in number of fish species, with the mainstream reservoirs having the most species, and the Blue Ridge reservoirs being relatively depauperate. Based on number of species, the Interior Plateau reservoirs are not significantly different from Ridge and Valley reservoirs, and TVA reservoirs could be considered to be in three groups (dotted lines). However, on the basis of other considerations, TVA has kept Interior Plateau reservoirs separate from Ridge and Valley reservoirs.
Figure 8-2. Species richness in TVA reservoirs (redrawn from Hickman and McDonough 1996). Four reservoir classes are shown (mainstream, Interior Plateau, Ridge and Valley, and Blue Ridge). Dashed lines delineate three classes based on species richness alone. FB = forebay; TR = transition; IN = inflow.
Refining the Classification-Covariates
Certain physical or chemical attributes can have a strong influence on biological metrics, especially number of taxa metrics. The most important of the physical-chemical attributes to test are lake size, salinity (in arid regions), and alkalinity or pH. The example (Figure 8-3) shows number of taxa of benthic macroinvertebrates as a function of salinity in the littoral zone of Montana lakes and wetlands (Stribling et al. 1995). Finding a relationship such as that in Figure 8-3 requires adjusting reference expectations as a function of the covariate (salinity, in this case).
Figure 8-3. Benthic macroinvertebrate taxa richness in littoral zone of Montana lakes and wetlands.
8.3 Index Development
Following classification and characterization of reference conditions, metrics are evaluated for suitability in a multimetric index. Suitable metrics are those that respond in a predictable way to stressors on the system and that have low noise or variability.
8.3.1 Metric Variability
Metrics that are too variable within the reference sites are unlikely to be effective for assessment. A measure of metric variability is the ratio of the interquartile range to the distance between the lower quartile and the minimum possible value of the metric.
In operational bioassessment, metric values below the lower quartile of reference conditions are typically judged as not meeting reference expectations (e.g., Ohio EPA 1990). The range from 0 to the lower quartile can be termed a “scope for detection.” For those metrics with low values under reference conditions and high values under impaired conditions, the scope for detection is the range from the 75th percentile to the maximum possible value (e.g., 100 percent) (Figure 8-4).
Figure 8-4. Assessing candidate metrics. a. Metrics that have high values under unimpaired conditions. b. Metrics that have low values under unimpaired conditions.
The larger the scope for detection, compared to the interquartile range, the easier it will be to detect deviation from the reference condition. The “interquartile coefficient” is thus defined here as the ratio of the interquartile range to the scope for detection. The interquartile coefficient is analogous to the coefficient of variation and is used the same way, but it is bidirectional and uses percentiles in the same way that assessment uses percentiles. In general, an interquartile coefficient greater than 1 indicates excessive variability of a metric.
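The interquartile coefficient defined above can be computed directly from reference data. The following is a minimal sketch: the function name, parameter names, and example values are hypothetical, and the quartiles use the "inclusive" interpolation of Python's statistics module, which is one convention among several.

```python
import statistics


def interquartile_coefficient(reference_values, decreases_with_stress=True,
                              min_possible=0.0, max_possible=100.0):
    """Ratio of the interquartile range to the scope for detection.

    decreases_with_stress=True: high values in reference sites, so the scope
    runs from the minimum possible value up to the lower quartile.
    decreases_with_stress=False: low values in reference sites, so the scope
    runs from the upper quartile up to the maximum possible value.
    """
    q1, _, q3 = statistics.quantiles(reference_values, n=4, method="inclusive")
    scope = (q1 - min_possible) if decreases_with_stress else (max_possible - q3)
    if scope <= 0:
        raise ValueError("no scope for detection for this metric")
    return (q3 - q1) / scope


# Hypothetical reference data (illustration only).
taxa_richness = [18, 20, 21, 22, 23, 24, 25, 27, 30]   # high when unimpaired
pct_dominant = [10, 12, 15, 18, 20, 22, 25, 30, 40]    # low when unimpaired
noisy_metric = [2, 5, 10, 20, 30, 40, 50, 60, 80]      # high when unimpaired

print(interquartile_coefficient(taxa_richness))
print(interquartile_coefficient(pct_dominant, decreases_with_stress=False))
print(interquartile_coefficient(noisy_metric))   # above 1: too variable
```

The first two metrics have small coefficients and would be retained on variability grounds; the third exceeds 1 because its interquartile range is wider than the room left below the lower quartile.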
8.3.2 Metric Response
Response of metrics to stresses is evaluated by comparison of reference sites to test sites. The simplest comparison is using box-and-whisker plots of the metric distribution in reference and test sites (Figure 8-5).
Figure 8-5. Responsiveness of metrics. A large difference between reference and impaired test sites indicates a responsive metric. Unknown sites are a mixture of impaired and unimpaired sites.
Alternatively, it may be possible to develop an empirical model of metric response to stressors. Several approaches are available including multiple regression, canonical correlation, canonical correspondence analysis, and log-linear models (Ludwig and Reynolds 1988, Jongman et al. 1987). For multivariate model building, refer to the above references or any statistical software package; model building is not outlined further in this document.
Variability and Uncertainty
Variability in values of measurements and metrics results in uncertainty of the assessment. Uncertainty can be reduced by increasing the sampling effort (repeated measurement) to obtain a better estimate of the mean value. This is especially important for the measurements that are the most variable: chlorophyll, nutrient concentration, phytoplankton, and zooplankton. Algal abundance and biomass may vary tenfold within the growing season (e.g., Wetzel 1975, Hecky and Kling 1981). A tenfold change in chlorophyll corresponds to 22.6 points in the TSI range, a substantial change.
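The 22.6-point figure can be checked with a short calculation. The sketch below assumes the chlorophyll TSI follows Carlson's (1977) formulation, TSI = 9.81 ln(chlorophyll) + 30.6 with chlorophyll in micrograms per liter; this formula is not stated in this chapter and is supplied here as an assumption.

```python
import math


def tsi_chlorophyll(chl_ug_per_l):
    """Chlorophyll trophic state index, assuming Carlson's (1977) equation."""
    return 9.81 * math.log(chl_ug_per_l) + 30.6


# Because the index is logarithmic, any tenfold change shifts it by the same
# amount, 9.81 * ln(10), whatever the starting concentration.
low, high = 2.0, 20.0   # hypothetical seasonal minimum and maximum
print(round(tsi_chlorophyll(high) - tsi_chlorophyll(low), 1))
print(round(9.81 * math.log(10), 1))
```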
Because of this variability, Tier 1A is unreliable for assessment of an individual lake and Tier 2A is recommended. Tier 1A is appropriate for assessing a class of lakes or a region, to answer questions such as: what is the status of lakes in the region, or how many lakes are oligotrophic?
As long as many lakes are sampled, the effect of errors in individual lakes is reduced in the evaluation of all lakes.
Metrics are judged responsive if there are significant differences in central tendency or in variance between reference and test sites (Figure 8-5). If the test sites are known to be impaired, then the mean or median values should be significantly different (Figure 8-5). If the test sites are simply lakes that do not meet reference criteria (i.e., they might be a mix of impaired and unimpaired lakes; shown as “unknown test sites” in Figure 8-5), then the variance in the test sites should be larger than that in the reference sites.
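One way to examine these two kinds of response is sketched below. The text does not prescribe a particular significance test; the permutation test on medians is one possible choice, the variance ratio is shown only as a descriptive statistic, and all data and names are hypothetical.

```python
import random
import statistics


def permutation_p_value(reference, test, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in medians."""
    rng = random.Random(seed)
    observed = abs(statistics.median(reference) - statistics.median(test))
    pooled = list(reference) + list(test)
    n_ref = len(reference)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.median(pooled[:n_ref]) -
                   statistics.median(pooled[n_ref:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical taxa counts (illustration only).
reference = [22, 24, 25, 26, 27, 28, 30, 31]
impaired = [8, 10, 12, 13, 15, 16, 18, 20]    # known impaired test sites
unknown = [10, 14, 18, 22, 25, 27, 29, 31]    # mix of impaired and unimpaired

p_impaired = permutation_p_value(reference, impaired)
variance_ratio = statistics.variance(unknown) / statistics.variance(reference)
print(f"p (reference vs. impaired medians) = {p_impaired:.4f}")
print(f"variance ratio (unknown / reference) = {variance_ratio:.1f}")
```

A small p-value for the known impaired sites, or a variance ratio well above 1 for the unknown sites, would each point toward a responsive metric; a formal test of the variance difference would still be needed before drawing that conclusion.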
Metrics that are responsive to known or unknown stresses are retained for index development. Finally, responsive metrics are evaluated for redundancy. A metric that is highly correlated with another metric might not contribute new information to the assessment. Pairs of metrics with correlation coefficients greater than 0.9 should be examined carefully to determine whether both metrics are necessary. Often, strongly correlated metrics are calculated from the same raw data, or their method of calculation ensures correlation. For example, Shannon-Wiener diversity and percent abundance of the dominant taxon are strongly correlated in any data set.
A correlation alone (say, r >0.6) is not sufficient to eliminate one of a pair of correlated metrics. Some metrics might be sensitive only at severe or moderate stress; others might be sensitive across the entire range of stresses (Karr 1991). These would all contribute information, in spite of strong correlation. A scatterplot of correlated metrics is examined; if there is an apparent nonlinear or curved relationship, then both should be retained. If the points all fall close to a straight line, then one of the metrics can be safely eliminated.
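The screening step can be sketched as follows. The code only flags pairs whose correlation exceeds 0.9 so that their scatterplots can be examined; it does not decide which metric to drop. Metric names and values are hypothetical.

```python
import itertools
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_redundant_pairs(metrics, threshold=0.9):
    """Return metric pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for name_a, name_b in itertools.combinations(metrics, 2):
        if abs(pearson_r(metrics[name_a], metrics[name_b])) > threshold:
            flagged.append((name_a, name_b))
    return flagged


# Hypothetical metric values at six sites (illustration only).
metrics = {
    "taxa_richness": [10, 14, 18, 22, 26, 30],
    "ept_taxa": [3, 5, 6, 8, 10, 11],
    "pct_dominant": [60, 35, 50, 20, 45, 25],
}

print(flag_redundant_pairs(metrics))
```

Only the flagged pairs need a scatterplot; a curved relationship in that plot would argue for keeping both metrics despite the high correlation.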
8.3.3 Scoring and Index Development
Combining unlike measurements is possible only when the values have been standardized by a transformation through which measurements become unitless (Schuster and Zuuring 1986). Standardization of these measurements into a logical progression of scores is the typical means for comparing and interpreting unlike metric values.
Two methods are commonly used for scoring metrics, which are based on the metric distribution in defined reference sites or in the population of sites, respectively. Each metric is given a score of 1, 3, or 5, corresponding to impaired, intermediate, or unimpaired biota, respectively (Figure 8-1).
Bisection scoring - (Figure 8-1a) Based on a lower percentile of the reference distribution; for example, the 25th percentile (Barbour et al. 1996b). In this method, values above the 25th percentile are considered unimpaired (similar to reference conditions) and values below the 25th percentile are considered impaired to some degree. The range from 0 to the 25th percentile is bisected, with values in the top half receiving a score of 3 and those in the bottom half receiving a score of 1 (Figure 8-1a).
Trisection scoring - (Figure 8-1b) Based on the 95th percentile of the population distribution (Karr et al. 1986). Metric values from 0 (or the lowest possible value) to the 95th percentile are trisected; values in the top one-third receive a 5, values in the middle third receive a 3, and values in the bottom third receive a 1 (most impaired).
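The two scoring methods can be sketched as follows for a metric that decreases with stress. Function names and data are hypothetical, values falling exactly on a cutoff are assigned to the higher score (an assumption, since the text does not say), and the percentiles use the "inclusive" interpolation of Python's statistics module.

```python
import statistics


def score_bisection(value, reference_values, minimum=0.0):
    """Score 5, 3, or 1 from the 25th percentile of the reference sites."""
    q1 = statistics.quantiles(reference_values, n=4, method="inclusive")[0]
    if value >= q1:
        return 5
    midpoint = minimum + (q1 - minimum) / 2.0
    return 3 if value >= midpoint else 1


def score_trisection(value, population_values, minimum=0.0):
    """Score 5, 3, or 1 by trisecting the range below the 95th percentile."""
    # With n=20 the cut points are 5%, 10%, ..., 95%; index 18 is the 95th.
    p95 = statistics.quantiles(population_values, n=20, method="inclusive")[18]
    third = (p95 - minimum) / 3.0
    if value >= minimum + 2.0 * third:
        return 5
    if value >= minimum + third:
        return 3
    return 1


# Hypothetical taxa counts (illustration only).
reference_sites = [18, 20, 21, 22, 23, 24, 25, 27, 30]   # 25th percentile = 21
all_sites = list(range(2, 44, 2))                        # 95th percentile = 40

print([score_bisection(v, reference_sites) for v in (23, 15, 8)])
print([score_trisection(v, all_sites) for v in (30, 20, 10)])
```

For metrics that increase with stress, the same logic would be applied to the range between the upper percentile and the maximum possible value.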
The scoring method should reflect how well the reference sites represent unimpaired conditions. If reference sites are unimpaired and considered to be representative, bisection is recommended (Figure 8-1a). This method assumes that the reference sites are representative of relatively unimpaired conditions and that the metric distribution reflects natural variation of the metric. A value above the cutoff is then assumed to be similar to reference conditions. The lower quartile (25th percentile) is most frequently taken as the cutoff (e.g., Barbour et al. 1996b).
The trisection method (Figure 8-1b) is best for scoring in regions where impacts might be so pervasive that nearly all reference sites are thought to be impacted or for assessment of reservoirs where reference sites cannot be defined. In trisection, it is assumed that at least some reference lakes attain an excellent value for the metric, but that many reference lakes are impaired and hence the lower limit of the reference distribution is not known. The 95th percentile is taken as the “best” value, and the range is trisected below it (Figure 8-1b). Choice of scoring method should be based on confidence in the reference sites, rather than on the method that will produce the most conservative or most liberal scoring. If confidence is high that reference sites are representative of relatively unimpaired conditions, then the lower percentile cutoff and bisection are preferred. If confidence is low, then trisection below the 95th percentile is preferred.
If covariates such as lake size determine metric values, then the scoring should be adjusted for the covariates. Reference data are plotted as in Figure 8-6, and a locally weighted estimate is made of the appropriate percentile (95th or 25th) and the range below it is trisected or bisected accordingly. Figure 8-6 shows total zooplankton taxa in North American lakes ranging in size from 4 m² to nearly 10¹¹ m² (Lake Superior) (Dodson 1992). Few state assessment programs are likely to include lakes smaller than 10⁴ m² (1 ha; 2.47 acres) or larger than 10⁹ m² (1000 km²; 247,000 acres). In this example, considering only the middle range from 10⁴ m² to 10⁹ m², the slope is not apparent and adjusting for lake area would not be necessary.
Figure 8-6. Total crustacean and zooplankton taxa in North American lakes (redrawn after Dodson 1992). If metrics show a relationship such as this with area, elevation, or some other physical covariate, then reference expectations must be adjusted to the covariate. The three lines show one possible method for scoring. In practice, most state assessment programs are not likely to span 10 orders of magnitude in lake area.
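A covariate-adjusted cutoff can be sketched with a simple moving window on log-transformed lake area. This is a crude stand-in for the locally weighted estimate mentioned above, not the method used to draw Figure 8-6; the window width, function names, and data are all hypothetical.

```python
import math


def percentile(values, p):
    """Linearly interpolated p-th percentile (0-100) of a list of numbers."""
    data = sorted(values)
    pos = (len(data) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def local_percentile(ref_areas_m2, ref_values, area_m2, p=25, half_width=0.75):
    """Percentile of reference values from lakes of similar size.

    Only reference lakes whose log10(area) lies within half_width of the
    target lake's log10(area) contribute to the estimate.
    """
    target = math.log10(area_m2)
    nearby = [v for a, v in zip(ref_areas_m2, ref_values)
              if abs(math.log10(a) - target) <= half_width]
    if len(nearby) < 3:
        raise ValueError("too few reference lakes of similar size")
    return percentile(nearby, p)


# Hypothetical reference lakes: area in square meters and zooplankton taxa.
areas = [1e4, 3e4, 1e5, 3e5, 1e6, 3e6, 1e7, 3e7, 1e8]
taxa = [5, 6, 7, 8, 9, 10, 12, 13, 15]

print(local_percentile(areas, taxa, 1e5))   # cutoff for a 10 ha lake
print(local_percentile(areas, taxa, 1e7))   # cutoff for a 1000 ha lake
```

The cutoff rises with lake area in this made-up data set, so a small lake is not penalized for having fewer taxa than a large one. The range below each local cutoff would then be bisected or trisected as described in the text.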
The index is the sum of the scores of the selected metrics. The number of metrics in an index affects the variability of the index: indices with more metrics tend to be less variable (Karr 1991). Index values are evaluated by comparison to the distribution of index values in the reference sites; even the best reference sites do not receive perfect index scores. Criteria for assessment are based on this distribution. Sites whose index values fall within the range of index values in reference sites support life use; those that are clearly below index values in reference sites do not support life use. Following appropriate review and revision, these criteria can be established as biocriteria.
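Summing the metric scores and comparing the result with the reference distribution can be sketched as follows. The metric names, the scores, and the use of the 25th percentile of reference index values as the comparison point are assumptions made for illustration; the text leaves the actual criterion to review by the program.

```python
import statistics


def multimetric_index(metric_scores):
    """Sum of metric scores, each of which must be 1, 3, or 5."""
    for name, score in metric_scores.items():
        if score not in (1, 3, 5):
            raise ValueError(f"metric {name!r} has invalid score {score}")
    return sum(metric_scores.values())


# Hypothetical scores for one test lake (illustration only).
site_scores = {
    "taxa_richness": 5,
    "ept_taxa": 3,
    "pct_dominant": 3,
    "pct_tolerant": 1,
}

# Hypothetical index values already computed for eight reference lakes.
reference_index_values = [14, 16, 16, 18, 18, 18, 20, 20]

site_index = multimetric_index(site_scores)
cutoff = statistics.quantiles(reference_index_values, n=4, method="inclusive")[0]
print(site_index, cutoff, site_index >= cutoff)
```

Note that none of the hypothetical reference lakes reaches the maximum of 20 consistently, in line with the observation that even the best reference sites do not score perfectly.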
8.4 Lake Tier Indices
An index is calculated for each assemblage sampled. Each tier has three to six indices, which should all be reported. The indices can be summed into an overall lake index, which can be used to report overall condition but would not reveal the condition of the component assemblages. Indices within each tier might or might not be multimetric; Tier 1 indices are primarily single metrics, whereas indices of Tiers 2A and 2B might be composed of 3 to 12 metrics.
8.4.1 Tier 1
Tier 1 assessment consists of trophic state, algal growth potential, and macrophyte indices. Three TSIs (chlorophyll, Secchi depth, and total phosphorus) are recommended; a fourth (total nitrogen) is also recommended in regions where nitrogen is suspected to be a limiting nutrient for algal growth. The TSIs and AGPT are scored as metrics for their similarity to reference conditions, and the scores are summed for a “Trophic Reference Index.”
The trophic metrics are unique in that they may be scored lower if their values are substantially higher as well as lower than reference values. For example, an unproductive (oligotrophic) lake in a region where lakes are expected to be productive (mesotrophic) would be given a lower score.
Tier 1 has two or more submerged macrophyte metrics, including percent cover of macrophytes and dominance of exotic species. More metrics can be developed if macrophyte species are identified and relative abundances are estimated. Percent cover is scored by comparison to reference expectations, but dominance of exotic species is rated 5 if none are present, 3 if exotics are subdominant, and 1 if exotics are dominant. The two macrophyte metrics are summed for the Tier 1 macrophyte index.
Lakes are assessed from the scores of the two Tier 1 indices. Tier 1A and Tier 1B use the same metrics and indices; Tier 1B trophic metrics are estimated from seasonal mean measurements. Biocriteria can be established for further investigation or remedial action, based on the scores.
8.4.2 Tier 2A
Tier 2A assessment may consist of three to five indices:
- Trophic reference index of either Tier 1A or 1B.
- Macrophyte index of Tier 1 or a more detailed Tier 2A macrophyte index.
- Benthic macroinvertebrate index.
- Fish assemblage index.
- Sedimented diatom index.
The macroinvertebrate, fish, and diatom indices are developed from metrics as described in Chapter 6.
8.4.3 Tier 2B
Tier 2B consists of three to five indices:
- Trophic reference index of Tier 1B (seasonal averages).
- Macrophyte index of Tier 1.
- Phytoplankton index.
- Zooplankton index.
- Periphyton index.
The phytoplankton, zooplankton, and periphyton indices are developed from metrics as described in Chapter 6.
Case Study: TVA Scoring Criteria and Index Development
The classification scheme used to develop expectations for chlorophyll in Tennessee Valley reservoirs was based on the “natural” nutrient level in a watershed. Professional judgment was used to select concentrations considered indicative of good, fair, and poor conditions. Reservoirs were placed into one of two classes for chlorophyll expectations: those expected to be oligotrophic because they are in watersheds with naturally low nutrient concentrations, and those expected to be mesotrophic because they are in watersheds which naturally have greater nutrient availability. The reservoirs expected to be oligotrophic are those in the Blue Ridge Ecoregion. The remaining reservoirs, both mainstream reservoirs and tributary reservoirs, are expected to be mesotrophic.
The range of concentrations selected to represent good, fair, and poor conditions is much lower for reservoirs in nutrient-poor areas (e.g., Blue Ridge) than for the other reservoirs. For reservoirs expected to be mesotrophic, the concern is that chlorophyll levels not become too great because of the associated undesirable conditions—dense algal blooms, poor water clarity, low DO, and noxious blue-green algae. Conversely, in cases where sufficient nutrients are available but chlorophyll concentrations remain low, there is likely something hindering this natural process. This is the reason for identifying a minimum level for the “good” range of expectations for mesotrophic reservoirs.
The sediment quality scoring criteria use sediment chemical analyses for ammonia, heavy metals, pesticides, and PCBs.
Seven assemblage characteristics (or metrics) were selected to evaluate the benthic macroinvertebrate assemblage. Six of the metrics are an average of the 10 samples taken at each site.
Scoring criteria for each of the seven metrics were developed using the 5 years of Vital Signs monitoring data (1994-1996). Scoring ranges were developed as follows:
Professional judgment and supplementary statistical analyses were used to adjust the cutoffs for each range as appropriate. Sample results at each site were compared with these criteria for each metric and assigned a rating of 5 (good), 3 (fair), or 1 (poor) if they fell within the top, middle, or bottom group, respectively. Numerical ratings for the seven metrics were then summed. This resulted in a minimum score of 7 if all metrics at a site were poor, and a maximum score of 35 if all metrics were good.
Reservoir Fish Assemblage Index
The current RFAI uses 12 fish assemblage metrics from five general categories, including:
Species Richness and Composition
Establishing scoring criteria (reference conditions) by trisecting observed conditions requires a substantial data base for each class of reservoir and assumes the data base contains reservoirs with conditions ranging from poor to good for each metric. The smaller the number of reservoirs within a class, the less likely these assumptions can be met and the greater the need for sound professional judgment based on extensive knowledge of the reservoir assemblages being studied.
Because some reservoir classes contained relatively few reservoirs, the approach used to develop scoring criteria for RFAI was to include all sampling results from Vital Signs monitoring (1990-1994). A slightly different approach was used for species richness metrics than for abundance and proportional metrics. For species richness metrics, a list was made of all species collected from comparable locations within a reservoir class from 1990 to 1994. This species list was adjusted using inferences of experienced biologists knowledgeable of the reservoir system, resident fish species, susceptibility of each species to collection methods being used, and effects of human-induced impacts on these species. This effort resulted in a list of the maximum number of species expected to occur at a sampling location and be captured by collection devices in use. Given that samples are collected once each year, this maximum number of species would not be expected to be represented in that one collection. Therefore, the range from 0 to 95 percent of the maximum was trisected to provide the three scoring ranges (good, fair, and poor). Although 95 percent of the maximum number of species at a site would not be expected to be collected in one sampling event, this “high” expectation was adopted to keep these metrics conservative in light of potential uncertainties introduced by relying heavily on professional judgement.
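The species richness scoring described above can be sketched as follows. The maximum expected species count of 30 is a made-up value, the function name is hypothetical, and assigning boundary values to the higher score is an assumption.

```python
def richness_score(species_collected, max_expected_species):
    """Score 5 (good), 3 (fair), or 1 (poor) for a species richness metric.

    The range from 0 to 95 percent of the maximum expected number of species
    is divided into three equal parts.
    """
    upper = 0.95 * max_expected_species
    third = upper / 3.0
    if species_collected >= 2.0 * third:
        return 5
    if species_collected >= third:
        return 3
    return 1


# With a hypothetical maximum of 30 expected species, the cutoffs fall at
# about 9.5 and 19 species.
print([richness_score(n, 30) for n in (21, 12, 6)])
```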
Scoring criteria for proportional metrics and the abundance metric were determined by trisecting observed ranges after omitting outliers. Next, cutoff points between the three ranges were adjusted based on examination of frequency distributions of observed data for each metric along with professional judgment. In some cases, the narrow range of observed conditions required further adjustment based on knowledge of metric responses to human-induced impacts observed in other reservoir classes. Scoring criteria for the fish health metric are those described by Karr et al. (1986).
To develop metric scores for number of taxa, reproductive composition, and fish health metrics, electrofishing and experimental gill net sampling results were pooled prior to scoring. For abundance and proportional metrics, electrofishing and gill netting results were scored separately, then the two scores averaged to arrive at a final metric value. These scoring criteria separated sites into three categories assumed to represent relative degrees of degradation. Sample results are compared to these reference conditions and assigned a corresponding value: good = 5, fair = 3, and poor = 1.
To arrive at an overall health evaluation for a reservoir, the ratings from all sites are totaled, divided by the maximum potential rating for that reservoir, and expressed as a percentage. For example, for a small reservoir with only one sample site, the minimum health evaluation would be 20% (all five indicators rated poor, or 1, for a total score of 5 divided by the maximum possible total of 25) and the maximum would be 100% (all five indicators rated good, or 5). This same range of 20 to 100 percent applies to all reservoirs regardless of the number of sample sites, and the same calculation process is used.
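The percentage calculation can be sketched as follows, assuming every indicator at every site is rated on a scale from 1 (poor) to 5 (good). The function name and the two-site example are hypothetical.

```python
def health_evaluation(site_ratings):
    """Overall reservoir health as a percentage of the maximum possible total.

    site_ratings is a list with one entry per sample site; each entry is the
    list of indicator ratings (1 to 5) at that site.
    """
    total = sum(sum(ratings) for ratings in site_ratings)
    maximum = sum(5 * len(ratings) for ratings in site_ratings)
    return 100.0 * total / maximum


print(health_evaluation([[1, 1, 1, 1, 1]]))   # one site, all poor
print(health_evaluation([[5, 5, 5, 5, 5]]))   # one site, all good
print(health_evaluation([[5, 3, 5, 3, 1],     # hypothetical two-site reservoir
                         [3, 3, 5, 5, 1]]))
```

Because the total is divided by the maximum attainable for the same number of sites, the result always falls between 20 and 100 percent, whatever the number of sites.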
The next step is to divide the 20 to 100 percent scoring range into categories representing good, fair, and poor ecological health conditions. This has been achieved as follows:
- Results are plotted and examined for apparent groupings.
- Groupings are compared to known, a priori conditions (focusing on reservoirs with known poor conditions), and good-fair and fair-poor boundaries are established subjectively.
- The groupings are compared to a trisection of the overall scoring range. A scoring range is adjusted up or down a few percentage points to ensure a reservoir with known conditions falls within the appropriate category. This is done only in circumstances where a nominal adjustment is necessary.
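The starting point for these boundaries, a plain trisection of the 20 to 100 percent range, can be sketched as follows. The function names and the example of a shifted boundary are hypothetical; the actual adjustments rest on professional judgment and are not reproduced here.

```python
def trisect_range(low=20.0, high=100.0):
    """Return the fair-poor and good-fair boundaries of an even trisection."""
    third = (high - low) / 3.0
    return low + third, low + 2.0 * third


def health_category(percent, fair_poor, good_fair):
    """Classify a health evaluation given the two category boundaries."""
    if percent >= good_fair:
        return "good"
    if percent >= fair_poor:
        return "fair"
    return "poor"


fair_poor, good_fair = trisect_range()
print(round(fair_poor, 1), round(good_fair, 1))
print(health_category(72.0, fair_poor, good_fair))

# A nominal, hypothetical adjustment: lowering the good-fair boundary by two
# percentage points moves a reservoir scoring 72 percent into "good".
print(health_category(72.0, fair_poor, good_fair - 2.0))
```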
These methods have been in use for 6 years. Each year slight modifications are made in the original evaluation process and the numerical scoring criteria for each of the five ecological health indicators (Table 8-1) based on experience gained from working with this process, review of the evaluation scheme by other professionals, and results of another year of monitoring. As a result, scoring ranges have changed slightly over the years. Low DO and poor benthos quality contributed most to poor scores among tributary reservoirs in 1994 (Figure 8-7). Reservoir health ratings also differed among ecoregions (Figure 8-8), with run-of-river reservoirs typically scoring highest.
Table 8-1. Example of TVA’s computational method for evaluation of reservoirs: Wilson Reservoir 1994 (run-of-the-river reservoir).
Figure 8-7. Overall ecological condition of tributary reservoirs in the Tennessee Valley in 1994.
Figure 8-8. TVA ecological condition summary, 1994.