What’s wrong with my current scoring or indexing approach?

Some of the more significant compromises arising from the use of simple scoring-type assessments included:

Without an anchor to absolute risk estimates, the assessment results were useful only in a rather small analysis space. The results offered little information regarding risk-related costs or appropriate responses to certain risk levels. Results expressed in relative numbers were useful for prioritizing and ranking but were limited in their ability to forecast real failure rates or costs of failure. They could not be readily compared to other quantified risks to judge acceptability.
Assessment inputs and results could not be directly validated against actual occurrences of damage or other risk indicators. Even as time passed and experience accumulated, which would normally improve earlier estimates, the scoring models’ inputs generally were not tracked and improved.
Results did not normally produce a time-to-failure, without which there is no technical defense for the scheduling of integrity assessments. Without additional analyses, the scores did not suggest appropriate timing of ILI, pressure testing, direct assessment, or other required integrity verification efforts.
Effects could be masked because simple expressions could not simultaneously show the influence of a large single contributor and the accumulation of lesser contributors. An unacceptably large threat (a very high chance of failure from a certain failure mechanism) could be hidden in the overall failure potential if the contributions from other failure mechanisms were very low. This was because, in some scoring models, failure likelihood approached the highest levels only when all failure modes were coincident. A very high threat from only one or two mechanisms could register only up to its pre-set cap (weighting). In actuality, a single failure mode will often dominate the real probability of failure for a component. Similarly, in the scoring systems, mitigation was generally deemed ‘good’ only when all available mitigations were simultaneously applied. The benefit of a single, very effective mitigation measure was often lost when the maximum benefit from that measure was artificially capped.
Some relative risk assessments were unclear as to whether they were assessing damage potential or failure potential. For instance, the likelihood of corrosion occurring and the likelihood of pipeline failure from corrosion are subtly but importantly different, since damage does not always result in failure. A ‘corrosion score’ from the relative models was often unclear in this respect.
Many previous approaches had limited modeling of interaction of variables, a requirement in some modern regulations. Older risk models often did not adequately represent the contribution of a variable in the context of all other variables. Simple summations would not properly integrate the interactions of some variables.
Some models forced results to parallel previous leak history—maintaining a certain percentage or weighting for corrosion leaks, third party leaks, etc.—even when such history might not be relevant for the pipeline component being assessed.
Balancing or re-weighting was often required when new information became available. Older models attempted to capture risk in terms that represented 100% of the threat, mitigation, or other aspect. The appearance of new information, such as a new mitigation technique, required re-balancing, which in turn made comparison to previous risk assessments problematic.
Some models could only use attribute values that were bracketed into a series of ranges. This created a step-change relationship between the data and risk scores: a small change in an attribute value near a range boundary could shift the score, while a larger change within a single range would not.
Some models allowed only mathematical addition, where other mathematical operations (multiply, divide, raise to a power, etc.) would better parallel underlying engineering models and therefore better represent reality.
Simpler math did not allow order-of-magnitude scales, which better represent real-world risks. Important event frequencies can commonly range, for example, from many times per year to less than a 1-in-ten-million chance per year. A span of that many orders of magnitude was difficult to capture in a scoring model.
An underlying difficulty in the calibration of any scoring-type risk assessment stemmed from the limitations inherent in such methodologies. Since the scoring approaches usually made limited use of probability distributions and equations that truly mirror reality (see previous discussion on limitations), they would rarely track ‘real-world’ experience appropriately. For example, a minor 1-2% change in a risk score may actually have represented an equivalent change (1-2%) in absolute estimates for one threat but a 100-1,000% change for another threat.
Lack of transparency. Ironically, a scoring system added a layer of complexity and interfered with understanding of the basis of the risk assessment. Underlying assumptions and interactions were concealed from the casual observer and required an examination of the ‘rules’ by which inputs were made, consumed by the model, and results generated.
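
The masking effect described above can be illustrated with a small numerical sketch. The weights, threat levels, and annual failure probabilities below are hypothetical values chosen only for illustration; they are not taken from any particular scoring model. The sketch compares a capped, weighted-sum score with an OR-gate combination of independent failure mechanisms, one common way to combine absolute probability estimates.

```python
from math import prod

# Hypothetical caps (weightings) for an illustrative additive scoring model.
# Each threat can contribute at most its weight to a 0-100 relative score.
WEIGHTS = {"corrosion": 25, "third_party": 25, "design": 25, "operations": 25}


def additive_score(threat_levels):
    """Capped weighted sum: each level is clipped to [0, 1] and scaled by its weight."""
    return sum(
        WEIGHTS[name] * min(max(level, 0.0), 1.0)
        for name, level in threat_levels.items()
    )


def combined_pof(annual_pofs):
    """OR-gate for independent failure mechanisms: 1 - product of (1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in annual_pofs.values())


# Segment A: one dominant threat (corrosion at its maximum), all others very low.
# Segment B: every threat at a moderate level.
# Both the relative levels and the paired probabilities are assumed for illustration.
seg_a_levels = {"corrosion": 1.0, "third_party": 0.05, "design": 0.05, "operations": 0.05}
seg_b_levels = {"corrosion": 0.3, "third_party": 0.3, "design": 0.3, "operations": 0.3}

seg_a_pofs = {"corrosion": 1e-1, "third_party": 1e-5, "design": 1e-5, "operations": 1e-5}
seg_b_pofs = {"corrosion": 1e-4, "third_party": 1e-4, "design": 1e-4, "operations": 1e-4}

print("Relative score A:", additive_score(seg_a_levels))
print("Relative score B:", additive_score(seg_b_levels))
print("Combined annual PoF A:", combined_pof(seg_a_pofs))
print("Combined annual PoF B:", combined_pof(seg_b_pofs))
```

Under these assumed inputs, the additive model ranks segment B above segment A because the dominant corrosion threat on segment A cannot contribute more than its cap. The probability-based combination ranks segment A far higher, since a single dominant mechanism drives the overall failure probability.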