Introduction

Pipeline risk algorithms can take in hundreds of variables and perform hundreds more calculations to determine the probability of failure for a given segment of pipe. Determining which threats are the biggest contributors to a system’s risk, or even to the risk along the length of a single line, can be challenging. A given threat might be high in one location and then be exceeded by another a short distance away. It is like trying to predict the height of waves that rise and fall, seemingly at random. Trying to do this by looking over a large table of numbers is an exercise in frustration and tedium. The problem is further complicated by the fact that the interaction of threats can drive the overall probability of failure (PoF).

Data

To start, here is a random sample from a set of risk results covering one million dynamic segments. It includes the PoFs for the nine threats and an overall PoF for each segment. As you will typically see in a diverse system, threats that are high in one location may be considerably below average in another. So if I wanted to know which threats correlate with higher risk, it would be an exercise in futility to comb through one million records and try to pull meaning out of them. Fortunately, there are methods that allow us to explore the data set and find trends without manually sifting through enormous tables of information. The technique that is the main focus of this article is a widely used machine learning tool from data mining: the decision tree.

Sample Data Table
   EC     IC    SCC    Man    Con     TP     Eq     IO   Nhaz    PoF
 0.90   0.04   0.61   1.78   0.13   1.40   2.36   0.91   4.02   7.67
 0.16   5.64   0.00  11.20   0.30   9.63   3.00   0.12   0.17  15.12
22.75   7.83   3.98  15.61   0.52   8.46   5.14   0.66   5.67  41.35
 0.11   1.03   0.03  17.96   0.79  16.43   7.73   0.78   1.77  19.54
 7.80   2.03   2.27   5.81   0.35  42.84   3.61   0.94   7.50  53.77
 0.02   5.12   0.00   0.72   0.06   0.01   1.09   0.26   0.10   5.47
 5.06   9.43   0.82   2.54   0.21   0.43   2.04   0.55   1.48  17.10
 5.60   8.67   4.19   9.24   0.31   0.53   3.01   0.95   8.65  25.66
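
To make the point about shifting threats concrete, here is a minimal Python sketch that reports which threat is largest in each of the eight sample rows. The names THREATS, ROWS, and dominant_threat are made up for this illustration and are not part of any risk software; the numbers are copied from the sample table.

```python
# Column names and rows copied from the Sample Data Table.
# The last value in each row is the overall PoF for that segment.
THREATS = ["EC", "IC", "SCC", "Man", "Con", "TP", "Eq", "IO", "Nhaz"]
ROWS = [
    [0.90, 0.04, 0.61, 1.78, 0.13, 1.40, 2.36, 0.91, 4.02, 7.67],
    [0.16, 5.64, 0.00, 11.20, 0.30, 9.63, 3.00, 0.12, 0.17, 15.12],
    [22.75, 7.83, 3.98, 15.61, 0.52, 8.46, 5.14, 0.66, 5.67, 41.35],
    [0.11, 1.03, 0.03, 17.96, 0.79, 16.43, 7.73, 0.78, 1.77, 19.54],
    [7.80, 2.03, 2.27, 5.81, 0.35, 42.84, 3.61, 0.94, 7.50, 53.77],
    [0.02, 5.12, 0.00, 0.72, 0.06, 0.01, 1.09, 0.26, 0.10, 5.47],
    [5.06, 9.43, 0.82, 2.54, 0.21, 0.43, 2.04, 0.55, 1.48, 17.10],
    [5.60, 8.67, 4.19, 9.24, 0.31, 0.53, 3.01, 0.95, 8.65, 25.66],
]


def dominant_threat(row):
    """Return the name of the threat with the largest PoF in one row."""
    threat_values = row[:len(THREATS)]  # drop the overall PoF column
    return THREATS[threat_values.index(max(threat_values))]


dominant = [dominant_threat(row) for row in ROWS]
for row, name in zip(ROWS, dominant):
    print(f"overall PoF {row[-1]:6.2f}  largest threat: {name}")
```

Even in this tiny sample, five different threats take the lead across eight segments, which is why scanning the full table by eye does not scale.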

Data Mining and Machine Learning

Let’s start by explaining what those two terms mean and then move into the specifics of decision trees. Data mining is exactly what the name implies: a broad class of methods used to extract meaningful trends from raw data. Histograms and linear regression are simple examples. Machine learning takes data mining a step further by having the computer “learn” patterns without being explicitly programmed to find them. A concrete example is a movie streaming service that recommends new movies you might like based on what you’ve watched in the past. Using these tools, we can learn what is driving the overall risk of a system or part of a system. Later on we will extend the process to infer which variables have the largest effect on an individual threat for a given line.

Decision Trees

The best way to introduce decision trees is with an initial example that is easy to understand. The following decision tree is based on a data set of the passengers of the Titanic. Think of it as an upside-down tree, with the root at the top and the leaves at the bottom. Reading from the top down, we can see that 38% (0.38) of all passengers on the Titanic survived. Moving down the tree shows how much difference sex, passenger class (pclass) and age made to the survival rate. Males accounted for 64% of the passengers, and just by being male your chances of survival were cut in half, from 38% to 19%. At the next decision node on the male=yes branch, if your age was greater than 9.5 your chances dropped further, to 17%. Notice that passenger class does not show up on this branch of the tree. This implies that, for male passengers, passenger class had no effect on the chances of survival large enough for the tree to split on. Similarly, on the male=no branch, age doesn’t show up as a splitting variable but passenger class does. The tree could be extended to include other information, such as number of siblings or embarkation location, to find further interactions between variables, but for the purposes of introduction we’ll stop here and move on to more concrete examples using pipeline risk data.
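
Each decision node in a tree like the Titanic one is a threshold test on a single variable. As a rough sketch of how one such test could be chosen for the pipeline data, the Python below searches the eight sample rows for the single threat and threshold that best separate low overall PoF from high overall PoF. It assumes a squared-error criterion, which is one common choice for numeric targets; the article does not say which algorithm produced its trees. The names sse and best_split are hypothetical helpers for this illustration, and eight rows are far too few to draw conclusions from.

```python
# Column names and rows copied from the Sample Data Table.
# The last value in each row is the overall PoF for that segment.
THREATS = ["EC", "IC", "SCC", "Man", "Con", "TP", "Eq", "IO", "Nhaz"]
ROWS = [
    [0.90, 0.04, 0.61, 1.78, 0.13, 1.40, 2.36, 0.91, 4.02, 7.67],
    [0.16, 5.64, 0.00, 11.20, 0.30, 9.63, 3.00, 0.12, 0.17, 15.12],
    [22.75, 7.83, 3.98, 15.61, 0.52, 8.46, 5.14, 0.66, 5.67, 41.35],
    [0.11, 1.03, 0.03, 17.96, 0.79, 16.43, 7.73, 0.78, 1.77, 19.54],
    [7.80, 2.03, 2.27, 5.81, 0.35, 42.84, 3.61, 0.94, 7.50, 53.77],
    [0.02, 5.12, 0.00, 0.72, 0.06, 0.01, 1.09, 0.26, 0.10, 5.47],
    [5.06, 9.43, 0.82, 2.54, 0.21, 0.43, 2.04, 0.55, 1.48, 17.10],
    [5.60, 8.67, 4.19, 9.24, 0.31, 0.53, 3.01, 0.95, 8.65, 25.66],
]


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, names):
    """Return (threat, threshold, score) for the split with the lowest
    combined squared error of overall PoF on the two sides."""
    best = None
    for col, name in enumerate(names):
        levels = sorted({row[col] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(levels, levels[1:]):
            threshold = (low + high) / 2
            left = [row[-1] for row in rows if row[col] < threshold]
            right = [row[-1] for row in rows if row[col] >= threshold]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best


name, threshold, score = best_split(ROWS, THREATS)
print(f"first split: {name} >= {threshold:.2f} (remaining squared error {score:.1f})")
```

A full tree repeats this search inside each resulting group until a stopping rule is met, which is how different variables can end up on different branches.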