MIG Forensic 2020 – Synthèse¶

Taha Balkhi, Pierre-Louis Binachon, Thibaut Caillierez, Tristan Cotillard, Marguerite Dejean de La Bâtie, Louis Delmas, Anatole Hernot, Caroline Jeandat, Viviane Lesbre, François Mazé, Paul Meddeb, Victor Pachy, Alexis Sisca

This document is a brief summary of our notebooks and more generally of the work carried out during our project. For more information on how the algorithms work, the results obtained and the code of the algorithms, we invite the reader to refer to the notebooks related to each step of our work.

Context and presentation of the problem¶

Air Liquide is a multinational industrial company specialized in industrial gases, particularly used in industry, healthcare, environment, and research. Air Liquide is active in 80 countries and employs about 67,000 people across 500 sites. As a global company, Air Liquide runs several industrial sites all around the world. Nowadays, 21 sites in France and some more in Italy are controlled remotely from a control center near Lyon.

As for any industrial group, and particularly since Air Liquide uses remote control, it is crucial for the group to be able to detect effectively and rapidly anomalies in the physical signals of their machines.

Since 2016, Air Liquide has been achieving its digital revolution with a plan called Smart Innovative Operations (SIO) in all his industries. The SIO program features four main components: SIO.Drive (remote control of the industrial sites), SIO.Optim (optimization of production to consume the least energy possible), SIO.Perform (analysis of the production performances) and SIO.Predict. SIO.Predict consists in achieving predictive maintenance. It aims to detect anomalies as soon as they appear in order to gain time and be able to repair the machine without impacting production, therefore avoiding economic losses. Indeed, shutting down a machine implies halting the customer's otherwise uninterrupted or limited supply. In Air Liquide Large Industries activity, customers are linked directly to Air Liquide’s factory using a pipeline. The production cannot tolerate any supply interruption. By detecting an anomaly right from its start, maintenance operations will be able to be done when the customer does not need the gases, thus making Air Liquide's maintenance process more secure and efficient. With a broader perspective, predictive maintenance allows the company to drastically reduce the cost of regular servicing: the machines are examined whenever an anomaly is detected and systematic checks, which force a complete stop of the system, therefore decrease.

To achieve this goal, SIO.Predict uses a software called PRiSM which defines a “normal” operating area thanks to the past normal data and which raises an alarm when the current parameters exit this area for a certain period of time (1st criterion). In addition, and even if the parameters stay within the desired area, PRiSM raises an alarm if one value varies too quickly (2nd criterion).

Despite its efficiency in detecting most anomalies, Air Liquide's implementation of PRiSM presents a specific flaw when parameters start to oscillate, whithin the normal range of values. These anomalies are rare but can cause major industrial issues. Slow oscillating anomalies are too slow to be considered an anomaly (2nd criterion) but their time period is still too short for the signal to stay out of the “normal” area for long enough (1st criterion). Resultantly, slow-oscillating — and often highly concerning — signals cannot be detected, or at least not before some time has passed.

This issue was observed on oil temperature measurement of a nitrogen compressor, due to a phenomenon called varnishing. When the temperature increases, little crystals form in the lubrication oil (due to impurities) and hold onto metallic parts, thereby preventing the oil from flowing. The lack of lubrication generates even stronger friction between metallic parts, thus increasing even more the temperature. The crystals eventually end up being detached by friction which re-establishes oil circulation and leads to a temperature decrease, hence the oscillating movement.

Both safety and economic gains could be achieved by developing an algorithm able to detect this type of oscillating anomalies. Solving this problem was the aim of our project, in which we developed a solution to the following question:

How can such oscillating anomalies recorded within a set of sensor data be detected with absolute certainty?

Our method¶

Forensic engineering¶

Let’s start by introducing Forensic Engineering. It can be defined as the retrospective investigation into industrial malfunctions and failures. The general purpose of a forensic engineering investigation is to locate one or several causes of failure in order to improve performance and avoid another accident. As mentioned, Air Liquide’s software, PRiSM, has got difficulties detecting the slow and frequent oscillations that may at a certain point lead to an industrial disaster with a risk of serious injuries and significant damage to the involved machines. Applying the Forensic approach, we collected various data from Air Liquide’s database and attempted to identify the features best suited to detecting the problem. We had sensor data spanning over 3 years and for 13 different physical values available to us. We chose one specific temperature as the variable of interest (main feature) for the different algorithms, since it was the one for which the issue to detect was most obvious due to the direct correlation between the varnishing phenomenon and the measured temperature. For some algorithms based on neural networks, other variables were required; these are explanatory variables (features that help the prediction of points and which must be as uncorrelated as possible).

The algorithms we programmed are classification algorithms: the data collected by Air Liquide may or may not correspond to an anomaly and are classified as such by our algorithms. Generally speaking, the objective of classification is to predict the class $y$ of an object $x$ as well as possible from observations on the latter. To help understand, let us describe for a moment the mathematical modelling of a generic classification problem.

Let $H$ be a measure space. To each point $X$ corresponds a label $Y \in \{0, 1\}$. The goal of statistical learning is to construct a function $$ h : H \to \{0,1\} $$ called a classifier, such that $h(X)$ best predicts the label of $X$.

Suppose that the pair $(X,Y)$ is derived from a draw according to a $P$ law. For a given classifier, the probability of misclassification is $L(h) = P(Y \neq h(X))$. We are therefore looking for a classifier which minimises $L(h)$.

If we know the law $P$, $L(h)$ is minimal for the Bayes classifier: $h_*(X) = \left\{ \begin{matrix} 1 \ \mbox{if} \ \mathbb{P}(Y=1 \mid X) \geq 1/2 \\ 0 \ \mbox{otherwise} \end{matrix} \right.$. In general we do not know the law $P$ but only a set of learning data $(X_i, Y_i)_{i=1, \cdots, n}$ following the law $P$. Our goal is therefore to construct from these learning data a classifier $\hat h$ which minimises $L(\hat h)-L(h_*)$.

Since the law $P$ is unknown, we instead calculate the empirical probability of misclassification: $$ L_n(h) = \frac{1}{n} \sum_{i=1}^n(\delta(Y_i \neq h(X_i)) $$

To a set $D \subset \Bbb F(H, \{0,1\})$ of classifiers we can associate the empirical risk minimisation classifier: $$ h_{\mathrm{min}}(D) \in \mathrm{argmin}\ _{h \in D} \ L_n(h) $$

Finally the quality of the approximation is: $$ L(h_{\mathrm{min}}(D)) - L(h_*) = \biggl (\min\ _{h \in D}\ L(h) - L(h_*)\biggr) + \biggl(L(h_{\mathrm{min}}(D)) - \min\ _{h \in D}\ L(h)\biggr) $$

The first term is the approximation error which can be reduced by expanding the set of classifiers studied. The second term is the estimation error, which is due to the fact that the empirical cost has been minimized and not the true probability of misclassification. This second term tends to increase with $D$.

The choice of $D$ is therefore important since there is a balance to strike between the approximation and estimation errors. This is why we have programmed different algorithms that explore different types of classifiers: the LSTM, the GAN, and the quantile regression classifier. This allows us, on the one hand to see if one type of classifier approximates better than another, and on the other hand to obtain a better result relative to specific metrics of interest (recall and precision, as defined later).

Our solution to the problem¶

The following sections present briefly the preprocessing step, as well as the five algorithms that we implemented. For more detail, please see the specific notebooks which gives more explanations, some graphical representations and the code for the following methods :

Preprocessing¶

More details available in the associated Notebook.

First we needed to preprocess our data, since raw data from Air Liquide are not usable as such by machine learning algorithms: they contain noise, outliers due to sensor faults, and long-term variations (for example the temperature follows the natural trend of the outside temperature induced by the seasons). We need to get rid of these issues in order to implement and use most of our algorithms.

The first step was thresholding. It consists in applying thresholds on the different features we selected (based on their relative correlation) that just replace every absurd value (error reports from the sensors which are constant known integers) with NaN values, NaN standing for “Not a Number”.

Then, comes the gap filling step. We filled all the gaps created either by the thresholding that induces NaNs but also by the missing measurements that occurred during the monitoring (we aim to obtain values at regular time intervals). Those gaps are replaced by points calculated using a linear regression for several reasons: their impact on the global appearance of the curve will be reduced by the other steps of the preprocessing, and this regression method is simple and is the least computation-intensive. Finally our customer did not want us to create values out of the blue and a linear approximation was close enough to reality to be acceptable.

The standardization step consists in highlighting the anomalies. To do that, we calculate the mean and the standard deviation of the signal on most of the normal phases of our training dataset and use these values to standardize the signal for every selected feature.

The signal remains full of noise and the gap filling step induced some abnormal intervals. Therefore, we need to find a function meeting a few requirements:

Delete most of the noise
Smooth the steep variations created by the gap filling
Keep the global tendency of the signal

A compromise was found between the methods used and the bandwidth we wanted to work on in order to conciliate those requirements.

Now that high frequencies were erased from the spectrum, very low frequencies had to be cleaned out by the detrending step. This allowed us to keep only the frequencies we had to work on.

The signal — now preprocessed and cleared — could finally be fed into all of our algorithms, either for training them or for monitoring the machine. We strictly separated the test set from the train set, and kept a bit of data on the side for testing hyperparameter tuning. This is crucial for the machine learning algorithms that must not be tested on the data on which they were trained.

We decided to implement five different algorithms and then combined the results obtained using a voting system. The first two algorithms detect the oscillations (by analysing the signal’s structure): they are the Ruptures method and the EMD method.

Ruptures¶

More details available in the associated Notebook.

Ruptures is a Python library used to detect regime changes in a signal by returning a list of breakpoints. Only this library is not suitable for online operation. We decided to use the Pelt algorithm: its main attribute is that it doesn’t need to know in advance the number of breakpoints. We opted for optimizing the core of the Pelt algorithm so that it could adapt to online operation.

Rupture explanation — Figure 1: Ruptures-based algorithm detecting breakpoints in a test signal.

EMD¶

More details available in the associated Notebook.

EMD (Empirical Mode Decomposition) is an empirical method of signal processing which decomposes a signal into IMFs (Intrinsic Mode Functions). A Hilbert transform of these IMFs allowed us to obtain their amplitude, and their instantaneous frequencies. We then noticed that the weak oscillations which caused the problem were quite obvious on these IMFs and that we could detect them using this method.

EMD explanation 1 — Figure 2: IMFs of a test signal.

LSTM¶

More details available in the associated Notebook.

The Long and Short-Term Memory (LSTM) algorithm is a recurrent neural network that is able to predict the rest of the “Température palier étage 1” signal from the available signal history (past values) and that of other uncorrelated features (to maximize reliability) . When a signal is used as an input of the program, the output signal predicted for each point by the LSTM will be compared with the real one. If the two signals are different enough the input signal will be considered as abnormal.

LSTM explanation — Figure 4: Working principle of the LSTM algorithm.

Quantile regression¶

More details available in the associated Notebook.

Quantile regression is a method of regression that works on the quantiles instead of the actual values. It is implemented through set learning using weak classifiers — trees — which, once assembled using gradient boosting, produce a single strong classifier. The advantage is that the method is way more robust than the standard least square regression against outliers. Indeed, on a large set of data, quantiles better depict the curve itself. Hence, with two quantiles we can predict an interval inside of which the values are expected to be found. And, like for the LSTM, if a bunch of values are outside of the predicted nominal range, an alarm will be raised.

Quantile regression explanation — Figure 5: Upper and lower predicted quantiles compared with the actual signal on a normal set.

GAN¶

More details available in the associated Notebook.

A GAN (Generative Adversarial Network) is an unsupervised neural network which is comprised of two neural networks competing to achieve the best results: the generator and the discriminator. The generator’s job is to generate believable fake data from random noise in order to fool the discriminator. The discriminator’s job is to discriminate between fake and real data being sent to it. This implies that, once shown erroneous data or some with visible anomalies, the discriminator will be able to detect it and raise an alarm.

GAN explanation — Figure 6: Working principle of the GAN algorithm (*Source: Air Liquide*).

Integration to the interface¶

The interface is composed of several pages The main one displays the global alarm level depending on the system’s state obtained by the voting system (and shows its state at a given time with a month's history). Other pages are offered in order to see the individual results of the different algorithms.

Presentation and analysis of the results¶

To compare our algorithms and to have an idea of their performance, we used the precision and recall metrics. We focused on three variables:

$T_p$ = the number of true positive anomalies (that are detected by the algorithm as anomalies and that are real problems)
$F_p$ = the number of false positive anomalies (that are detected by the algorithm as anomalies but are not real problems)
$F_n$ = the number of false negative anomalies (that are not detected by the algorithm as anomalies but are real problems)

The definitions of the recall and precision are:

$\mbox{Recall} = \frac{T_p}{T_p + F_n}$
$\mbox{Precision} = \frac{T_p}{T_p + F_p}$
$\mathrm{F}_1\mbox{-Score} = \frac{2}{{1 \over \mathrm{Recall}} + {1 \over \mathrm{Precision}}}$

In other words, the recall quantifies the mistakes of an algorithm when not detecting an anomaly while the precision gives the percentage of true anomalies among all the anomalies detected.

We decided to privilege recall over precision because Air Liquide explicitly asked us to make sure all problems were detected. Since PRiSM already raises many false alarms, precision is less important but we still try to maximize it, if it does not impact the recall. Resultantly, recall, precision and F1_score (that combines the two) are key parameters to evaluate the performance of our algorithms.