Quantile Regression¶
from IPython.display import Image
I. Theory¶
How does quantile regression basically work?¶
A quantile regression is simply a regression method that predicts a given quantile of the target variable instead of its expected value.
What are its advantages?¶
The advantage is that the method is far more robust to outliers than standard least squares regression. Indeed, on a large dataset, quantiles capture the trend of the curve better than individual values. Hence, with two quantiles we can predict an interval in which the values are supposed to lie.
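To illustrate (a minimal sketch on synthetic data, separate from the dataset used later in this notebook), fitting two gradient-boosting quantile models already yields such an interval:
# Minimal sketch on synthetic data: two quantile regressors give a prediction
# interval that a handful of large outliers barely affects.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
rng = np.random.RandomState(0)
X = np.linspace(0, 10, 500).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=500)
y[::50] += 3.0  # a few large outliers
q_hi = GradientBoostingRegressor(loss='quantile', alpha=0.9).fit(X, y)
q_lo = GradientBoostingRegressor(loss='quantile', alpha=0.1).fit(X, y)
inside = np.mean((y >= q_lo.predict(X)) & (y <= q_hi.predict(X)))
print(f'fraction of points inside the [q10, q90] band: {inside:.2f}')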
A deeper understanding of the method:¶
To understand quantile regression we first need to understand the machine learning algorithm behind it: gradient boosting regression which uses decision trees.
Introduction to our dataset :¶
To train and test our algorithm, we have a dataset of regularly spaced points in time. Each point has $n$ dimensions (in our case $n=5$), each one corresponding to a feature selected during pre-processing. We represent the point at instant $t$ by:
$$ x(t) = \begin{pmatrix} \mbox{feature}_1(t)\\ \mbox{feature}_2(t)\\ \vdots \\ \mbox{feature}_n(t)\end{pmatrix} \quad \mbox{where} \quad \mbox{feature}_i(t) \in [a_i,b_i] \subset \mathbb{R} $$ So each point lies in $X = \displaystyle\prod_{i=1}^{n}[a_i,b_i]$.
Decision trees :¶
Most of us are used to using decision trees, mainly for conditional probabilities. To build a decision tree we start from a root situation and split it into a complete partition of every possible outcome; these are called branches. Depending on the outcome of the first situation we arrive at a second situation called a node, which can again be split into partitions, and so on until we reach a final situation called a leaf.
In our case, the root is $X$ and each branch leads to an element of a partition of the previous set. For example, if a node is a set $A$, each branch goes to a node $B_i$ where $(B_i)_{i \in [0,n]}$ is a partition of $A$. We repeat this process until a node contains fewer than min_samples_split points or until the tree reaches max_depth splits. Such a "dead-end" node is called a leaf and represents a possible outcome of the tree. Each leaf is associated with the value we want to predict, deduced from the samples that fall into its partition.
Image(filename ='images_iquantile/decisionTree.png')
Just above we can see an example of a tree with the generic notation (on the left) and a concrete example (on the right), where the objective is to determine $y$ as a function of a real-valued feature $x$. The predicted value of $y$ might not be exactly the one corresponding to a given $x$, but it is close enough if there are enough partitions.
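A minimal sketch of such a tree with scikit-learn's DecisionTreeRegressor (the x -> y relation here is arbitrary and purely illustrative):
# A regression tree yields a piecewise-constant prediction: each leaf stores a
# value deduced from the training samples it contains.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.5, size=200)
tree = DecisionTreeRegressor(max_depth=3, min_samples_split=5).fit(x, y)
print(tree.predict([[1.0], [1.05]]))  # nearby inputs often share the same leaf value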
From a forest of trees to a robust prediction¶
The problem with using only one tree is that the prediction is not very robust: a single tree is considered a weak learner. To solve this problem we use boosting, and more precisely a method called gradient boosting.
Boosting¶
Generally, boosting refers to a machine learning method that combines many weak learners (such as trees) to create a strong learner. We can illustrate the idea with the diagram below: each tree predicts a value, the trees can be weighted depending on how well they predicted the training data, and the weighted combination of these predictions is taken as the result.
Image("images_iquantile/Gradient-boosted-decision-tree.png")
Gradient boosting¶
Theory¶
We have a dataset $D_n=\{(x_1,y_1), ... , (x_n,y_n)\}$ with $x_i \in X\subset \mathbb{R}^d $ and $y_i \in Y\subset \mathbb{R}$. The output of this algorithm is the boosting predictor $\mathrm{reg}$: $$ \mathrm{reg} : \left\{ \begin{array}{ll} X \rightarrow \mathbb{R} \\ x \mapsto \hat{y} \end{array} \right. $$
This function computes, for each new value of the features X_test, the estimate y_test. It is a linear combination of simple predictors $h_k$, often decision trees: $$ h_k : X \rightarrow \mathbb{R}, \qquad \mathrm{reg} = \sum_{k=1}^{m} \beta_k \, h_k \ \ \mbox{with} \ \beta_k \in \mathbb{R} $$ These weak learners are chosen in order to minimize a cost function $C_n(\mathrm{reg})=\frac{1}{n} \sum_{i=1}^{n} \phi(\mathrm{reg}(x_i),y_i)$ with $\phi : \mathbb{R} \times Y \rightarrow \mathbb{R}_+$ a loss function that needs to be convex.
Noting $(X_0, Y_0)$ a generic pair of random variables with distribution $\mu_{X,Y}$, we have to minimize the quantity $$ C(\mathrm{reg})= \mathbb{E}\, \phi(\mathrm{reg}(X_0),Y_0) $$ over the linear combinations of functions in a given subset of $L^2(\mu_X)$.
We also introduce the subgradient $\xi(x,y)$ of $\phi(x,y)$ : $\xi(x,y) \in [ \partial _x ^- \phi(x,y) ; \partial _x ^+ \phi(x,y)]$
In order to prove the convergence of the algorithm used, we introduce some assumptions :
- A1 : C is locally bounded, $C(\mathrm{reg}) < \infty$
- A2 : $\exists \alpha>0, \ \forall y \in Y, \phi(\cdot,y)$ is $\alpha$-strongly convex: $\forall (x_1,x_2)\in\mathbb{R}^2, \forall t \in [0,1], \ \phi(tx_1+(1-t)x_2,y)\le t\phi(x_1,y)+(1-t)\phi(x_2,y) - \frac{\alpha}{2}t(1-t)(x_1-x_2)^2$
- A3 : $\exists L >0, \forall(x_1,x_2) \in \mathbb{R}^2, \ | \mathbb{E}(\xi(x_1,Y)-\xi(x_2,Y) \mid X) | \le L |x_1-x_2|$
We denote by $\mathrm{lin}(F)$ the set of all linear combinations of the weak learners $h$; we wish to locate the minimum of $C$ over $\mathrm{lin}(F)$. The gradient boosting approach consists in producing this linear combination of weak learners via a gradient descent-type algorithm.
The assumption A1 allows us to say that $\displaystyle \inf_{\mathrm{reg} \in \mathrm{lin}(F)} C(\mathrm{reg}) = \displaystyle \inf_{\mathrm{reg} \in \overline{\mathrm{lin}(F)}} C(\mathrm{reg})$
Furthermore, A2 shows that $ \exists ! \, \overline{\mathrm{reg}} \in \overline{\mathrm{lin}(F)}, \ C(\overline{\mathrm{reg}}) = \displaystyle \inf_{\mathrm{reg} \in \mathrm{lin}(F)} C(\mathrm{reg})$. This is the boosting predictor returned by the training algorithm.
Algorithm¶
It is an iterative algorithm based on a Taylor-type identity $$ C(\mathrm{reg})-C(\mathrm{reg}+wh) \approx \left\{ \begin{array}{ll} -w\langle \nabla C(\mathrm{reg}),h\rangle _{\mu_X} \ \text{if} \ \phi \ \text{is continuously differentiable in its first argument} \\ -\mathbb{E}\,\xi(\mathrm{reg}(X),Y)h(X) \ \ \mbox{otherwise} \end{array} \right. $$
Having $\mathrm{reg}_{t-1}$, we want to find $h_t$ by least-squares approximation of $-\xi(\mathrm{reg}_{t-1}(X),Y)$. We modify the collection of weak learners by considering a certain class $P$, a cone of $L^2(\mu_X)$. We want to choose $h_t \in \mathrm{argmin}_{h\in P} \mathbb{E}(-\xi(\mathrm{reg}_{t-1}(X),Y)-h(X))^2 = \mathrm{argmin}_{h\in P} \left( 2\,\mathbb{E}\,\xi(\mathrm{reg}_{t-1}(X),Y)h(X)+\|h\|^2_{\mu_X} \right)$
If the loss is continuously differentiable, $h_t \in \mathrm{argmin}_{h\in P} \sum_{i=1}^{n}(-\nabla C(\mathrm{reg}_{t-1}(X_i))-h(X_i))^2$ where, for the squared loss, $-\nabla C(\mathrm{reg}_{t-1}(X_i)) = Y_i-\mathrm{reg}_{t-1}(X_i)$
The general algorithm is :
- $\nu>0$ is the learning rate; it scales the step length of the gradient descent procedure
- $t = 0$: we initialize $\mathrm{reg}_0 \in \mathrm{lin}(F)$
- $h_{t+1} \in \mathrm{argmin}_{h \in P}$ $2\,\mathbb{E}\,\xi(\mathrm{reg}_t(X),Y)h(X)+\|h\|^2_{\mu_X}$
- $\mathrm{reg}_{t+1} = \mathrm{reg}_t + \nu \, h_{t+1}$
- $t \leftarrow t+1$
Convergence is guaranteed under assumptions A1, A2 and A3: for $(\mathrm{reg}_t)_t$ defined by this algorithm with $ 0<\nu<\frac{1}{2L}$, we have $\displaystyle{\lim_{t \to \infty}}C(\mathrm{reg}_t) = \displaystyle{\inf_{\mathrm{reg} \in \mathrm{lin}(P)}} C(\mathrm{reg})$
Finally, under these assumptions (A1, A2, A3 and $(\mathrm{reg}_t)_t$ defined by this algorithm) the algorithm converges towards the infimum of the risk function.
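For intuition, here is a minimal from-scratch sketch of this recursion, written for the squared loss (for which the negative subgradient is simply the residual, as noted above); the predictors used later in this notebook rely on scikit-learn's GradientBoostingRegressor instead.
# Sketch of reg_{t+1} = reg_t + nu * h_{t+1} with regression trees as weak learners.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
def fit_boosting(X, y, n_estimators=100, nu=0.1, max_depth=3):
    init = y.mean()                      # reg_0: a constant initial predictor
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_estimators):
        residual = y - pred              # negative (sub)gradient of the squared loss
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred = pred + nu * h.predict(X)  # gradient step reg_{t+1} = reg_t + nu * h_{t+1}
        trees.append(h)
    return init, trees
def predict_boosting(init, trees, X, nu=0.1):
    return init + nu * sum(h.predict(X) for h in trees)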
Quantile prediction¶
To predict quantiles, we use a sequence $ g = \{g_n\}_n $ of quantile prediction functions $g_n : \mathbb{R}^{n-1}\to \mathbb{R}$
After $n$ time instants, we define the cumulative quantile loss on the string $(Y_1, ... , Y_n)$ as $C_n(g) = \frac{1}{n} \sum_{t=1}^{n} \rho_\alpha(Y_t-g_t(Y_1, ... , Y_{t-1}))$ where $\phi = \rho_\alpha(y) = y\,(\alpha-\mathbb{1}_{y\le 0})$ is the pinball loss function, with $\alpha$ the quantile level.
We therefore aim to make $C_n(g)$, which represents the error of the prediction with respect to the observed values, as small as possible. However, there is a fundamental limit to quantile predictability; mathematically, $\liminf_n C_n(g) \ge C^*$.
In practice this value $C^*$ is not achievable, so we try to make $C_n(g)$ approach it.
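A minimal sketch of this loss in Python; scikit-learn's GradientBoostingRegressor with loss='quantile', used below, minimizes the same quantity.
# Pinball loss: rho_alpha(y - pred) = (y - pred) * (alpha - 1_{y - pred <= 0})
import numpy as np
def pinball_loss(y_true, y_pred, alpha):
    diff = y_true - y_pred
    return np.mean(diff * (alpha - (diff <= 0)))
# With alpha = 0.9, under-predictions cost nine times more than over-predictions,
# which pushes the fitted function towards the 90th percentile of y given x.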
II. Imports¶
dico = {"Description" : "Point Name",
"MOTOR CURRENT" : "PARD@3C52@3C52-M@JT7099.CAL",
"Température palier étage 1" : "PARD@3C52@3C52-M@TE7011A.PNT",
"Température palier étage2" : "PARD@3C52@3C52-M@TE7021A.PNT",
"Température palier étage 3" : "PARD@3C52@3C52-M@TE7031A.PNT",
"Température palier étage 4" : "PARD@3C52@3C52-M@TE7041A.PNT",
"Déplacement axiale 1/2" : "PARD@3C52@3C52-M@VT7001.PNT:RAW",
"Déplacement axiale 3/4" : "PARD@3C52@3C52-M@VT7002.PNT:RAW",
"1e stade vibration X" : "PARD@3C52@3C52-M@VT7011A.PNT:RAW",
"1er stade vibration Y" : "PARD@3C52@3C52-M@VT7011B.PNT:RAW",
"2e stade vibration X" : "PARD@3C52@3C52-M@VT7021A.PNT:RAW",
"2e stade vibration Y" : "PARD@3C52@3C52-M@VT7021B.PNT:RAW",
"3e stade vibration X" : "PARD@3C52@3C52-M@VT7031A.PNT:RAW",
"3e stade vibration Y" : "PARD@3C52@3C52-M@VT7031B.PNT:RAW",
"4e stade vibration X" : "PARD@3C52@3C52-M@VT7041A.PNT:RAW",
"4e stade vibration Y" : "PARD@3C52@3C52-M@VT7041B.PNT:RAW",
"Température huile sortie réfrigerant" : "PARD@3C52@3C52@TE7086.PNT",
"labels" : "labels"}
inv_dico = {v: k for k, v in dico.items()}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, ensemble
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
import pickle
We divide our data between training data that the algorithm will use to learn and the testing data, containing the known fluctuations, to verify the efficiency of the parameters found.
- normal data
df_gen = pd.read_csv('data_gan/normal_0_14000_detrended.csv')
motor_n1 = np.array(df_gen[:10000]['PARD@3C52@3C52-M@JT7099.CAL'])
temp_n1 = np.array(df_gen[:10000][dico['Température palier étage 1']])
depl1_n1 = np.array(df_gen[:10000][dico['Déplacement axiale 1/2']])
depl2_n1 = np.array(df_gen[:10000][dico['Déplacement axiale 3/4']])
vib_n1 = np.array(df_gen[:10000][dico['1e stade vibration X']])
label_n1 = np.array(df_gen[:10000][dico['labels']])
df_gen1 = pd.read_csv('data_gan/normal_66900_100032_detrended.csv')
motor_n2 = np.array(df_gen1[:]['PARD@3C52@3C52-M@JT7099.CAL'])
temp_n2 = np.array(df_gen1[:][dico['Température palier étage 1']])
depl1_n2 = np.array(df_gen1[:][dico['Déplacement axiale 1/2']])
depl2_n2 = np.array(df_gen1[:][dico['Déplacement axiale 3/4']])
vib_n2 = np.array(df_gen1[:][dico['1e stade vibration X']])
label_n2 = np.array(df_gen1[:][dico['labels']])
- abnormal data (containing a fault); df_gena3 will be used for the final tests on the predictors
df_gena1 = pd.read_csv('data_gan/n_anormal_43089_66900_detrended.csv')
motor_an1 = np.array(df_gena1[:]['PARD@3C52@3C52-M@JT7099.CAL'])
temp_an1 = np.array(df_gena1[:][dico['Température palier étage 1']])
depl1_an1 = np.array(df_gena1[:][dico['Déplacement axiale 1/2']])
depl2_an1 = np.array(df_gena1[:][dico['Déplacement axiale 3/4']])
vib_an1 = np.array(df_gena1[:][dico['1e stade vibration X']])
label_an1 = np.array(df_gena1[:][dico['labels']])
df_gena2 = pd.read_csv('data_gan/n_anormal_14000_43089_detrended.csv')
motor_an2 = np.array(df_gena2[:]['PARD@3C52@3C52-M@JT7099.CAL'])
temp_an2 = np.array(df_gena2[:][dico['Température palier étage 1']])
depl1_an2 = np.array(df_gena2[:][dico['Déplacement axiale 1/2']])
depl2_an2 = np.array(df_gena2[:][dico['Déplacement axiale 3/4']])
vib_an2 = np.array(df_gena2[:][dico['1e stade vibration X']])
label_an2 = np.array(df_gena2[:][dico['labels']])
df_gena3 = pd.read_csv('data_gan/n_anormal_100032_end_detrended.csv')
motor_an3 = np.array(df_gena3[:]['PARD@3C52@3C52-M@JT7099.CAL'])
temp_an3 = np.array(df_gena3[:][dico['Température palier étage 1']])
depl1_an3 = np.array(df_gena3[:][dico['Déplacement axiale 1/2']])
depl2_an3 = np.array(df_gena3[:][dico['Déplacement axiale 3/4']])
vib_an3 = np.array(df_gena3[:][dico['1e stade vibration X']])
label_an3 = np.array(df_gena3[:][dico['labels']])
We concatenate all the features into a test array and a train array, those will then be given to the algorithm.
X_train = np.array([motor_n1, temp_n1, depl1_n1, depl2_n1, vib_n1]).T
y_train = X_train[1:,1]
X_train = X_train[:-1]
label_train = label_n1
X_test = np.array([motor_an1, temp_an1, depl1_an1, depl2_an1, vib_an1]).T
y_test = X_test[1:,1]
X_test = X_test[:-1]
label_test = label_an1
We choose some hyper-parameters:
- n_estimators: number of weak learners used to build the final gradient boosting predictor, i.e. the number of boosting stages to perform
- max_depth: controls the size of each tree, which also impacts the time needed to train the predictors
- learning_rate: hyper-parameter in the range [0.0, 1.0] that controls overfitting; it is the step size in the gradient descent algorithm
- min_samples_split: minimum number of sample points required to split a node
- loss: cost function of the algorithm; for quantile regression, the quantile (pinball) loss
- alpha: the quantile level to predict
We are going to choose them later to have the best prediction possible.
alpha = 0.9
params = {'n_estimators': 350,
'max_depth': 6,
'alpha' : alpha,
'min_samples_split': 5,
'learning_rate': 0.1,
'loss': 'quantile'}
buffer_time = 504
maj = 700 # justified later
III. Function¶
We train our predictors with the learning data. reg is a boosting predictor trained on the given data, here X_train and y_train. We create two such predictors, reg_up and reg_low, trained to predict the upper and lower quantiles. First we predict the upper bound by applying the predict method to X_test.
reg_up = ensemble.GradientBoostingRegressor(**params)
reg_up = reg_up.fit(X_train, y_train)
y_upper_test = reg_up.predict(X_test)
y_upper_train = reg_up.predict(X_train)
Then the lower bound.
reg_low = ensemble.GradientBoostingRegressor(**params)
reg_low.set_params(alpha=1.0 - alpha)
reg_low = reg_low.fit(X_train, y_train)
y_lower_test = reg_low.predict(X_test)
y_lower_train = reg_low.predict(X_train)
IV. Plotting¶
On the training data :
fig = plt.figure(figsize = (20,10))
line3 = plt.plot(y_train,'b',label=u'Observations')
line1 = plt.plot(y_upper_train, 'r--', label='Upper quartile')
line2 = plt.plot(y_lower_train, 'g--',label='Lower quartile')
"""plt.fill(np.concatenate([xx, xx[::-1]]),
np.concatenate([y_upper, y_lower[::-1]]),
alpha=.5, fc='b', ec='None', label='95% prediction interval')"""
plt.xlabel('$t$')
plt.ylabel('Température')
plt.fill_between(range(0,len(y_upper_train)),y_upper_train, y_lower_train, alpha =0.1)
plt.legend()
plt.xlim(1000,5000)
plt.show()
If we zoom on a small interval we see that the quantiles are very close to the real data :
fig = plt.figure(figsize = (20,10))
line3 = plt.plot(y_train,'b',label=u'Observations')
line1 = plt.plot(y_upper_train, 'r--', label='Upper quartile')
line2 = plt.plot(y_lower_train, 'g--',label='Lower quartile')
plt.xlabel('$t$')
plt.ylabel('Température')
plt.fill_between(range(0,len(y_upper_train)),y_upper_train, y_lower_train, alpha =0.1)
plt.legend()
plt.xlim(300,700)
plt.ylim(-0.05,0.01)
(-0.05, 0.01)
We verify that approximately $2\alpha-1$ of the values (here about 80%) lie between the two quantiles.
nb_up, nb_low = 0, 0
for k, ele in enumerate(y_train):
    if ele > y_upper_train[k]:
        nb_up += 1
    elif ele < y_lower_train[k]:
        nb_low += 1
nb_tot = len(y_train)
print(f'percent above :{nb_up/nb_tot}')
print(f'percent below : {nb_low/nb_tot}')
percent above :0.0997099709970997 percent below : 0.09760976097609761
This is very close to what was expected: with quantile levels 0.9 and 0.1 we find around 80% of all the values inside the calculated interval. The predictor has been correctly trained; we can now apply it to the test data to verify the choice of hyper-parameters.
On the test data which is supposed to contain problems
fig = plt.figure(figsize = (20,10))
line1 = plt.plot(y_test,'b',label=u'Observations')
line2 = plt.plot(y_upper_test, 'r--', label = 'Upper quartile')
line3 = plt.plot(y_lower_test, 'g--', label='Lower quartile')
"""plt.fill(np.concatenate([xx, xx[::-1]]),
np.concatenate([y_upper, y_lower[::-1]]),
alpha=.5, fc='b', ec='None', label='95% prediction interval')"""
plt.xlabel('$t$')
plt.ylabel('Température')
plt.fill_between(range(0,len(y_upper_test)),y_upper_test, y_lower_test, alpha =0.1)
plt.plot(label_test/3, label = 'labels ')
plt.xlim(10000,25000)
plt.legend()
plt.show()
On the localized problem
fig = plt.figure(figsize = (20,10))
line1 = plt.plot(y_test,'b',label=u'Observations')
line2 = plt.plot(y_upper_test, 'r--', label = 'Upper quartile')
line3 = plt.plot(y_lower_test, 'g--', label='Lower quartile')
plt.xlabel('$t$')
plt.ylabel('Température')
plt.fill_between(range(0,len(y_upper_test)),y_upper_test, y_lower_test, alpha =0.1)
plt.xlim(7000,10000)
#plt.ylim(-0.2,0.2)
plt.legend()
plt.show()
To verify that this portion of the signal contains an error, we calculate again the percentage of values above and below the two quantiles.
nb_up, nb_low = 0, 0
for k, ele in enumerate(y_test):
    if ele > y_upper_test[k]:
        nb_up += 1
    elif ele < y_lower_test[k]:
        nb_low += 1
nb_tot = len(y_test)
print(f'percent above :{nb_up/nb_tot}')
print(f'percent below : {nb_low/nb_tot}')
percent above :0.2127677446451071 percent below : 0.200839983200336
Thus, when the data is supposed to show a problem, around 40% of the values fall outside the predicted interval. Our predictor does not reproduce these variations; they can therefore be considered as problematic.
V. Interpretation of feature importance¶
feature_importance = reg_low.feature_importances_
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
fig = plt.figure(figsize=(12, 6))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, np.array(['motor', 'temp', 'depl1', 'depl2', 'vib'])[sorted_idx])
plt.title('Feature Importance (MDI)')
Text(0.5, 1.0, 'Feature Importance (MDI)')
We see here that the most important feature is the temperature. This is consistent with the raw data we saw at the beginning: the anomaly was located on the temperature feature.
VI. Creating an alarm¶
We focus on the area between the observed signal and the predicted upper and lower quantiles: the two variables up and low measure how far the signal goes above the upper quantile and below the lower quantile, and a strictly positive value points out a potential problem. In order to avoid too many false alarms, we compute this area over a two-day window (288 points). Finally we put a threshold on this area, expressed relative to the standard area (the one obtained when there is no problem).
from scipy.integrate import trapz
- up > 0 : error
- low > 0 : error
To integrate we use scipy's trapz function, which computes the integral directly from a numpy array.
To choose our threshold, we first compute the standard area between the out-of-quantile data and the quantiles on normal data. The threshold is then a multiple of this area, with a factor maj that we have to tune. We choose maj = 700 to optimize the recall and the F1 score.
def inte(y, y_lower, y_upper):
    # Mean area, over a sliding two-day window (288 points), between the signal
    # and the quantile band, counting only the parts outside the band.
    inte = 0
    for k in range(len(y)):
        if k >= buffer_time:
            ecart_up = y[k : k+288] - y_upper[k : k+288]    # excess above the upper quantile
            ecart_up[ecart_up < 0] = 0
            up = trapz(ecart_up)
            ecart_low = y_lower[k : k+288] - y[k : k+288]   # deficit below the lower quantile
            ecart_low[ecart_low < 0] = 0
            low = trapz(ecart_low)
            inte += up + low
    inte_mean = inte / (len(y) - buffer_time)
    return inte_mean
standard_area = inte(y_train,y_lower_train,y_upper_train)
print(standard_area)
0.010561052426917341
def error(y_test, y_lower_test, y_upper_test, begin, maj):
    # Area outside the quantile band over the next two days (288 points);
    # an error is flagged when it exceeds maj times the standard area.
    ecart_up = y_test[begin : begin+288] - y_upper_test[begin : begin+288]
    ecart_up[ecart_up < 0] = 0
    up = trapz(ecart_up)
    ecart_low = y_lower_test[begin : begin+288] - y_test[begin : begin+288]
    ecart_low[ecart_low < 0] = 0
    low = trapz(ecart_low)
    inte = up + low
    if inte > maj * standard_area:
        return 1  # error detected
    else:
        return 0  # normal situation
er = np.zeros(len(y_test))
for k in range(len(y_test)):
if k >= buffer_time :
er_loc = error(y_test, y_lower_test, y_upper_test, k, maj)
if er_loc == 1 :
er[k] = 1
plt.figure(figsize=(20,10))
plt.plot(er, label = '1: found anomalies')
plt.plot(label_test/6, label = 'labels')
plt.plot(y_test, label = 'signal')
plt.legend();
Even if we can see some false positives between 18000 and 25000, the labelling itself can be questioned: some important variations are visible even where the label says normal.
VII. Quantifying the error and choosing the hyper-parameters¶
In order to quantify the error and test the accuracy, we compute three metrics for different values of the parameters:
- Precision is the fraction of relevant instances among the retrieved instances
- Recall is the fraction of the total amount of relevant instances that were actually retrieved
- F1 score is the harmonic mean of the precision and the recall
For the issue we are trying to tackle, we want the recall to be as high as possible (close to 1), even if the precision is not ideal. Finally we adjust the remaining parameters while trying to maximize the F1 score.
$\mathrm{precision} = \frac{\mathrm{true\ positives}}{\mathrm{true\ positives}\ +\ \mathrm{false\ positives}}$ and $\mathrm{recall} = \frac{\mathrm{true\ positives}}{\mathrm{true\ positives}\ +\ \mathrm{false\ negatives}}$, finally $F1\ \mathrm{score} = 2\cdot\frac{\mathrm{recall}\cdot\mathrm{precision}}{\mathrm{recall}\ +\ \mathrm{precision}}$
positive_label = np.count_nonzero(label_test)
positive_predict = np.count_nonzero(er)
label_mask = label_test != 0
er_mask = er > 0
true_positive = label_mask[:-1]*er_mask
tp = np.count_nonzero(true_positive)
label_mask1 = label_test != 0
er_mask1 = er == 0
false_neg = label_mask1[:-1]*er_mask1
fn = np.count_nonzero(false_neg)
label_mask2 = label_test == 0
false_pos = label_mask2[:-1]*er_mask
fp = np.count_nonzero(false_pos)
true_neg = len(er) - np.count_nonzero(true_positive)-np.count_nonzero(false_neg)-np.count_nonzero(false_pos)
precision_before = tp /(tp+fp)
recall_before = tp/(tp+fn)
f_1_score_before = 2*(precision_before*recall_before)/(precision_before + recall_before)
print(f"precision= {precision_before}")
print(f"recall= {recall_before}")
print(f"f1 score= {f_1_score_before}")
precision= 0.6724587963308226 recall= 0.8497785317123739 f1 score= 0.7507910074937552
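These numbers can be cross-checked with scikit-learn's metric helpers; a minimal sketch, assuming the label_test and er arrays defined above and the same binarization as the masks in the previous cell.
# Cross-check of the manual counts with sklearn.metrics (illustrative sketch).
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = (label_test[:-1] != 0).astype(int)  # ground-truth anomaly flags, aligned with er
y_pred = (er > 0).astype(int)                # alarms raised by the detector
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))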
To visualize the error :¶
In order to avoid false negatives, we add another error function, ER, which keeps the alarm raised if an anomaly was detected within the last buffer_time points (504 points, about 3.5 days).
The goal is to avoid describing the system as normal between two big oscillations, which are systematically signs of anomalies. Our system easily detects large variations; however, between two anomalies the signal can 'come back to normal', stopping the alarm sent by our program even though the dangerous phenomenon is still present. This is why we give the alarm a memory: it stays in an alert state if an alarm has been raised within this window. To differentiate the memory alarm from the real-time detection we introduce a colour code:
- green if no abnormal oscillation has been detected within the memory window $\rightarrow$ 0
- orange if no abnormal oscillation is currently detected but one was detected within the memory window $\rightarrow$ 1
- red if a problem is currently detected $\rightarrow$ 2
ER = np.zeros(len(y_test))
memory = False
for k in range(len(y_test)):
if k >= buffer_time :
if er[k] == 1 :
memory = True
ER[k] = 1
elif memory and er[k-buffer_time:k].any() :
ER[k] = 1
elif not(er[k-buffer_time:k].any()) :
memory = False
Colour = len(y_test) * [0]
for k, ele in enumerate(y_test):
if er[k] == 1 and ER[k] == 1:
Colour[k] = 2
elif er[k] == 0 and ER[k] == 1:
Colour[k] = 1
plt.figure(figsize=(20,10))
plt.plot(Colour, label = '1: recent anomalies 2 : found anomalies')
plt.plot(label_test, label = 'labels')
plt.plot(y_test)
plt.legend();
positive_label = np.count_nonzero(label_test)
positive_predict = np.count_nonzero(ER)
label_mask = label_test != 0
ER_mask = ER > 0
true_positive = label_mask[:-1]*ER_mask
tp = np.count_nonzero(true_positive)
label_mask1 = label_test != 0
ER_mask1 = ER == 0
false_neg = label_mask1[:-1]*ER_mask1
fn = np.count_nonzero(false_neg)
label_mask2 = label_test == 0
false_pos = label_mask2[:-1]*ER_mask
fp = np.count_nonzero(false_pos)
true_neg = len(ER) - np.count_nonzero(true_positive)-np.count_nonzero(false_neg)-np.count_nonzero(false_pos)
precision_after = tp /(tp+fp)
recall_after = tp/(tp+fn)
f_1_score_after = 2*(precision_after*recall_after)/(precision_after + recall_after)
print(f"precision= {precision_after}")
print(f"recall= {recall_after}")
print(f"f1 score= {f_1_score_after}")
precision= 0.6444906444906445 recall= 0.9640938648572236 f1 score= 0.7725419120978704
With this memory system the precision decreases; yet for our analysis it is the recall that should be close to one, in order to avoid false negatives.
By testing our algorithm with different values of learning_rate, alpha, n_estimators, max_depth, buffer and min_samples_split, we looked for the set of values that brings the recall near 1 and maximizes the F1 score.
First we fix the learning_rate with a given set of parameters:
Image(filename ='images_iquantile/learningrate.png')
As we can see, learning_rate = 0.1 is a very good choice: even if the recall is lower there, the algorithm actually spots every irregularity. We then test many sets of parameters and plot the recall as a function of the precision (figure below).
Image(filename = 'images_iquantile/precision-recall.png')
Finally we keep as hyper-parameters:
- alpha = 0.85
- n_estimators = 300
- max_depth = 4
- buffer = 504 (3.5 days)
- min_samples_split = 3
With the same idea we tested different values of learning_rate and compared the resulting precision, recall and F1 score; we kept learning_rate = 0.1. A sketch of such a sweep is shown below.
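A sketch of the kind of sweep we ran (illustrative only, not re-executed here; the figures above come from the actual runs):
# For each candidate learning rate, retrain both quantile predictors, then
# recompute er, ER and the precision / recall / F1 metrics as in the cells above.
sweep = {}
for lr in [0.01, 0.05, 0.1, 0.2]:
    p_up = dict(params, learning_rate=lr)
    p_low = dict(p_up, alpha=1.0 - p_up['alpha'])
    r_up = ensemble.GradientBoostingRegressor(**p_up).fit(X_train, y_train)
    r_low = ensemble.GradientBoostingRegressor(**p_low).fit(X_train, y_train)
    sweep[lr] = (r_up.predict(X_test), r_low.predict(X_test))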
VIII. Final training of the predictor¶
X_train_final = np.array([motor_n2, temp_n2, depl1_n2, depl2_n2, vib_n2]).T
y_train_final = X_train_final[1:,1]
X_train_final = X_train_final[:-1]
label_train_final = label_n2
X_test_final = np.array([motor_an3, temp_an3, depl1_an3, depl2_an3, vib_an3]).T
y_test_final = X_test_final[1:,1]
X_test_final = X_test_final[:-1]
label_test_final = label_an3
First we create two new predictors with the right parameters, then we train them on all the 'normal' data we have.
params = {'n_estimators': 300,
'max_depth': 4,
'alpha' : 0.85,
'min_samples_split': 3,
'learning_rate': 0.1,
'loss': 'quantile'}
reg_up.set_params(**params)
buffer_time = 504
params = {'n_estimators': 300,
'max_depth': 4,
'alpha' : 0.15,
'min_samples_split': 3,
'learning_rate': 0.1,
'loss': 'quantile'}
reg_low.set_params(**params)
GradientBoostingRegressor(alpha=0.15, loss='quantile', max_depth=4, min_samples_split=3, n_estimators=300)
reg_up.fit(X_train_final, y_train_final)
reg_low.fit(X_train_final, y_train_final)
GradientBoostingRegressor(alpha=0.15, loss='quantile', max_depth=4, min_samples_split=3, n_estimators=300)
We save those models to be able to use them without retraining.
model1 = 'finalized_predictor_up.sav'
pickle.dump(reg_up, open(model1, 'wb'))
model2 = 'finalized_predictor_low.sav'
pickle.dump(reg_low, open(model2, 'wb'))
We then predict our quantiles on the test data:
y_upper_test_f = reg_up.predict(X_test_final)
y_lower_test_f = reg_low.predict(X_test_final)
To verify the efficiency of the training we calculate the error made on the different datasets that have not yet been used.
er_f = np.zeros(len(y_test_final))
for k in range(len(y_test_final)):
if k >= buffer_time :
er_loc = error(y_test_final, y_lower_test_f, y_upper_test_f, k, maj)
if er_loc == 1 :
er_f[k] = 1
ER_f = np.zeros(len(y_test_final))
memory_f = False
for k in range(len(y_test_final)):
if k >= buffer_time :
if er_f[k] == 1 :
memory_f = True
ER_f[k] = 1
elif memory_f and er_f[k-buffer_time:k].any() :
ER_f[k] = 1
elif not(er_f[k-buffer_time:k].any()) :
memory_f = False
positive_label_f = np.count_nonzero(label_test_final)
positive_predict_f = np.count_nonzero(er_f)
label_mask_f = label_test_final != 0
ER_mask_f = ER_f > 0
true_positive_f = label_mask_f[1:]*ER_mask_f
tp_f = np.count_nonzero(true_positive_f)
label_mask1_f = label_test_final != 0
ER_mask1_f = ER_f == 0
false_neg_f = label_mask1_f[1:]*ER_mask1_f
fn_f = np.count_nonzero(false_neg_f)
label_mask2_f = label_test_final == 0
ER_mask2_f = ER_f > 0
false_pos_f = label_mask2_f[1:]*ER_mask2_f
fp_f = np.count_nonzero(false_pos_f)
true_neg_f = len(ER_f) - tp_f-fn_f-fp_f
precision_f = tp_f /(tp_f+fp_f)
recall_f = tp_f/(tp_f+fn_f)
f_1_score_f = 2*(precision_f*recall_f)/(precision_f + recall_f)
print(f"precision= {precision_f}")
print(f"recall= {recall_f}")
print(f"f1 score= {f_1_score_f}")
precision= 0.575109282591926 recall= 0.9951058907278876 f1 score= 0.7289378483199167
Colour_f = np.zeros(len(y_test_final))
for k, ele in enumerate(y_test_final):
if er_f[k] == 1 and ER_f[k] == 1:
Colour_f[k] = 2
elif er_f[k] == 0 and ER_f[k] == 1:
Colour_f[k] = 1
fig = plt.figure(figsize = (20,10))
line1 = plt.plot(y_test_final,'b',label='Observations')
line2 = plt.plot(y_upper_test_f, 'r--', label = 'Upper quartile')
line3 = plt.plot(y_lower_test_f, 'g--', label='Lower quartile')
line4 = plt.plot(label_test_final/3,label = 'labels')
plt.xlabel('$t$')
plt.ylabel('Température')
plt.fill_between(range(0,len(y_upper_test_f)),y_upper_test_f, y_lower_test_f, alpha =0.1)
plt.xlim(20000,25000)
plt.legend()
<matplotlib.legend.Legend at 0x7fedd5752510>
plt.figure(figsize=(20,10))
plt.plot(label_test_final, label = 'labels')
plt.plot(Colour_f, label='color label')
plt.legend();
IX. Final function¶
Taking as argument a csv file of data:
def quantile_interval(file_data, model_up, model_low):
df_data = pd.read_csv(file_data)
motor = np.array(df_data['PARD@3C52@3C52-M@JT7099.CAL'])
temp = np.array(df_data[dico['Température palier étage 1']])
depl1 = np.array(df_data[dico['Déplacement axiale 1/2']])
depl2 = np.array(df_data[dico['Déplacement axiale 3/4']])
vib = np.array(df_data[dico['1e stade vibration X']])
time = np.array(df_data['index'])
reg_up = pickle.load(open(model_up, 'rb'))
reg_low = pickle.load(open(model_low, 'rb'))
X = np.array([motor, temp, depl1, depl2, vib]).T
y = X[1:,1]
X = X[:-1]
y_upper = reg_up.predict(X)
y_lower = reg_low.predict(X)
er_f = np.zeros(len(X))
for k in range(len(X)):
if k >= buffer_time :
er_loc = error(y, y_lower, y_upper, k, maj)
if er_loc == 1 :
er_f[k] = 1
ER_f = np.zeros(len(X))
memory_f = False
for k in range(len(X)):
if k >= buffer_time :
if er_f[k] == 1 :
memory_f = True
ER_f[k] = 1
elif memory_f and er_f[k-buffer_time:k].any() :
ER_f[k] = 1
elif not(er_f[k-buffer_time:k].any()) :
memory_f = False
Colour_f = np.zeros(len(X))
for k, ele in enumerate(X):
if er_f[k] == 1 and ER_f[k] == 1:
Colour_f[k] = 2
elif er_f[k] == 0 and ER_f[k] == 1:
Colour_f[k] = 1
names = np.array(['timestamp', 'status', 'lower quantile', 'upper quantile'])
features = np.array([time[-5000:], Colour_f[-5000:].astype(int), y_lower[-5000:], y_upper[-5000:]],dtype = object).T
final = np.vstack((names,features))
return final,temp
plt.figure(figsize=(20,10))
ret, y = quantile_interval('data_gan/n_anormal_14000_43089_detrended.csv', 'finalized_predictor_up.sav', 'finalized_predictor_low.sav')
plt.plot(ret[1:,1], label = 'color')
plt.plot(ret[1:,3], label = 'y_upper')
plt.plot(ret[1:,2], label = 'y_lower')
plt.plot(y[-5000:], label = 'values')
plt.legend()
<matplotlib.legend.Legend at 0x7fedd5216690>
plt.figure(figsize=(20,10))
ret, y = quantile_interval('data_gan/n_anormal_43089_66900_detrended.csv', 'finalized_predictor_up.sav', 'finalized_predictor_low.sav')
plt.plot(ret[1:,1], label = 'color')
plt.plot(ret[1:,3], label = 'y_upper')
plt.plot(ret[1:,2], label = 'y_lower')
plt.plot(y[-5000:], label = 'values')
plt.legend()
<matplotlib.legend.Legend at 0x7fedd5278a50>
plt.figure(figsize=(20,10))
ret, y = quantile_interval('data_gan/n_anormal_100032_end_detrended.csv', 'finalized_predictor_up.sav', 'finalized_predictor_low.sav')
plt.plot(ret[1:,1], label = 'color')
plt.plot(ret[1:,3], label = 'y_upper')
plt.plot(ret[1:,2], label = 'y_lower')
plt.plot(y[-5000:], label = 'values')
plt.legend()
<matplotlib.legend.Legend at 0x7fedd5023710>
X. References¶
- G. Biau and B. Cadre. Optimization by gradient boosting. 17 July 2017. hal-01562618.
- G. Biau and B. Patra. Sequential Quantile Prediction of Time Series. DOI: 10.1109/TIT.2011.2104610.