Long short-term memory (LSTM)¶
from IPython.display import Image
I. Principle of the algorithm¶
This algorithm consists of a neural network, more precisely a long short-term memory (LSTM) recurrent neural network.
The idea is to code an algorithm able to predict the next values of the signal 'Température palier étage 1' when given the preceding ones (the past values of 'Température palier étage 1', but also some other features that are independent enough of it). When a signal is fed to the program, we then compare the signal predicted by the LSTM with the actual following signal.
If the two signals are different enough (in a sense that will be defined later), the following signal will be considered as abnormal.
Below is a diagram of the LSTM structure that we use to predict the $(N+1)^{th}$ point when given the points ranging from $1$ to $N$.
Image(filename ='images_lstm/schema_lstm.jpg')
We then iterate the above structure $P$ times in order to predict the $P$ following points, using the actual points of the signal as inputs at every step. This step is illustrated below.
Image(filename ='images_lstm/schema_lstm_2.jpg')
II. Preparation of the datasets used for training and testing¶
We started by separating our data into several segments, each belonging to one of these two categories:
- Normal signals
- Abnormal signals (oscillations) preceded by normal signals
This segmentation was done on the signal 'Température palier étage 1', since it is the signal on which we intend to detect oscillations.
To do so, we used the labelling done previously. Nonetheless, we decided to relabel some values: notably, the 'normal' parts separating two 'abnormal' segments were often treated as abnormal, since we want the algorithm to detect a hazardous oscillation as soon as possible. In addition, we relabelled some 'ambiguous' values as 'abnormal'.
The first category of segments (normal signals) will later be used for training, while the second one (abnormal signals preceded by normal signals) will be used for testing.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch as th
import csv
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchviz import make_dot
%load_ext tensorboard
dico = {"Description" : "Point Name",
"MOTOR CURRENT" : "PARD@3C52@3C52-M@JT7099.CAL",
"Température palier étage 1" : "PARD@3C52@3C52-M@TE7011A.PNT",
"Température palier étage2" : "PARD@3C52@3C52-M@TE7021A.PNT",
"Température palier étage 3" : "PARD@3C52@3C52-M@TE7031A.PNT",
"Température palier étage 4" : "PARD@3C52@3C52-M@TE7041A.PNT",
"Déplacement axiale 1/2" : "PARD@3C52@3C52-M@VT7001.PNT:RAW",
"Déplacement axiale 3/4" : "PARD@3C52@3C52-M@VT7002.PNT:RAW",
"1e stade vibration X" : "PARD@3C52@3C52-M@VT7011A.PNT:RAW",
"1er stade vibration Y" : "PARD@3C52@3C52-M@VT7011B.PNT:RAW",
"2e stade vibration X" : "PARD@3C52@3C52-M@VT7021A.PNT:RAW",
"2e stade vibration Y" : "PARD@3C52@3C52-M@VT7021B.PNT:RAW",
"3e stade vibration X" : "PARD@3C52@3C52-M@VT7031A.PNT:RAW",
"3e stade vibration Y" : "PARD@3C52@3C52-M@VT7031B.PNT:RAW",
"4e stade vibration X" : "PARD@3C52@3C52-M@VT7041A.PNT:RAW",
"4e stade vibration Y" : "PARD@3C52@3C52-M@VT7041B.PNT:RAW",
"Température huile sortie réfrigerant" : "PARD@3C52@3C52@TE7086.PNT",
"labels" : "labels"}
device = "cuda" if th.cuda.is_available() else "cpu"
# use the GPU if one is available in order to speed up computation
print(device)
cuda
Below, we have represented the signal with the different segments that we identified as, respectively: normal, normal then abnormal, normal then abnormal, normal, normal then abnormal.
We have decided to precede every abnormal area with a normal area of 7000 points.
df = pd.read_csv("clean_data_huile.csv")
tempetage1 = df[['PARD@3C52@3C52-M@TE7011A.PNT','labels']]
value = tempetage1['PARD@3C52@3C52-M@TE7011A.PNT']
label = tempetage1['labels']
plt.figure(figsize = (15,6))
plt.grid()
plt.plot(value, label = "Signal of 'Température palier étage 1'")
plt.plot(0.2*label, label = "Labelling")
plt.vlines(10000, -2, 2, colors = 'r', label = "Segments that divide the dataset according to its type")
plt.vlines(43089, -2, 2, colors = 'r')
plt.vlines(66900, -2, 2, colors = 'r')
plt.vlines(100032, -2, 2, colors = 'r')
plt.legend()
plt.show()
The graph above shows how we split the data for training and testing.
df_train_1 = df[0: 10000] # first normal set
df_test_1 = df[14000:43089].reset_index() # first abnormal test set
df_test_2 = df[43089: 66900].reset_index() # second abnormal test set
df_train_2 = df[66900:100032] # second normal set
df_test_3 = df[100032:].reset_index() # third abnormal test set
df_train = pd.concat((df_train_1, df_train_2), axis = 0).reset_index() # train on all normal data
df_test = df_test_3[:5000] # normal prefix of the third test set, used to monitor the test loss during training
df_test_perf = df[43089:] # mixed normal/abnormal set used for the performance evaluation
III. The algorithm¶
The LSTM takes as input, at each time step, the 3 values of the 3 features at time $t$, and returns the feature "Température palier étage 1" at time $t+1$. More precisely, the three input features are "Température palier étage 1", "MOTOR CURRENT" and "Température huile sortie réfrigerant", which are sufficiently decorrelated.* The model is trained on normal data (with no anomaly) so that it predicts the signal correctly on a normal set and diverges when encountering an abnormal set. By computing the difference between the prediction and the actual data, it is therefore possible to tell whether there is an anomaly.
*See LSTM_ANNEXE for more information about the choice of features, in particular the influence of the number of features on the precision of the LSTM.
Preparation of the data¶
The next functions build the data fed into the LSTM.
input_features = [dico["Température palier étage 1"], dico["MOTOR CURRENT"], dico["Température huile sortie réfrigerant"]]
output_features = [dico["Température palier étage 1"]]
input_dim = len(input_features)
output_dim = len(output_features)
sample_size = 20
def compute(df, window) :
    """
    returns a dataframe where each row holds, for every input feature,
    its current value plus its (window - 1) previous values,
    added as shifted columns
    """
    df = df.loc[:, input_features].copy() # copy to avoid modifying the original dataframe
    for value in range(1, window) :
        for feature in input_features :
            # column "<feature><k>" contains the feature value k steps in the past
            df[feature + str(value)] = df[feature].shift(value)
    return df
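To make the shift logic of compute concrete, here is a toy example (illustrative values only, a single feature f, window = 3); each row ends up holding the current value and the two previous ones:
toy = pd.DataFrame({"f" : [0.1, 0.2, 0.3, 0.4]})
toy["f1"] = toy["f"].shift(1) # value at t-1
toy["f2"] = toy["f"].shift(2) # value at t-2
print(toy)
#      f   f1   f2
# 0  0.1  NaN  NaN
# 1  0.2  0.1  NaN
# 2  0.3  0.2  0.1
# 3  0.4  0.3  0.2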
def form_data(df, window) :
    """
    builds the (data, label) arrays fed to the LSTM
    from a dataframe prepared by compute
    """
    pred = 1 # we predict the next point
    df = compute(df, window)
    data = []
    label = []
    l = len(df)
    for rang in range(l - window - pred + 1) :
        local_data = []
        local_label = []
        # gather, for each feature, its window of past values (oldest first)
        for feature in input_features :
            local_data.append(list(df.loc[rang + window - 1, feature::len(input_features)])[::-1])
        # the label is the value of the output feature at the next time step
        for feature in output_features :
            if pred > 1 :
                local_label.append(list(df.loc[rang + window, feature : feature + str(pred - 1): len(input_features)])[::-1])
            elif pred == 1 :
                local_label.append(list(df.loc[rang + window, feature : feature]))
        data.append(local_data)
        label.append(local_label)
    # final shapes : (n_samples, window, n_features) for data, (n_samples, 1, 1) for label
    data = np.array([np.array(sample).T for sample in data])
    label = np.array(label)
    return data, label
data_train, labels_train = form_data(df_train, sample_size)
data_test, labels_test = form_data(df_test, sample_size)
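As a quick sanity check, the arrays returned by form_data should have the following shapes (one sample per sliding window):
print(data_train.shape) # (len(df_train) - sample_size, sample_size, input_dim)
print(labels_train.shape) # (len(df_train) - sample_size, 1, 1)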
Training¶
This step trains the LSTM: the model has to learn what normal data looks like, so that it becomes able to predict future points when the data is normal.
After a session of training, we test the model on test data.
We change the parameters to minimize the loss and obtain the most accurate model. The main parameters to modify are:
- sample_size : the size of the window (number of points taken to calculate the next one)
- batch_size : the number of samples the algorithm processes at each iteration
- learning_rate : the learning rate of the optimizer. It trades off accuracy against training time (the more slowly you go forward, the more accurate you are).
- nb_layers : the number of layers in the neural network
- hidden_size : the size of the LSTM output
- loss_func : the function used to calculate the difference between the prediction and the real signal
- optimizer : the function that we use to modify the weights and biases of the model (typically stochastic gradient descent or Adam optimizer)
- epoch : the number of times the model is trained on the dataset
After this tuning step, we chose the Adam optimizer and the following hyperparameters:
#Hyperparameters
sample_size = 20
batch_size = 100
num_workers = 8 #improves the speed of the training step
nb_layers = 1
learning_rate = 0.0001
hidden_size = 10
loss_func = th.nn.MSELoss()
min_epochs = 20
max_epochs = 60
The class FeaturesDataset wraps the data and the labels into a PyTorch Dataset, serving (sample, label) pairs.
class FeaturesDataset(Dataset) :
def __init__(self, data, labels) :
self.data = [th.tensor(values) for values in data]
self.labels = [th.tensor(label) for label in labels]
def __len__(self) :
return len(self.labels)
def __getitem__(self, idx) :
return self.data[idx], self.labels[idx]
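The training loop below relies on train_dataloader and test_dataloader, whose construction is not shown in this notebook; a minimal sketch, following the same pattern as the test dataloaders used later:
train_set = FeaturesDataset(data_train, labels_train)
test_set = FeaturesDataset(data_test, labels_test)
train_dataloader = DataLoader(train_set, batch_size = batch_size, shuffle = True, num_workers = num_workers) # shuffling is safe since every sample already contains its whole window
test_dataloader = DataLoader(test_set, batch_size = batch_size, shuffle = False, num_workers = num_workers)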
The LSTM model¶
The class below is the actual LSTM model. We have decided to use the LSTM module from the PyTorch library, followed by a non-linear function (ReLU), then a linear layer, and finally a Tanh function to return the result. This alternation of linear and non-linear functions improves the accuracy of the model; notably, it adds more parameters that can be optimized to fit the normal data more accurately.
All these choices have been made empirically to minimize the train and test losses.
writer = SummaryWriter()
class Model(nn.Module) :
def __init__(self, input_dim, output_dim) :
super(Model, self).__init__()
self.LSTM = th.nn.LSTM(input_dim, hidden_size, nb_layers, batch_first = True, bidirectional = False)
self.non_lin = th.nn.ReLU()
self.lin = th.nn.Linear(hidden_size, output_dim)
self.lr = learning_rate
self.loss_function = loss_func
    def forward(self, x) :
        x, _ = self.LSTM(x)        # x : (batch, window, hidden_size)
        x = x[:, -1, :].squeeze()  # keep only the output of the last time step
        x = self.non_lin(x)
        x = self.lin(x)
        return th.nn.Tanh()(x)
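A quick shape check of the forward pass (a sketch with a random batch):
sanity_model = Model(input_dim = input_dim, output_dim = output_dim).to(device)
dummy = th.randn(batch_size, sample_size, input_dim).to(device)
print(sanity_model(dummy).shape) # th.Size([batch_size, output_dim]), i.e. one predicted point per sample
# note : the squeeze() in forward is a no-op here since batch_size > 1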
def training() :
    """
    trains the LSTM and evaluates it on the test set every other epoch
    """
    model = Model(input_dim = input_dim, output_dim = output_dim).to(device)
    optimizer = th.optim.Adam(model.parameters(), lr = learning_rate)
    test_loss_global = 10 # arbitrarily large initial value
    epoch = 0
    arret = False
    while not arret :
        epoch += 1
        batch_loss = 0
        for data, target in train_dataloader :
            data = data.float().to(device)
            target = target.squeeze().unsqueeze(1).to(device)
            optimizer.zero_grad()
            output = model(data) # one prediction per sample of the batch
            loss = loss_func(output.float(), target.float())
            loss.backward()
            optimizer.step()
            batch_loss += loss.item()
        if epoch % 2 == 0 :
            writer.add_scalar("train_loss", batch_loss, epoch)
            print(f"epoch = {epoch}")
            test_loss = 0
            with th.no_grad() : # no gradients needed for evaluation
                for data, target in test_dataloader :
                    data = data.float().to(device)
                    target = target.squeeze().unsqueeze(1).to(device)
                    output = model(data)
                    loss = loss_func(output.float(), target.float())
                    test_loss += loss.item()
            print(test_loss)
            writer.add_scalar("test_loss", test_loss, epoch)
            # stop when the test loss starts increasing (early stopping) or after max_epochs
            if (test_loss > test_loss_global and epoch > min_epochs) or epoch > max_epochs :
                print(epoch)
                arret = True
            test_loss_global = test_loss
    writer.flush()
    writer.close()
    return model
We have decided not to include the actual training run in this notebook, since the code takes time to execute; the code above is nonetheless the one we used.
At the end of the training step, we had a test_loss of about $10^{-5}$. We saved this model in the file 'model_complet_lstm.pt', which we load below.
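For reference, the save step amounts to the following (a sketch, assuming training() returns the trained model as written above):
model = training()
th.save(model, 'model_complet_lstm.pt')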
A test on a normal set not in the training set¶
model = th.load('model_complet_lstm.pt')
Of course, the test step is done on data distinct from the training data. For these tests, we use the following functions: model_prediction to run the model and show_curve_diff to plot the results.
def model_prediction(model, dataloader) :
    """
    applies the LSTM model to a dataset
    returns the actual curve and the predicted curve
    """
    courbe_theo = []
    courbe_calc = []
    with th.no_grad() : # inference only
        for data, target in dataloader :
            data = data.float().to(device)
            target = target.to(device)
            output = model(data)
            # flatten the batched outputs and targets into flat lists of points
            for i in output :
                for j in i :
                    courbe_calc.append(j.item())
            for k in target :
                for j in k :
                    courbe_theo.append(j.item())
    return courbe_theo, courbe_calc
def show_curve_diff(courbe_theo, courbe_calc, threshold = 0.001, tolerance = 5, alerts = []) :
"""
plots the superposition of the actual and predicted curve
and the difference between these two curves compared with the threshold applied to the data
"""
plt.figure(figsize = (30, 6))
for rang, valeur in enumerate(alerts) :
if valeur == 2 :
plt.vlines(rang, -.5, .5, colors='r')
elif valeur == 1 :
plt.vlines(rang, -.5, .5, colors='orange')
plt.plot(courbe_theo, label = 'Real signal')
plt.plot(courbe_calc, label = 'Prediction')
plt.legend()
plt.show()
plt.figure(figsize = (30, 6))
diff = np.array(courbe_theo) - np.array(courbe_calc)
plt.plot(diff, label = 'Difference between real and prediction')
limit = threshold*tolerance
plt.hlines([-limit, limit], 0, len(courbe_theo), color = 'r', label = 'Threshold * tolerance')
plt.legend()
plt.show()
val = FeaturesDataset(data_test, labels_test)
test_dataloader_normal = DataLoader(val, batch_size = batch_size, shuffle = False, num_workers = num_workers)
show_curve_diff(*model_prediction(model, test_dataloader_normal))
We can see in the figure above that the curve predicted by the LSTM fits the real curve very well. A few peaks go below or above the limit that we set (see below for how this value was determined), but these false alarms are tolerable since they are rare.
Testing on abnormal sets¶
data_l_test_1, labels_l_test_1 = form_data(df_test_1, sample_size)
data_l_test_2, labels_l_test_2 = form_data(df_test_2, sample_size)
data_l_test_3, labels_l_test_3 = form_data(df_test_3, sample_size)
val_1 = FeaturesDataset(data_l_test_1, labels_l_test_1)
val_2 = FeaturesDataset(data_l_test_2, labels_l_test_2)
val_3 = FeaturesDataset(data_l_test_3, labels_l_test_3)
test_dataloader_1 = DataLoader(val_1, batch_size = batch_size, shuffle = False, num_workers = num_workers)
test_dataloader_2 = DataLoader(val_2, batch_size = batch_size, shuffle = False, num_workers = num_workers)
test_dataloader_3 = DataLoader(val_3, batch_size = batch_size, shuffle = False, num_workers = num_workers)
show_curve_diff(*model_prediction(model, test_dataloader_1))
show_curve_diff(*model_prediction(model, test_dataloader_2))
show_curve_diff(*model_prediction(model, test_dataloader_3))
When tested on abnormal data, the LSTM's prediction does not fit the actual signal. Thus, we declare a set of data abnormal when the difference between the real and the predicted signals exceeds a certain threshold multiplied by a tolerance.
We take a threshold of $10^{-3}$ since the prediction error on normal data stays below $10^{-3}$.
We explain later in this notebook how the default value of the tolerance (which can be modified by the user) was set.
IV. Global solution (comparing prediction and real signal)¶
Raising an alarm when encountering an anomaly¶
The following section defines and implements the criteria for raising an alarm when an anomaly is detected.
The steps are the following:
- Run the neural network model on the last 5000 points (length_input)
- Compare the predicted signal with the actual one by computing the difference between the two
- Label the points where the difference between the two signals is greater than threshold*tolerance
- In order not to label as normal an area closely preceded by abnormal data, we use a buffer_time (agreed jointly with the other groups' algorithms) corresponding to the length of the area over which we maintain an alert status after an anomaly. The anomaly will be shown in RED and the hazardous area (buffer) in ORANGE.
- Finally, in order not to maintain an alert status on normal data, the orange area is displayed only when the preceding period of length buffer_time (i.e. the last 3.5 days) contains at least 12 anomalous points (i.e. 2 hours) cumulated.
threshold = 0.001
length_input = 5000 + sample_size # 5000 predicted points; the first sample_size points only serve as inputs
buffer_time = 504 # 504 points, i.e. 3.5 days at one point every 10 minutes
def local_anomaly(predicted_signal, actual_signal, threshold, tolerance) :
"""
returns a boolean indicating if the given signals differ by more than threshold*tolerance anywhere on the period
"""
predicted_signal = np.array(predicted_signal)
actual_signal = np.array(actual_signal)
diff = actual_signal - predicted_signal
max_diff = max(abs(diff))
if max_diff > threshold * tolerance :
return True
return False
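For instance, with the default threshold of 0.001 and a tolerance of 5, the alert fires as soon as the pointwise difference exceeds 0.005:
print(local_anomaly([0.50], [0.51], 0.001, 5)) # True : |diff| = 0.01 > 0.005
print(local_anomaly([0.50], [0.503], 0.001, 5)) # False : |diff| = 0.003 <= 0.005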
def anomaly(dataframe, length_input, tolerance = 5, threshold = 0.001, plot = True) :
"""
    returns the actual curve, the predicted curve and a list of ints
    giving the colour of every point : 0 (green), 1 (orange) or 2 (red)
"""
dataframe_calc = dataframe.loc[len(dataframe) - length_input:, :]
dataframe_calc = dataframe_calc.reset_index(drop = True)
data, labels = form_data(dataframe_calc, sample_size)
inp = FeaturesDataset(data, labels)
dataloader = DataLoader(inp, batch_size = batch_size, shuffle = False, num_workers = num_workers)
courbe_theo, courbe_calc = model_prediction(model, dataloader)
    alerts = []
    # first pass : flag every point whose prediction error exceeds threshold * tolerance
    for point_calc, point_theo in zip(courbe_calc, courbe_theo) :
        if local_anomaly([point_calc], [point_theo], threshold, tolerance) :
            alerts.append(1)
        else:
            alerts.append(0)
    # second pass : anomalies become red (2) ; a point becomes orange (1) when the
    # preceding buffer_time window contains at least 12 red points
    for rang, valeur in enumerate(alerts) :
        if valeur == 1 :
            alerts[rang] = 2 # red
        elif rang >= buffer_time and alerts[rang - buffer_time:rang].count(2) >= 12 :
            alerts[rang] = 1 # orange
        else:
            alerts[rang] = 0 # green
if plot :
show_curve_diff(courbe_theo, courbe_calc, alerts = alerts, tolerance = tolerance, threshold = threshold)
return courbe_theo, courbe_calc, alerts
_,_,_ = anomaly(df.loc[80000:90000].reset_index(), length_input)
On this normal dataset, we can see that, logically enough, very few anomalies are detected, and none of them persists long enough to trigger the display of an orange alert area.
Performance in terms of accuracy¶
In this section, we evaluate the performance of our algorithm on a test set, df_test_perf, composed of both normal and abnormal data.
We evaluate this performance by computing the recall $\left(\frac{\text{true positives}}{\text{actual anomalies}}\right)$, the precision $\left(\frac{\text{true positives}}{\text{true positives} + \text{false positives}}\right)$ and the number of false negatives.
tolerance = 5
length_input = 92000 + sample_size # increased from 5000 to 92000 so that the whole (large) dataframe df_test_perf is evaluated
def transform_0_1(i) :
if i != 0 : return 1
return 0
def performances(df, tolerance) :
    label = list(df['labels'])
    # binarise the labels and drop the first sample_size points (only used as model inputs)
    label_0_1 = [transform_0_1(i) for i in label][sample_size:]
    nb_error = label_0_1.count(1)
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    true_negatives = 0
    courbe_theo, courbe_calc, alerts = anomaly(df, length_input=len(df), tolerance = tolerance)
    # alerts contains 0 (green), 1 (orange) or 2 (red); orange and red both count as alerts
    for prediction, real in zip(alerts, label_0_1) :
        if prediction == real :
            if prediction == 1:
                true_positives += 1
            else:
                true_negatives += 1
        else:
            if prediction == 0 :
                false_negatives += 1
            elif prediction == 1:
                false_positives += 1
            else: # prediction == 2 (red)
                if real == 1 :
                    true_positives += 1
                else:
                    false_positives += 1
    recall = true_positives / nb_error
    precision = true_positives / (true_positives + false_positives)
    return recall, precision, false_negatives
Tolerance : 1 --> Recall : 1.0 -- Precision : 0.262750285611208 -- False negatives : 0
Image(filename ='images_lstm/im1.png')
Tolerance : 2 --> Recall : 0.9993592384090805 -- Precision : 0.39346596028399466 -- False negatives : 14
Image(filename ='images_lstm/im2.png')
Tolerance : 3 --> Recall : 0.9885578287335804 -- Precision : 0.44596547737033365 -- False negatives : 250
Image(filename ='images_lstm/im3.png')
Tolerance : 4 --> Recall : 0.9860863197400339 -- Precision : 0.5091815754023586 -- False negatives : 304
Image(filename ='images_lstm/im4.png')
Tolerance : 5 --> Recall : 0.9813263764932033 -- Precision : 0.5623868852459016 -- False negatives : 408
Image(filename ='images_lstm/im5.png')
Tolerance : 6 --> Recall : 0.9753764474346652 -- Precision : 0.5923177409044165 -- False negatives : 538
Image(filename ='images_lstm/im6.png')
Tolerance : 7 --> Recall : 0.9602270126779258 -- Precision : 0.6091400034841182 -- False negatives : 869
Image(filename ='images_lstm/im7.png')
Tolerance : 8 --> Recall : 0.9572520481486567 -- Precision : 0.6098556640909754 -- False negatives : 934
Image(filename ='images_lstm/im8.png')
Tolerance : 9 --> Recall : 0.9553297633758983 -- Precision : 0.6111794331225111 -- False negatives : 976
Image(filename ='images_lstm/im9.png')
Below are summarised the recall, precision and number of false negatives obtained for the different values of the tolerance:

| Tolerance | Recall | Precision | False negatives |
|---|---|---|---|
| 1 | 1.000 | 0.263 | 0 |
| 2 | 0.999 | 0.393 | 14 |
| 3 | 0.989 | 0.446 | 250 |
| 4 | 0.986 | 0.509 | 304 |
| 5 | 0.981 | 0.562 | 408 |
| 6 | 0.975 | 0.592 | 538 |
| 7 | 0.960 | 0.609 | 869 |
| 8 | 0.957 | 0.610 | 934 |
| 9 | 0.955 | 0.611 | 976 |
The number of false negatives, which might seem significant at first sight, is not that critical, since many false negatives actually appear between alerts.
The table above justifies the choice of 5 as the default value of the tolerance (modifiable by the user).
This choice gives a $F_{1}$ score of : $$F_1 = \frac{2}{\frac{1}{recall}+\frac{1}{precision}} = 0.71$$
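This value can be checked directly from the table above:
recall, precision = 0.9813, 0.5624
print(2 / (1 / recall + 1 / precision)) # ≈ 0.715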
This $F_1$ score could be improved with other values of the tolerance, but we have decided to favour a low number of false negatives, since too high a false negative rate would be too risky.
Performance in terms of speed¶
In this section, the Y axis represents the position in the dataset; the X axis is only used for display. The red horizontal lines mark the labelled anomalies on the Y axis, and every blue point is an anomaly detected by the LSTM.
def speed_evaluation(tolerance) :
    """
    scans the whole dataset df in chunks of 50 points,
    records the positions where an alert (red or orange) is raised,
    and plots them against the labelled anomalies
    """
    stop = False
    count = 0
    lst = []
    while not stop :
        chunk = df[count:count + 50].reset_index()
        count += 50
        ano = anomaly(chunk, 50, tolerance = tolerance, plot = False)[2]
        if 2 in ano or 1 in ano :
            lst.append(count)
            print(count)
        if count >= 135000 :
            stop = True
    plt.scatter(np.arange(len(lst)), lst)
    label_index = label[label != 0].index
    plt.plot(lst)
    print(len(lst))
    plt.hlines(label_index, 0, 1400, color = 'r')
    plt.show()
Tolerance = 1¶
Image(filename ='images_lstm/rapidity1.png')
As the graph above illustrates, with a tolerance of 1 the LSTM flags some errors during normal periods. On the other hand, it is very efficient at detecting anomalies: many points fall inside the red zones and the blue curve stays in them for a long time.
The moment when detection begins can be estimated by looking at the first concentration of blue dots approaching each red zone. The results are quite similar for the 3 zones: the LSTM flags the anomaly very early, because it raises alerts very frequently.
Tolerance = 5¶
A tolerance of 5 gives better results, in particular by not flagging too many errors on normal data. The anomalies are this time detected about 500 points before the red zone. Since the computation takes a long time, only an image of the result is shown, but one may reproduce it with the function speed_evaluation.
Image(filename ='images_lstm/rapidity5.png')
V. Conclusion¶
To conclude, we have created an algorithm that uses an LSTM neural network to predict the signal and compare it with the actual one. When these two signals differ enough, an anomaly is declared.
We have decided (jointly with the other groups) to use a colour code to indicate abnormal areas (red) and hazardous areas (orange).
Performance¶
Our LSTM model has proven very effective on the test data of the given period. The areas that we had labelled as abnormal are clearly identified by the algorithm, since they appear either in red or in orange.
The parameters that we chose limit the number of false negatives without causing too many false positives.
With the chosen parameters, our algorithm has a recall of $0.98$ and a precision of $0.56$.
Limits¶
As explained above, the apparently significant number of false negatives is not such a problem, since these points are mostly accumulated in zones where the algorithm has already detected anomalies.
The precision could probably be improved, notably by training on more data, which would decrease the number of false alerts. However, favouring recall over precision was an engineering choice, since the phenomenon we intend to detect is rare but might have serious consequences.
Finally, this algorithm would probably need re-training on other datasets in order to adapt to different types of data, in particular if we want to generalize on other machines.