EMD¶
I. Implementation and display¶
Generate IMFs¶
from IPython.display import Image
In order to analyze the signal, we are going to apply an algorithm based on Empirical Mode Decomposition (EMD). It's a recursive method which squeezes the function to decompose inside of an envelope and computes a smooth mean curve which roughly corresponds to one of the original function's harmonics. Consider a function $x$ that we want to decompose using EMD. We compute its minimums and maximums, and interpolate between those points using a 3-degree polynomial:
Image("images_emd/imageemd1.png")
(Source : Empirical mode decomposition of field potentials from macaque V4 in visual spatial attention, PubMed, Hualou Liang, Steven L Bressler, Elizabeth A Buffalo, Robert Desimone)
Then, we substract the mean of those two functions to the original function $x$, and we continue until the resulting function as a mean value of $0$. We call this function an IMF (Intrinsic Mode Function), and we continue this process until the resulting IMF is monotonic :
Image("images_emd/imageemd2.png")
(Source : Empirical mode decomposition of field potentials from macaque V4 in visual spatial attention, PubMed, Hualou Liang, Steven L Bressler, Elizabeth A Buffalo, Robert Desimone)
Subtracting each mode from the function one by one allows us to decompose the function into a sum of orthogonals harmonics on which we can then focus. The next step is focusing on a harmonic of the desired pseudo-period, meaning one the period of which is approximately equal to that of the periodic error to account for.
We are going to import all useful functions to compute the EMD algorithm and display IMFs, here we use the temperature feature as an example, just before an anomaly :
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from PyEMD import EMD
#The detrended data
df = pd.read_csv('clean_data_detrended.csv')
debut = 50000
delta = 5000
fin = debut + delta
#We want to display the temperature feature, where the anomalies are the most visible
t=np.arange(debut,fin+1)
s=np.array(df.loc[debut:fin,'Température palier étage 1'].astype(float))
def generateIMF(t,s):
return EMD().emd(s,t)
def displayIMF(imf):
plt.figure(figsize=(50,20))
N = imf.shape[0]+1
nbre_inf = N #The number of IMFs we want to display, it stops when there is no more
plt.subplot(nbre_inf+1,1,1)
plt.plot(df.loc[debut:fin,'Température palier étage 1'].astype(float))
plt.plot(df.loc[debut:fin,'labels'])
plt.title("Signal original")
plt.xlabel("Time [s]")
i=0
for n, imf in enumerate(imf):
plt.subplot(nbre_inf+1,1,n+2)
plt.title("IMF "+str(n+1))
plt.xlabel("Time [s]")
plt.plot(t,imf,'g')
i+=1
if i>=nbre_inf:
break
Here is an example on the detrended signal, at the beginning of an anomaly, we can clearly see on the two first IMFs that the problem is visible : (as the signal is detrended, it is hard to see the problem on it but we have the labels to be sure).
IMF = generateIMF(t,s)
displayIMF(IMF)
Hilbert transform on IMFs¶
We can already see that the problem is visible on the first IMF of the signal, but this is not enough. One of the main benefits of the EMD is that we can apply the Hilbert transform on IMFs quite easily. The Hilbert Transform is defined as the convolution product of the signal and $h$ defined as $h(t) = \frac{1}{\pi t}$.
The Hilbert transform create a new signal $y(t) = H(x)(t)$, we can then create an analytical signal: $z(t) = x(t) + iy(t)$.
We can define the amplitude $a(t)$ of the signal as the module of $z(t)$, and the instantaneous frequency $f(t)$ of the signal as the derivative of its argument.
We are going to see that the anomalies on $x(t)$ are clearly visible on the amplitude, but the result on the frequency is less obvious, even if we are going to use it later on the Hilbert spectrum. Let's compute the amplitudes $a_k(t)$ of the IMFs, we first import necessary packages to compute Hilbert transform:
from scipy.misc import derivative
from scipy.signal import hilbert,chirp
def hilb_mod(signal):
z = hilbert(signal)
return abs(z)
def hilb_freq(signal, dx):
z = hilbert(signal)
theta = np.angle(z)
omega = np.diff(np.unwrap(theta))/dx
return omega
def displaya(imf, nbre_imf = IMF.shape[0]+1):
plt.figure(figsize=(50,20))
plt.subplot(nbre_imf+1,1,1)
plt.plot(df.loc[debut:fin,'Température palier étage 1'].astype(float))
plt.plot(df.loc[debut:fin,'labels'])
plt.title("Signal x(t)")
plt.xlabel("Time [s]")
i=0
for n, imf in enumerate(IMF):
plt.subplot(nbre_imf+1,1,n+2)
plt.title("IMF "+str(n+1))
plt.xlabel("Time [s]")
plt.plot(t, hilb_mod(imf), 'g')
i+=1
if i>=nbre_imf:
break
IMF = generateIMF(t,s)
taille = IMF.shape[0]+1
displaya(IMF, taille)
The problem in the original signal is now clearly visible with a rise in the amplitude ! Let's now display the instantaneous frequencies of the IMFs :
def displayf(imf, nbre_imf = IMF.shape[0]+1):
plt.figure(figsize = (50,20))
plt.subplot(nbre_imf+1,1,1)
plt.plot(df.loc[debut:fin,'Température palier étage 1'].astype(float))
plt.plot(df.loc[debut:fin,'labels'])
plt.title("Signal x(t)")
plt.xlabel("Time [s]")
i=0
for n, imf in enumerate(IMF):
plt.subplot(nbre_imf+1, 1, n+2)
plt.title("IMF " + str(n+1))
plt.xlabel("Time [s]")
plt.plot(t[:-1], hilb_freq(imf,1), 'g')
i += 1
if i >= nbre_imf:
break
IMF = generateIMF(t,s)
taille = IMF.shape[0]+1
displayf(IMF, taille)
There is no obvious pattern in frequencies, but they are going to be very useful to display the Hilbert spectrum of the signal, which is a digest of all informations.
Hilbert spectrum¶
The Hilbert spectrum displays the amplitude of the signal function of the period and the time, it is a 3-dimensional graph, so we've chosen to display amplitude with different colors. We are going to define the function to display this spectrum :
# Useful function to invert lines
def inv_ligne(matrice):
L = []
for i in range(len(matrice)):
l = []
for j in range(len(matrice[0])):
l.append(matrice[i][j])
L.append(l)
L.reverse()
return np.array(L)
mpl.rcParams['figure.figsize']=[15,10]
def hilb_spec(s,t,img_height):
mi_f, ma_f = 0,0
n = len(t)-1
image = np.zeros((img_height,n//20+1))
freq = hilb_freq(s,1)
amp = hilb_mod(s)[:-1]
ma_amp = np.max(amp)
mi_f = min(mi_f,np.min(freq))
ma_f = max(ma_f,np.max(freq))
for i in range(n):
t_bin = i//20
if freq[i] <= 0:
continue
p_bin = int(np.log2(1/freq[i])*10)
if p_bin >= img_height or p_bin < 0:
continue
image[p_bin][t_bin] += ma_amp-amp[i]
image = inv_ligne(image)
return image, freq, amp, mi_f, ma_f
def hilb_spec_imf(s,t,img_height,imfs):
n = len(t) - 1
im_fin = np.zeros((img_height,n//20+1))+0.01
i=0
for h in imfs:
i += 1
#if i>3 :
# break
image, freq, amp, mi_f, ma_f = hilb_spec(h,t,img_height)
im_fin += image
plt.imshow(np.log(im_fin),cmap = plt.cm.plasma)
plt.ylabel('period')
plt.xlabel('time')
y_ticks_location = np.arange(0,img_height,20)
y_ticks_labels = np.arange(img_height,0,-20)
plt.yticks(y_ticks_location,y_ticks_labels.astype(int))
x_ticks_location = np.linspace(0,250,10)
plt.xticks(x_ticks_location,np.linspace(debut,fin,10).astype(int))
cbar=plt.colorbar()
cbar.set_label('amplitude')
We can now display the Hilbert spectrum !
IMF = generateIMF(t,s)
hilb_spec_imf(s,t,300,IMF)
Let's compare it with a period with no anomalies :
# We display from 0 to 5000
debut = 0
fin = debut + delta
t=np.arange(debut,fin+1)
s=np.array(df.loc[debut:fin,'Température palier étage 1'].astype(float))
IMF = generateIMF(t,s)
hilb_spec_imf(s,t,300,IMF)
It's not obvious on this spectrum, but we can see a slight rise in amplitude in the high periods modes, which corresponds to the actual anomalies !
II. Detection of the anomalies¶
Using a threshold¶
We are now going to explain how we can detect the anomalies in the original signal, as we saw above, the amplitude graph of the analytical signal $z(t) = x(t) + iy(t)$ seems to be the best indicator of oscillations.
One of the first possible ideas to detect a rise in amplitude is to simply put a threshold with a value of 0.1 : any value above would be considered as an error. Let's try it on a period without problems first :
debut = 0
delta = 5000
fin = debut + delta
t=np.arange(debut,fin+1)
s=np.array(df.loc[debut:fin,'Température palier étage 1'].astype(float))
plt.plot(t,s)
[<matplotlib.lines.Line2D at 0x7f4e2b30f090>]
We see oscillations, but according to the labels, this zone is without problems, let's see the amplitude of the first IMF (it's the most visible one on short periods of time).
IMF = generateIMF(t,s)
displaya(IMF,1)
# We plot a threshold of 0.1
plt.plot(t,np.ones(len(t))*0.1)
[<matplotlib.lines.Line2D at 0x7f4e2aeb2fd0>]
The threshold doesn't detect any anomaly, which is good news, let's now test it on the beginning of a problem :
debut = 50000
delta = 5000
fin = debut + delta
t=np.arange(debut,fin+1)
s=np.array(df.loc[debut:fin,'Température palier étage 1'].astype(float))
plt.plot(t,s)
[<matplotlib.lines.Line2D at 0x7f4e2b221890>]
IMF = generateIMF(t,s)
displaya(IMF,1)
# We plot a threshold of 0.1
plt.plot(t,np.ones(len(t))*0.1)
[<matplotlib.lines.Line2D at 0x7f4e2afcbc10>]
The threshold works perfectly and detects the problem! But it's not convincing as the threshold was chosen empirically and is not linked to the intrinsic structure of the signal: any change in the temperature measuring would make the threshold useless, so we have to develop a more robust method.
An improvement of the threshold method¶
The idea of a threshold is good and works quite well, but we have to adapt it, to link it to the signal itself and not to empirical observations.
To do so, we calculated the mean $\mu$ and standard deviation $\sigma$ on "healthy" data, in order to detect a change of distribution when there is an anomaly. We introduce a new parameter $\alpha$, and we choose to detect a problem if the amplitude becomes superior to $\mu + \alpha \sigma$. This way, the threshold is adapted to the data itself, and isn't so vulnerable to change in the way we measure temperature.
We are going to calculate $\mu$ and $\sigma$ on all healthy values using the following functions:
def mean_std_sign(s,t):
#This function computes the mean and standard deviation of the amplitude of the first IMF of a signal,
#the slicing is used to get rid of side effects
imfs = EMD().emd(s,t)
imf1 = imfs[0]
module = hilb_mod(imf1)[50:fin-debut-50]
return (np.mean(module),np.std(module))
def mean_std_donn(l, begin, end, delta, shift = 1000):
#This function gives the moving average of the amplitude of the first IMF of a signal l(t)
debut = begin
fin = debut + delta
moyenne = []
ecart = []
for i in range((end - delta)//shift):
t = np.arange(debut,fin+1)
s = np.array(l.loc[debut:fin,'Température palier étage 1'].astype(float))
if np.all(l.loc[debut:fin,'labels'] == np.zeros((fin-debut+1))) == True: #Here we only consider slots where there is no problem
p = mean_std_sign(s,t)
moyenne.append(p[0])
ecart.append(p[1])
debut += shift
fin += shift
return moyenne, ecart
The second function is used to compute the moving average of a signal with a specified shift, begin, end and delta. This way, it will be more precise and adapted to the real situation.
We are going to see how this process is applied to real-time data in the next part, let's just compute the moving mean of all the data (it's an array).
total_mean, total_std = mean_std_donn(df, 0, 130000, 5000)
plt.plot(total_mean)
plt.plot(total_std)
[<matplotlib.lines.Line2D at 0x7f4e2af3ff50>]
Now, we have trained our algorithm ! We know the "healthy" values of the mean, and we can define the global mean and standard deviation of the signal with :
globalMean = np.mean(total_mean)
globalStd = np.mean(total_std)
print(globalMean, globalStd)
0.02646221315125416 0.022649730749175017
Simulation of real-time process¶
Of course, during the real-time process, data will arrive to the algorithm point after point, so how are we going to deal with it? And how do we simulate it to test our method?
This choice is the result of a compromise between the precision of the algorithm and the speed of detection : we have to detect the problem before it gets out of control. But a lot of points are required to detect a rise in amplitude.
We chose to use a delta of 5,000 points, it's enough to detect an anomaly in a reasonable amount of time, and we set the shift the detection window to 144 points (equivalent to one day, we refresh the data daily).
The process is very simple, for each detection window, we compute the mean and standard deviation and apply the classification we explained above : if $a(t) > \mu + \alpha \sigma$, then we add a $1$ to the predict array : this is the alarm.
We are going to simulate the real-time process in the same way we calculated the moving average above : a detection window shifting through the data as if the data was coming into the algorithm day by day. We created two level of alarm : $2$ means that the problem is here, $1$ is a state of alert after a recent detection of a problem and $0$ is the absence of problem.
def alarme_EMD(l,requested_column, begin, end, alpha, delta = 5000, shift = 144, N = 24,buffer = 500):
debut = begin
fin = delta + begin
predict = [0 for i in range(delta-shift)] #the delta-shift first points of the data are only usd to predict the others so labelled at 0
for i in range((end-delta-begin)//shift):
t = np.arange(debut,fin + 1)
s = np.array(l.loc[debut:fin,requested_column].astype(float))
imfs = EMD().emd(s,t)
module = hilb_mod(imfs[0])
for j in range(delta-shift,delta):
if module[j] > globalMean + alpha*globalStd:
predict.append(2)
else:
predict.append(0)
debut += shift
fin += shift
for k in range(1,len(predict)-N-1):
if predict[k] == 2 and predict[k-1] == 0:
if predict[k:k+N] != [2 for i in range(N)]:
predict[k:k+N] = [0 for i in range(N)]
for j in range(buffer,len(predict)):
if predict[j] == 0:
if 2 in (predict[j - buffer : j]) :
predict[j] = 1
return predict #In the real process we only return the last "delta" values but here, to visualize the alarme
#on the whole signal, we don't crop it
Now let's try this algorithm on all the data and compare it to the labels !
plt.figure(figsize=(50,20))
debut = 0
fin = 135000
plt.plot(df.loc[debut:fin,'labels'])
plt.plot(df.loc[debut:fin,'Température palier étage 1'].astype(float));
l = alarme_EMD(df,'Température palier étage 1', begin = debut, end = fin, alpha = 1.8, delta = 5000 , shift = 144 , N = 24, buffer = 500)
plt.plot(l)
plt.xticks(np.arange(debut, fin, 2000.0))
plt.grid()
Globally, we have some false positive alerts, but all the anomalies are detected relatively quickly by our algorithm ! We also must take into account the gapfilling flaws that are seen as anomalies... As a conclusion, this function is satisfying and allows us to detect oscillations problems.
We are now going to be more precise about the choice of parameters like $\alpha$, $N$... And estimate the precision and recall of our algorithm.
III. Choice of parameters and the error¶
The method to estimate the precision and recall is very simple : we compare the count of real problems and the count of detections by our algorithm. We recall the definition of the accuracy and the recall :
With :
- $T_P$ the number of true positive
- $F_N$ the number of false negative
- $F_P$ the number of false positive
Considered this, we are now going to implement two functions : one to choose the best $\alpha$ and one to choose the best $N$ by displaying the recall and precision for different values.
def choice_parameter(alpha, N):
predict = alarme_EMD(df, 'Température palier étage 1', begin = 0, end = 135000, alpha = alpha, delta = 5000 , shift = 144, N = N, buffer = 500)
vrais_pos_det = []
labels = np.array(df.loc[0:135000,'labels'])
for i in range(0,134000):
if predict[i] == 2:
predict[i] = 1
if labels[i] < 0:
labels[i] = 0
if predict[i] == 1 and labels[i] != 0:
vrais_pos_det.append(1)
if labels[i] != 0:
labels[i] = 1
recall = sum(vrais_pos_det)/sum(labels)
prec = sum(vrais_pos_det)/sum(predict)
return recall, prec
Let's set the $N$ value to 20 and choose the best value of $\alpha$ !
#for i in range(10):
# recall, prec = choice_parameter(1.5 + 0.1*i, 20)
# print(recall, prec)
#for i in range(10):
# recall, prec = choice_parameter(2.5 + 0.1*i, 20)
# print(recall, prec)
It seems that the best compromise is for $\alpha = 2,8$. Let's now set the value of $\alpha$ and choose the best value of $N$.
#for i in range(10):
# recall, prec = choice_parameter(2.8, 15 + 2*i)
# print(recall, prec)
Here, the best compromise is $N = 27$, let's now make a final test to refine the value of $\alpha$ :
#for i in range(10):
# recall, prec = choice_parameter(2.5 + 0.1*i, 27)
# print(recall, prec)
choice_parameter(2.8, 27)
(0.7950337150808091, 0.5263108171941426)
Finally, we set the values to : $\alpha = 2,8$ and $N = 27$, we have a precision of $0,52631$ and a recall of $0.79503$.
Now let's see the results of these new parameters :
plt.figure(figsize=(50,20))
debut = 0
fin = 135000
plt.plot(df.loc[debut:fin,'labels'])
plt.plot(df.loc[debut:fin,'Température palier étage 1'].astype(float));
l = alarme_EMD(df, 'Température palier étage 1', begin = debut, end = fin, alpha = 2.8, delta = 5000 , shift = 144 , N = 27, buffer = 500)
plt.plot(l)
plt.xticks(np.arange(debut, fin, 2000.0))
plt.grid()
### Hilbert spectrum modified to output the adequate red/orange/green zones
mpl.rcParams['figure.figsize']=[15,10]
def hilb_spec_imf(s,t,img_height):
imfs=EMD().emd(s,t)
n=len(t)-1
im_fin=np.zeros((img_height,n//20+1))+0.01
i=0
for imf in imfs:
i+=1
#if i>3 :
# break
image, freq, amp, mi_f, ma_f = hilb_spec(imf,t,img_height)
im_fin += image
predict=alarme_EMD(df,'Température palier étage 1', begin = debut-5000, end = fin, alpha = 2.2, delta =5000 , shift =144 , N =27, buffer=500 )[5000:]
for i in range(len(predict)):
if predict[i] ==2:
im_fin[:, i//20]+=0.05
elif predict[i] ==1:
im_fin[:, i//20]+=0.005
plt.imshow(np.log(im_fin),cmap=plt.cm.inferno)
plt.ylabel('period')
plt.xlabel('time')
y_ticks_location = np.arange(0,img_height,20)
y_ticks_labels = np.arange(img_height,0,-20)
plt.yticks(y_ticks_location,y_ticks_labels.astype(int))
x_ticks_location = np.linspace(0,250,10)
plt.xticks(x_ticks_location,np.linspace(debut,fin,10).astype(int))
cbar=plt.colorbar()
cbar.set_label('amplitude')
plt.savefig('try')
debut=29000
fin=34000
t=np.arange(debut,fin+1)
s=np.array(df.loc[debut:fin,'Température palier étage 1'].astype(float))
hilb_spec_imf(s,t,250)
plt.figure(figsize=(10,10))
debut = 29000
fin = 34000
plt.plot(df.loc[debut:fin,'labels'])
plt.plot(df.loc[debut:fin,'Température palier étage 1'].astype(float));
l = alarme_EMD(df,'Température palier étage 1', begin = debut-5000, end = fin, alpha = 1.8, delta = 5000 , shift = 144 , N = 24, buffer = 500)[-5000:]
t=np.arange(debut,fin)
plt.plot(t[-len(l):],l)
plt.xticks(np.arange(debut, fin, 2000.0))
plt.grid()