A Few Theoretical Ideas
13 January 2022
Our methods build on a special data structure called the matrix profile, which computes the pairwise distances between subsequences of a given window length and records a list of ‘nearest neighbour indices’ (the subsequences that match each other most closely in the time series); these indices play an important role in semantic segmentation. Each nearest neighbour index defines an ‘arc’ (like an arrow) from one data point to another, and for each data point we can count the number of arcs crossing above it. When the statistical properties of the generating random process change significantly over a period of time, the number of arcs crossing over points in and around the transition drops sharply, since a subsequence from the first regime is unlikely to be the nearest neighbour of a subsequence from the second regime! After dividing by a correction factor (accounting for the fact that, in the baseline random-neighbours scenario, we naturally expect more crossovers for data points in the center than for data points at the extremities), we obtain the Corrected Arc Curve (CAC), which ‘reacts’ numerically to perceived regime changes. Even better, the matrix profile gives interpretable meaning to the anomalous transitions it detects. Contrast this with the difficult-to-interpret black-box methods popularised by neural learning, which are poorly suited to high-risk operations (such as the building data set we consider in this blog post), where understanding how a model categorises failures/anomalies is almost as important as the categorisation itself. We refer to https://stumpy.readthedocs.io/en/latest/, in particular Tutorials/Semantic Segmentation, for more details on the theoretical foundations of the matrix profile.
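To make this concrete, here is a minimal sketch of how the matrix profile and CAC can be computed with STUMPY. The toy signal, the window length `m`, and the two-regime setting are illustrative assumptions, not the settings used on the building data.

```python
import numpy as np
import stumpy

# A toy signal with a change in behaviour halfway through (illustrative only).
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 1, 500).cumsum()])

m = 50  # subsequence window length (placeholder value)
mp = stumpy.stump(signal, m)  # matrix profile; column 1 holds the nearest-neighbour indices

# FLUSS converts the nearest-neighbour indices into the Corrected Arc Curve (CAC)
# and suggests regime boundaries; here we ask for a single change point (2 regimes).
cac, regime_locations = stumpy.fluss(mp[:, 1], L=m, n_regimes=2, excl_factor=1)
print(regime_locations)  # index where the CAC suggests a regime change
```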
Given a single time-series signal, we can apply the matrix profile, obtain the CAC and figure out where the transition occurs from this single source of information. However, information from a single CAC is often noisy by itself - see, for example, Figure 3, where we can just about make out three main ‘peaks’, but with a high degree of uncertainty. We have an array of 24 sensors in our data - how can we use this extra data to ensemble predictions together and increase stability?
Figure 3: CAC from a single signal. We can just about see three peaks with large magnitudes (and a false positive on the left), but these features are not distinct at all.

The next piece in the puzzle is the Wasserstein barycenter, a method from the theory of optimal transport used for combining information from multiple probability distributions while retaining the most important features from each distribution. The Wasserstein barycenter of a set of distributions $\mu_1, \dots, \mu_n$ is the distribution minimising the weighted sum of squared Wasserstein distances,

$$\mu^* = \arg\min_{\mu} \sum_{i=1}^{n} \lambda_i \, W_2(\mu, \mu_i)^2,$$

where each $W_2(\mu, \mu_i)$ measures how much ‘energy’ it takes to move the distribution $\mu$ to the distribution $\mu_i$, and the $\lambda_i$ are adjustable weights. A nice image that emphasises why we need methods beyond naive Euclidean averaging of anomalous features is Figure 4: although each distribution contains the same anomalous spike feature, the time offsets between the data sets create destructive interference after averaging. Comparing Figure 3 and Figure 5, the essential peaks become more pronounced and smoother after taking the barycenter. Effectively working with sensor array data means being able to combine the information from each sensor while retaining the most important features - this is where the Wasserstein barycenter comes in!

Figure 4: Naive Euclidean averaging destroys the main features of the anomalous spike in each distribution. Image credit: https://github.com/fpetitjean/DBA

Figure 5: The same CAC as in Figure 3, but barycentered over all 24 sensors. Notice the much clearer and smoother peaks. We can now confidently note the locations of possible regime changes.
For a (more formal) introduction to the matrix profile, see the STUMPY documentation here. STUMPY is a code package implementing the matrix profile with efficient algorithms, and it has been invaluable in producing the results of this project. For the Wasserstein barycenter, the library Python Optimal Transport (see https://pythonot.github.io/) was likewise essential to our work and provides an introduction to the use of optimal transport in data mining contexts.
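As an illustration, here is one way such a barycenter might be computed with POT's entropic-regularised solver. Normalising the CACs into probability distributions, the uniform weights, and the regularisation strength are assumptions made purely for this sketch.

```python
import numpy as np
import ot  # Python Optimal Transport

# Suppose `cacs` holds one CAC per sensor, stacked column-wise: shape (n_points, n_sensors).
# Here we fill it with random positive values purely as a stand-in.
n_points, n_sensors = 1000, 24
cacs = np.abs(np.random.rand(n_points, n_sensors))
cacs /= cacs.sum(axis=0, keepdims=True)  # normalise each CAC into a probability distribution

# Ground cost between positions on the time axis (squared Euclidean), rescaled for stability.
positions = np.arange(n_points, dtype=float).reshape(-1, 1)
M = ot.dist(positions, positions)
M /= M.max()

weights = np.full(n_sensors, 1.0 / n_sensors)  # the lambda_i above, taken uniform here
barycenter = ot.bregman.barycenter(cacs, M, reg=1e-2, weights=weights)
```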
Algorithm Description
Let's put all of these ingredients together into a coherent data pipeline. For each input signal, compute its VMD with five modes (so each input returns five output time series, one per mode). Apply the Welch transform to each mode (we are now in the frequency domain), then take the log and re-scale (this is just for smoothing purposes). Concatenate the transformed data belonging to the same sensor and the same variational mode into one long time series. We now compute the matrix profiles and CACs of each concatenated sequence. Each of these CACs will 'respond' in some way to a perceived anomalous transition, and we wish to combine these responses together. This is where the Wasserstein barycenter comes in! We barycenter the variational modes together (so 5 VMDs become a single time series) and finally compute the barycenter of the data over the sensors (so 24 sensors' worth of data becomes a single time series). Peaks of this final CAC correspond to segments of high 'disruption', in the sense that subsequences to the left of the peak tend to be related to other subsequences on the left, and subsequences to the right tend to be related to other subsequences on the right. Hence, it makes sense to postulate that an anomalous transition has taken place in-between. A rough sketch of the preprocessing step is given below.
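The sketch below assumes the vmdpy package for the VMD step and scipy for the Welch transform; the parameter values (alpha, tau, tol, nperseg) are placeholders rather than the settings we actually used.

```python
import numpy as np
from scipy.signal import welch
from vmdpy import VMD  # assumed VMD implementation; any equivalent package would do


def preprocess_sensor(batches, K=5, nperseg=256):
    """Decompose each batch into K variational modes, Welch-transform each mode,
    log-rescale, and concatenate the results of the same mode into one long series."""
    per_mode = [[] for _ in range(K)]
    for batch in batches:
        # Variational mode decomposition: u has one quasi-orthogonal mode per row.
        u, _, _ = VMD(batch, alpha=2000, tau=0.0, K=K, DC=0, init=1, tol=1e-7)
        for k in range(K):
            # Welch transform moves us to the frequency domain ...
            _, pxx = welch(u[k], nperseg=nperseg)
            # ... then log and re-scale, purely for smoothing purposes.
            logp = np.log(pxx + 1e-12)
            logp = (logp - logp.mean()) / (logp.std() + 1e-12)
            per_mode[k].append(logp)
    # One long concatenated series per variational mode, ready for the matrix profile/CAC.
    return [np.concatenate(series) for series in per_mode]
```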
See Figure 6 for a flowchart of the proposed (offline) methodology, summarising the explanations above. It works 'offline' since we assume all of the data has arrived in full before we process it and figure out where the anomalous transitions take place (if there are any!).

Figure 6: Offline end-to-end algorithm flowchart for anomaly detection on the three-storeyed building data set. We take a subset of the data from each sensor, apply variational mode decomposition to obtain the quasi-orthogonal components, apply the Welch transform, then take the logarithm and normalise. The transformed data is then concatenated together. We compute the matrix profiles and CACs of the concatenated sequences, barycenter across the variational modes, then finally barycenter over all sensors.
As it turns out, it is not difficult to make the above algorithm work in an 'online' environment, where we process the data as it arrives and get a real-time view of how the CAC responds. What makes this work is that the matrix profile can be updated incrementally: when a new point is ingested, we only need to compute the pairwise distances involving the new subsequence, which gives a massive reduction in time complexity. To run the algorithm in real time, we compute all of the above transformations (VMD, Welch) in batches as data arrives. We then update the matrix profile one point at a time; every 100 points, we compute the updated CAC and save it in an array. At the end, we produce a gif to visualise how the CAC reacts in real time to perceived anomalous transitions. A minimal sketch of the streaming update is given below; for the online methodology flowchart, see Figure 7.
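STUMPY's floss (the streaming counterpart of fluss) implements this kind of point-by-point update. The sketch below shows the general pattern with placeholder data and window lengths; the batched VMD/Welch transforms of our full pipeline are omitted here.

```python
import numpy as np
import stumpy

m = 50   # window length (placeholder)
L = m    # subsequence length used by the arc-curve correction
old_data = np.random.rand(1000)    # data available before streaming starts
new_points = np.random.rand(2000)  # data arriving one point at a time

mp = stumpy.stump(old_data, m)
stream = stumpy.floss(mp, old_data, m=m, L=L, excl_factor=1)

snapshots = []
for i, t in enumerate(new_points, start=1):
    stream.update(t)              # ingest one point: only the new distances are computed
    if i % 100 == 0:              # every 100 points, record the current CAC
        snapshots.append(stream.cac_1d_.copy())
# `snapshots` can then be animated (e.g. as a gif) to watch the CAC react over time.
```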
Figure 7: Flowchart diagram for online anomaly detection on the three-storeyed building data set. Beginning with a pre-determined starting subset, we perform the offline algorithm (VMD, log-normalise, etc.) to get the initial CACs at three distinct phases of the data stream. We then update the matrix profiles point-by-point and, every 100 points, record the CACs to be animated.
In the next section, we present a number of notebooks where you can see the results of each data processing step, and the final predictors themselves, in both online and offline environments.