Thursday, July 28, 2016

Anomaly Detection Methods


Anomalies are rare signals from a larger sample space; finding recurring anomalies associated with alerts or system errors leads to system improvements and better system designs.  For instance, buried deep in the SONET (synchronous optical network) protocol, if the timing clocks are out of sync between two network interfaces, the underlying protocol "drops a frame"; if HDTV video is streaming across the network, the frame loss causes visual defects to appear on users' screens.  Similar anomalies can occur in many different types of systems; finding them and then correcting the underlying configurations leads to better overall system designs.

For jet engine anomaly detection and remaining useful life estimation, my team recently adopted a hybrid approach that coupled system model experts, physical system test rigs, and data mining / machine learning methods.  This permitted a faster discovery process for this complex system and its subsystems.

Traditional means of studying systems require system models to be developed.  In reviews with experts, system models require iterative refinement across the following stages:

  1.  review of system documentation, operations, and failure modes
  2.  identification of model parameters of interest
  3.  system metrics associated with function, performance, and quality
  4.  definition of model distributions
  5.  model types often drive distribution forms: normal, exponential, Weibull, Poisson, binomial, etc.
  6.  sampling strategy for parameters
  7.  filter development to separate signal from noise
  8.  filtered signal characterization
  9.  iterative signal analysis
  10.  applying statistics to observed signals and noise
  11.  applying sensitivity analysis to parameters to understand their contributions to signal / noise
  12.  simulating systems with perturbations of modeled signals to enhance discovery
  13.  finding observations outside the modeled norms (see the sketch after this list)
  14.  rationalization between the model and observations found outside the model distribution
  15.  methods to consider how humans interact with the system and system models
  16.  all steps from above with visualization
  17.  methods to help system model designers validate the system by inspection
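
As a rough illustration of steps 4, 5, and 13 above, the sketch below fits an assumed Weibull distribution to a synthetic degradation metric and flags observations that fall outside the modeled norm.  The data, the choice of Weibull, and the 99.9th-percentile cutoff are illustrative assumptions, not values from any particular system.

    from scipy import stats

    # Hypothetical degradation metric; in practice this would come from the
    # system under study (these values are synthetic).
    samples = stats.weibull_min.rvs(c=1.8, scale=100.0, size=2000, random_state=0)

    # Steps 4-5: fit a candidate distribution (Weibull here) to the observed metric.
    shape, loc, scale = stats.weibull_min.fit(samples, floc=0)

    # Step 13: flag observations outside the modeled norm, taken (arbitrarily)
    # as anything beyond the fitted 99.9th percentile.
    threshold = stats.weibull_min.ppf(0.999, shape, loc=loc, scale=scale)
    outliers = samples[samples > threshold]

    print(f"fitted shape={shape:.2f}, scale={scale:.1f}, threshold={threshold:.1f}")
    print(f"{outliers.size} observations exceed the modeled norm")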

Limitations with the standard approach include:
  1. system designs and experts may not be accessible
  2. organizational issues may disallow access to expertise
  3. system documentation may be incomplete 
  4. without expert guidance, system operations may not be understood 
  5. it may take considerable time to rebuild a team understanding of system dynamics
  6. previous system models may not be readily accessible due to computer system obsolescence
  7. assumptions associated with signal / noise model distributions may be in error
  8. simulations of signal / noise model distributions may lead to a false sense of system dynamics
  9. outlying observations may be rare samples from within the larger sample space
  10. outlying observations may be the anomalies that lead to system error
  11. observing more noise increases the chance of observing outlying samples
  12. it takes considerable time and effort to explore single failure modes
  13. large systems contain many coupled subsystems; each subsystem with multiple failure modes

Applications of machine learning / data mining were originally devised as a set of methods to discover or rediscover system dynamics utilizing a "data-driven" approach.

Within machine learning / data mining infrastructures, a wide variety of methods are in use and embedded within most of the common toolkits. Dr. Ben Fry, co-creator of the processing.org visualization environment, describes a similar workflow, illustrated below:


Dr. Ben Fry's data mining workflow

Machine learning / data mining methods are similar to the standard engineering workflow, except that the vernacular changes somewhat: a complete system understanding is not necessarily required to get started in discovering system dynamics and associations.  If the underlying software infrastructure can accommodate it, all data can be pairwise correlated against each other in parallel.  The benefit of approaching system analysis in this manner is that you may learn and discover non-obvious relationships between the sampled parameters; associative memory infrastructures take this approach, building large distributed matrices that are redistributed across many nodes within a distributed computing environment.
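
As a minimal sketch of the pairwise-correlation idea (not any particular associative memory product), the snippet below correlates every sampled parameter against every other and ranks the strongest relationships; the synthetic DataFrame and column names are assumptions for illustration.

    import numpy as np
    import pandas as pd

    # Hypothetical sample table: rows are time samples, columns are system parameters.
    rng = np.random.default_rng(0)
    data = pd.DataFrame(rng.normal(size=(1000, 5)),
                        columns=[f"param_{i}" for i in range(5)])

    # All-pairs (pairwise) correlation across every sampled parameter.
    corr = data.corr()

    # Rank the off-diagonal pairs by absolute correlation to surface the
    # strongest (possibly non-obvious) relationships; each pair appears twice
    # because the matrix is symmetric.
    pairs = (corr.where(~np.eye(len(corr), dtype=bool))
                 .abs()
                 .unstack()
                 .dropna()
                 .sort_values(ascending=False))
    print(pairs.head(10))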
   
However, many machine learning / data mining methods teach an approach called feature reduction, which roughly corresponds to selecting a subset of all parameters and filtering signal from noise as a means to reduce computational burden while still retaining sufficient classification performance for the system dynamics.
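
A minimal sketch of one such feature-reduction step, assuming a synthetic feature matrix and an arbitrarily chosen variance cutoff, uses scikit-learn's VarianceThreshold to drop near-constant parameters:

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    # Hypothetical feature matrix with a couple of near-constant, low-information columns.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))
    X[:, 3] *= 1e-3   # nearly constant: mostly noise floor
    X[:, 6] *= 1e-3

    # Drop any feature whose variance falls below the chosen (assumed) threshold.
    selector = VarianceThreshold(threshold=1e-3)
    X_reduced = selector.fit_transform(X)

    print("columns kept:", np.flatnonzero(selector.get_support()))
    print("shape before / after:", X.shape, X_reduced.shape)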

Principal Component Analysis (PCA) is a common method to determine and reduce features down to a computationally manageable set.

The image and text below are from http://www.nlpca.org/pca_principal_component_analysis.html

"PCA transformation
Principal component analysis (PCA) rotates the original data space such that the axes of the new coordinate system point into the directions of highest variance of the data. The axes or new variables are termed principal components (PCs) and are ordered by variance: The first component, PC 1, represents the direction of the highest variance of the data. The direction of the second component, PC 2, represents the highest of the remaining variance orthogonal to the first component. This can be naturally extended to obtain the required number of components which together span a component space covering the desired amount of variance.
Since components describe specific directions in the data space, each component depends by certain amounts on each of the original variables: Each component is a linear combination of all original variables.

Dimensionality reduction

Low variance can often be assumed to represent undesired background noise. The dimensionality of the data can therefore be reduced, without loss of relevant information, by extracting a lower dimensional component space covering the highest variance. Using a lower number of principal components instead of the high-dimensional original data is a common pre-processing step that often improves results of subsequent analyses such as classification.
For visualization, the first and second component can be plotted against each other to obtain a two-dimensional representation of the data that captures most of the variance (assumed to be most of the relevant information), useful to analyze and interpret the structure of a data set."

Visualization and interpretation are important stages in "sense-making".  PCA stretches the original data dimensions, and the chosen top-N components may not represent all features necessary for a particular system under study.  In addition, PCA assumes that the features are linearly related in some manner.  Most natural systems are complex and non-linear.
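
As a minimal sketch of the PCA workflow described above, assuming a synthetic observation matrix, the snippet below standardizes the data, keeps enough components to cover roughly 95% of the variance, and exposes PC 1 and PC 2 for the two-dimensional view mentioned in the quoted text.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Hypothetical observation matrix: rows are observations, columns are parameters.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 12))

    # Standardize so that no single parameter dominates the variance.
    X_std = StandardScaler().fit_transform(X)

    # Keep however many components are needed to cover ~95% of the variance.
    pca = PCA(n_components=0.95)
    scores = pca.fit_transform(X_std)

    print("components kept:", pca.n_components_)
    print("explained variance ratio:", pca.explained_variance_ratio_)

    # The first two columns of `scores` are PC 1 and PC 2; plot them against
    # each other to inspect the structure of the data set.
    pc1, pc2 = scores[:, 0], scores[:, 1]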

Data Mining Tools include:
Scikit-learn is an excellent Python data mining toolkit; it provides implementations of PCA and many other methods.

RapidMiner is an excellent Java-based data mining toolkit that contains a block-oriented data mining workflow construction tool and execution environment.  RapidMiner also contains implementations of PCA and many other methods.

Scala, Spark, and MLlib form a newer infrastructure that adds functional programming ideas to a distributed environment via RDDs, a distributed data structure that sits on top of Hadoop ecosystems.

For text data mining and visualization, Elasticsearch and its recent addition Kibana give excellent performance and coverage for integrating, indexing, and visualizing disparate data sets within the same analysis context.

For even larger data sets I recommend examining the national and international physics community projects, e.g. the CERN ROOT toolkit and its TMVA package for ensemble methods across non-linear parameter sets.  In the US I recommend ParaView, VisIt, and the LLNL software projects.  After all, "they" found the elusive Higgs boson by combining these tools in a variety of ways.

I also recommend looking at the CERN Level 2 discriminator that determines the Z coordinate associated with collisions using NVIDIA GPUs.
