Simple Example where Statistics Reduces / Erases Data
The average of 2, 3, and 10 is 5.
Standard practice records the 5 and "forgets" the data samples that were used to calculate it.
Initially, the "10" sample looks like an outlier.
If the next sample is 11, then 10 no longer looks like an outlier.
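A minimal Python sketch of the point above, using the same three numbers (nothing else is assumed beyond NumPy being available):

```python
import numpy as np

samples = np.array([2, 3, 10])
recorded = samples.mean()            # 5.0 -- standard practice keeps only this
# The samples 2, 3 and 10 are "forgotten"; from the 5 alone you cannot tell
# whether 10 was an outlier or part of a wider spread.

updated = np.append(samples, 11)     # the next sample arrives
print(recorded, updated.mean(), updated.std())
# With the raw data retained, 10 no longer looks like an outlier once 11 appears.
```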
Truth or Statistics
Tuesday, August 9, 2016
Associative Memories and SDR - Sparse Data Representations
Associative Memories
Complex domains require humans to help discover new relations, find existing relations, or find similar relations within diverse information sources. Information sources might include text, audio, images, video, social networks, simulations, sampled sensor systems (IoT), and/or correlations across all of these types. Humans are good correlation engines.
Jeff Hawkins, co-founder of Numenta, has written "On Intelligence", a small book separated into three primary sections: (1) a theory of brain structure, (2) evidence supporting that theory, and (3) applications that could be built if humans follow this design. Hawkins uses the acronym HTM (Hierarchical Temporal Memory) for Numenta's primary data structure, which mimics the associative-memory correlation capability of the human brain. Numenta has open sourced some of its source code as a means for further exploration of these concepts. Cortical.io is a technical partner to Numenta and has gained considerable experience of its own in memory-based applications.
Saffron Tech, a competitor to Numenta, uses multiple compressed, linked matrices to represent its associative memory capabilities. Their API documentation can be seen here:
https://saffrontech.atlassian.net/wiki/display/DOC/Saffron+Documentation+10.x
The Boeing team worked with Saffron Tech over many years to help them create a data-driven, web-enabled configuration utility that makes it easy to map existing data sources into an associative memory.
From my personal experience at Boeing, designing complex, domain-specific solutions on top of these commercial infrastructures takes less effort than building on traditional software infrastructures or writing your own associative memory infrastructure from scratch.
An open-source implementation called S-Space (Semantic Space) provides a rudimentary implementation for natural language processing.
https://github.com/fozziethebeat/S-Space
Google, of course, also has something like an associative memory embedded within its infrastructure, which we use daily when we type in queries and "it" suggests adjacent queries.
Co-mentions within books also reveal interesting relations between characters.
Many implementations utilize Sparse Data Representations
Many software ecosystems now contain SDR data structures, including Scala, Python, Java, C, C++, and R; you just need to know where to look. That will be another post.
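A minimal, purely illustrative sketch of the SDR idea in Python; the sizes, names, and similarity measure are my own assumptions, not any particular library's API:

```python
import random

N = 2048   # total number of bits in each representation
W = 40     # number of active bits (~2% sparsity)

def random_sdr(seed):
    """Return a sparse binary vector, stored as the set of its active bit indices."""
    rng = random.Random(seed)
    return frozenset(rng.sample(range(N), W))

def overlap(a, b):
    """Shared active bits act as a simple similarity measure between two SDRs."""
    return len(a & b)

cat, dog, car = random_sdr(1), random_sdr(2), random_sdr(3)
print(overlap(cat, dog), overlap(cat, car))   # random, unrelated SDRs barely overlap
```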
Friday, July 29, 2016
PCA Limitations Illustrated
Principal Component Analysis
PCA is a well-known method, included in most data mining packages, and is inexpensive to compute, requiring only simple sample statistics (mean and variance). The PCA method assumes that parameters and/or features are linearly correlated and, implicitly, that the data are approximately Gaussian (see the Quora discussion linked below). PCA transforms the observed data into a linearly independent metric space.
https://www.quora.com/Are-there-implicit-Gaussian-assumptions-in-the-use-of-PCA-principal-components-analysis
All methods have limitations; I wish that data scientists would discuss method limitations more often.
The picture below originates from the scikit-learn community site. I rearranged the panels by moving the "True Sources" signal panel to the top, followed by "Observed Samples", then "PCA transformed signals", then "ICA transformed signals".
As can easily be seen, PCA distorts the original signals significantly. ICA (Independent Component Analysis) uses non-linear, computationally more expensive methods to extract something much closer to the "true source" signals, as also described in the Quora explanation of PCA above, which mentions ICA.
After seeing many signals in my lifetime, in optics, audio, electrical, and process control systems, I find that the "observations from mixed signal" resemble a "filtered, under-sampled" representation of the true sources. This is an example of how attempting to reduce noise in a signal via filters also corrupts the "true sources" signal.
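A condensed sketch along the lines of the scikit-learn blind source separation example referenced above; the signal shapes and mixing matrix are illustrative, and plotting the four arrays side by side reproduces the panels discussed here:

```python
import numpy as np
from scipy import signal
from sklearn.decomposition import PCA, FastICA

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                       # sinusoid
s2 = np.sign(np.sin(3 * t))              # square wave
s3 = signal.sawtooth(2 * np.pi * t)      # sawtooth
S = np.c_[s1, s2, s3]
S += 0.2 * rng.normal(size=S.shape)      # observation noise
S /= S.std(axis=0)                       # the "true sources"

A = np.array([[1.0, 1.0, 1.0],
              [0.5, 2.0, 1.0],
              [1.5, 1.0, 2.0]])          # mixing matrix
X = S @ A.T                              # the "observed samples"

S_pca = PCA(n_components=3).fit_transform(X)                       # orthogonal, variance-ordered
S_ica = FastICA(n_components=3, random_state=0).fit_transform(X)   # statistically independent
# Plot S, X, S_pca and S_ica as four panels to see how much PCA distorts the sources.
```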
Thursday, July 28, 2016
Visualization, Validation and Collaboration are Difficult to Achieve
Oblong Industries has developed an excellent product called "Mezzanine" that helps organizations collaborate using natural gestural interfaces. Oblong's founders helped Hollywood generate the visuals in Minority Report using a prototype of their flagship product.
http://www.oblong.com/mezzanine/overview/
Anomaly Detection Methods
Anomalies are rare signals within a larger sample space; finding recurring anomalies associated with alerts or system errors leads to system improvements and better system designs. For instance, buried deep in the SONET (synchronous optical network) protocol, if the timing clocks between two network interfaces are out of sync, the underlying protocol "drops a frame"; if HDTV video is streaming across the network, that frame loss causes visual defects to appear on users' screens. Similar anomalies occur in many different types of systems; finding them and then correcting network configurations leads to better overall system designs.
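As a hypothetical illustration of catching rare events like dropped frames, the sketch below assumes drop counts on a healthy link follow a low-rate Poisson baseline and flags intervals with improbably high counts; the rates and thresholds are invented for the example.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
baseline_rate = 0.5                              # assumed expected drops per interval on a healthy link
drops = rng.poisson(baseline_rate, size=1000)    # simulated per-interval frame-drop counts
drops[400] = 12                                  # inject an anomalous burst (e.g., clocks out of sync)

threshold = poisson.ppf(0.9999, baseline_rate)   # counts this high are very unlikely under the baseline
anomalous = np.where(drops > threshold)[0]
print(anomalous)                                 # should flag interval 400 for investigation
```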
For jet engine anomaly detection and remaining-useful-life estimation, my team recently adopted a hybrid approach that coupled system model experts, physical system test rigs, and data mining / machine learning methods. This permitted a faster discovery process for this complex system and its subsystems.
Traditional means of studying systems require system models to be developed. In reviews with experts, system models require iterative refinement across the following stages:
- review of system documentation, operations and failure modes,
- identify model parameters of interest,
- system metrics associated with function, performance and quality
- defining model distributions,
- model types often drive distribution forms: normal, exponential, Weibull, Poisson, binomial, etc.
- sampling strategy of parameters,
- filter development to separate signal from noise,
- filtered signal characterization
- iterative signal analysis
- applying statistics to observed signals and noise
- applying sensitivity analysis to parameters to understand their contributions to signal / noise
- simulating systems with perturbations of modeled signals can enhance discovery
- finding observations outside the modeled norms (see the sketch after this list)
- rationalizations between model and observations found outside the model distribution
- methods to consider how humans interact with the system and system models
- all steps from above with visualization
- methods to help system model designers to validate system by inspection
- system designs and experts may not be accessible
- organizational issues may disallow access to expertise
- system documentation may be incomplete
- without expert guidance system operations may not be understood
- it may take considerable time to rebuild a team understanding of system dynamics
- previous system models may not be readily accessible due to computer system obsolescence
- assumptions associated with signal / noise model distributions may be in error
- simulations of signal / noise model distributions may lead to false sense of system dynamics
- outlying observations may be rare samples from within the larger sample space
- outlying observations may be the anomalies that lead to system error
- observing more noise may increase the number of outlying samples observed
- it takes considerable time and effort to explore single failure modes
- large systems contain many coupled subsystems; each subsystem with multiple failure modes
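The sketch below touches a few of the stages above under purely illustrative assumptions: a normal signal model is defined and characterized from samples, observations are perturbed, and samples falling outside the modeled norms are flagged for rationalization.

```python
import numpy as np

rng = np.random.default_rng(1)

# "defining model distributions" -- assume a normal model for the signal
model_samples = rng.normal(loc=10.0, scale=2.0, size=5000)

# "applying statistics to observed signals" -- characterize the model
mu, sigma = model_samples.mean(), model_samples.std()

# simulate observations with a couple of perturbations injected
observations = rng.normal(loc=10.0, scale=2.0, size=200)
observations[[25, 90]] = [25.0, -3.0]

# "finding observations outside the modeled norms" (3-sigma rule, an assumption)
outside = np.abs(observations - mu) > 3 * sigma
print(np.where(outside)[0])   # rare samples from the larger space, or true anomalies?
```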
Within machine learning / data mining, a wide variety of methods are in use and embedded within most of the common infrastructures. Dr. Ben Fry, co-creator of processing.org, illustrates a similar visualization workflow, shown below:
Dr. Ben Fry's data mining workflow
Machine learning / data mining methods are similar to the standard engineering workflow, except that the vernacular changes somewhat and a complete system understanding is not necessarily required to get started in discovering system dynamics and associations. If it can be accommodated by the underlying software infrastructure, all data can be pairwise correlated in parallel. The benefit of approaching system analysis in this manner is that you may learn and discover non-obvious relationships between the sampled parameters; associative memory infrastructures take this approach, building large distributed matrices that are redistributed across many nodes within a distributed computing environment.
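A small sketch of the "correlate everything pairwise" idea, assuming the sampled parameters fit in memory as a pandas DataFrame; the column names and the hidden relationship are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "temperature": rng.normal(size=500),
    "vibration":   rng.normal(size=500),
    "pressure":    rng.normal(size=500),
})
# a non-obvious relationship hidden among otherwise independent parameters
df["power"] = 0.8 * df["temperature"] + 0.2 * rng.normal(size=500)

corr = df.corr()          # every parameter correlated against every other parameter
print(corr.round(2))      # "power" vs "temperature" stands out with no prior system knowledge
```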
However, many machine learning / data mining methods teach an approach called feature reduction, which roughly corresponds to selecting a subset of all parameters and filtering signal from noise, in order to reduce the computational burden while still retaining sufficient classification performance for the system dynamics.
Principal Component Analysis (PCA) is a common method for determining and reducing features down to a manageable computation.
The image and text below are taken from http://www.nlpca.org/pca_principal_component_analysis.html
Principal component analysis (PCA) rotates the original data space such that the axes of the new coordinate system point into the directions of highest variance of the data. The axes or new variables are termed principal components (PCs) and are ordered by variance: The first component, PC 1, represents the direction of the highest variance of the data. The direction of the second component, PC 2, represents the highest of the remaining variance orthogonal to the first component. This can be naturally extended to obtain the required number of components which together span a component space covering the desired amount of variance.
Since components describe specific directions in the data space, each component depends by certain amounts on each of the original variables: Each component is a linear combination of all original variables.
Dimensionality reduction
Low variance can often be assumed to represent undesired background noise. The dimensionality of the data can therefore be reduced, without loss of relevant information, by extracting a lower dimensional component space covering the highest variance. Using a lower number of principal components instead of the high-dimensional original data is a common pre-processing step that often improves results of subsequent analyses such as classification.
For visualization, the first and second component can be plotted against each other to obtain a two-dimensional representation of the data that captures most of the variance (assumed to be most of the relevant information), useful to analyze and interpret the structure of a data set."
Visualization and interpretation are important stages in "sense-making". PCA stretches the original data dimensions, and the chosen top-N components may not represent all of the features necessary for a particular system under study. In addition, PCA carries the assumption that the features are linearly related in some manner; most natural systems are complex and non-linear.
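A hedged scikit-learn sketch of the dimensionality reduction described above, using the bundled iris data purely as a stand-in: project onto the top two principal components and inspect how much variance the 2-D view retains, and therefore how much it discards.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                            # 4 original, partly correlated features
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)                         # 2-D representation for plotting / "sense-making"

print(X_2d.shape)                               # (150, 2)
print(pca.explained_variance_ratio_)            # variance captured by PC 1 and PC 2
print(1 - pca.explained_variance_ratio_.sum())  # the share of variance the top-2 view throws away
```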
Data Mining Tools include:
Scikit-learn is an excellent Python data mining toolkit; it contains implementations of PCA and many other methods.
RapidMiner is an excellent Java-based data mining toolkit that contains a block-oriented workflow construction tool and execution environment. RapidMiner also contains implementations of PCA and many other methods.
Scala, Spark, and MLlib form a newer infrastructure that adds functional programming ideas to a distributed environment via RDDs, a distributed data structure that sits on top of Hadoop ecosystems.
For text data mining and visualization, Elasticsearch and its recent addition, Kibana, give excellent performance and coverage for integrating, indexing, and visualizing disparate data sets within the same analysis context.
For even larger data sets, I would recommend examining the national and international physics community projects, e.g. the CERN ROOT toolkit and its TMVA package for ensemble methods across non-linear parameter sets. In the US, I recommend ParaView, VisIt, and the LLNL software projects. After all, "they" found the elusive Higgs boson by combining these tools in a variety of ways.
I also recommend looking at the CERN Level 2 discriminator, which determines the Z coordinate associated with collisions using NVIDIA GPUs.
Systems Theory
Many systems fail due to improper formulation at interfaces and/or after encapsulation.
Dr. Wayne Wymore was the founder of the University of Arizona's Systems and Industrial Engineering Department.
https://en.wikipedia.org/wiki/A._Wayne_Wymore
http://sysengr.engr.arizona.edu/wymore/WWAutobiography.htm
There are now many different types of "systems theory":
https://en.wikipedia.org/wiki/List_of_types_of_systems_theory
Dr. Wymore developed a rigorous mathematical formulation for systems; a system is defined as:
System Model = (IZ, OZ, SZ, NZ, RZ)
IZ is the set of all inputs
OZ is the set of all outputs
SZ is the set of all states
NZ is the next-state function, NZ(IZ, SZ) --> SZ
In English, NZ is a function that maps inputs, within the context of the current state, to the next state.
RZ is the read/output function, RZ(SZ) --> OZ
In English, RZ is a function that maps the current state set to the output set.
Note that IZ, OZ, and SZ are the sets of all inputs, outputs, and states across all time. Additionally, at any time instant, IZ, OZ, SZ, NZ, and RZ can be defined by multidimensional vectors or matrices across time.
Wymore's mathematical, systems-theoretic definition includes how systems can be encapsulated and how to properly consider coupling between systems. Google Books and Wymore's original books contain a complete discussion of how to study systems on a rigorous mathematical basis.
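A toy Python rendering of Wymore's quintuple, here a saturating counter; the particular sets and functions below are illustrative choices, not Wymore's own example.

```python
IZ = {"inc", "dec", "hold"}             # set of all inputs
SZ = set(range(0, 6))                   # set of all states (0..5)
OZ = {"empty", "partial", "full"}       # set of all outputs

def NZ(i, s):
    """Next-state function: NZ(IZ, SZ) --> SZ."""
    if i == "inc":
        return min(s + 1, max(SZ))
    if i == "dec":
        return max(s - 1, min(SZ))
    return s

def RZ(s):
    """Read/output function: RZ(SZ) --> OZ."""
    return "empty" if s == 0 else "full" if s == max(SZ) else "partial"

state = 0
for i in ["inc", "inc", "inc", "dec", "hold"]:
    state = NZ(i, state)
    print(i, state, RZ(state))
```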
Many systems fail due to ill-defined behavior of the next state function especially with regard to assumptions that are made at interfaces or after encapsulation.
Wednesday, July 27, 2016
Modeling Systems
Traditional systems engineering methods include some combination of models, inputs, outputs and functional transforms. A simple system diagram is illustrated in Figure 1.
Figure 1: Closed-loop feedback system
Inputs are transformed by the system into outputs. In a feedback system some outputs are "fed backwards" to become new inputs to be transformed by the system at a future time.
The theory of "systems engineering" is summarized at Dr Bahill's University of Arizona webpage; systems engineering is the methodical study of complex systems.
Models of systems are often created to investigate existing systems and also to explore alternative designs.
Figure 2: Controlled Closed-loop Feedback System
Figure 2 illustrates a simplified system that includes a controlled feedback loop. Although simplified, a decomposition of the complete system would reveal a set of coupled subsystems wherein inputs are transformed into outputs and those outputs become inputs to adjacent systems.
From left to right, a "disturbance", presumably environmentally caused, is part of the set of inputs transformed by the "System" block. In the illustrated system, "status" variables are emitted as outputs from the system block. These "status" variables are used as inputs to a "measuring element", whereby "values" are emitted as outputs. These "values" are compared to "set points", and the comparison can indicate "errors". The "errors" become inputs to a "Controller", and in turn the "Controller" commands the "Effector" to generate "feedback" values that modify the inputs at the next time delta.
Physically realized systems that use feedback control loops with set-point values do not instantaneously reach the "set point / target", because physical systems contain inertia in mass-based systems, resistance in electrical systems, or attenuation in optical systems. Disturbances might include externally induced vibrations such as seismic waves, heat from adjacent systems, and mismatches in optical coupling between fibers.
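A toy discrete-time sketch of the loop in Figure 2: a sluggish first-order system, a proportional controller, and an injected disturbance. The gain, lag, and disturbance are invented for illustration, and show why the set point is approached gradually rather than instantaneously.

```python
set_point = 10.0
state = 0.0          # the "status" value the measuring element reports
kp = 0.4             # proportional controller gain (assumed)
lag = 0.5            # inertia / resistance / attenuation of the physical system (assumed)

for step in range(30):
    disturbance = 2.0 if step == 15 else 0.0   # external disturbance arrives at step 15
    value = state                              # measuring element reads the status
    error = set_point - value                  # comparison against the set point
    command = kp * error                       # controller commands the effector
    state += lag * command + disturbance       # system transforms inputs into the next state
    print(f"step {step:2d}  value {state:6.2f}  error {error:6.2f}")
```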