Practical Guide to Gaussian Processes

Preface

This book gives an introduction to Gaussian processes and presents a selection of their many applications. The introduction is aimed at readers who want to apply the technique to practical engineering problems. Using application examples, it shows how Gaussian processes can be used for machine learning to infer from known to unknown situations. The book also serves as a reference for common analytical representations of Gaussian processes and for the mathematical operations and methods used in specific use cases.

Introduction

A Gaussian process is a stochastic process with the property that every finite subset of its values is multivariate normally (Gaussian) distributed. A stochastic process is a function whose values are random variables following a given probability distribution. This makes it possible to model functions probabilistically when their values cannot be completely determined due to a lack of information. A Gaussian process is constructed from functions of mean values, variances and covariances and thus describes the function values as a continuum of correlated random variables in the form of an infinite-dimensional normal distribution. The distribution of a Gaussian process can be imagined as a probability distribution over functions. A sample drawn from it yields a random function with certain preferred properties of its curve shape.

Applications

Gaussian processes are used for the mathematical modeling of the behavior of non-deterministic systems on the basis of stochastic quantities or observations. Gaussian processes are suitable for signal analysis and synthesis, form a powerful tool for the interpolation, extrapolation, or smoothing of discrete measurement points of arbitrary dimension (Gaussian process regression or kriging), and find application in classification problems. Gaussian processes, which are related to kernel methods,[1] can be used as a supervised machine learning technique for abstract modeling based on training examples. This Bayesian approach to machine learning has the advantage that it often does not require the iterative training of neural networks. Instead, Gaussian processes can be derived very efficiently with linear algebra from statistical quantities of the examples, and they are mathematically clearly interpretable and well controllable. Moreover, for interpolations and predictions, an associated confidence interval is computed for each individual output value, which estimates its own prediction error and correctly accounts for error propagation when the variance of the input values is known.

Mathematical Description

Definition

A stochastic process (X_t)_{t∈T} on an arbitrary index set T is called a Gaussian process if its finite-dimensional distributions are multivariate normal distributions (also called Gaussian distributions) for all t_1, t_2, 
, t_n ∈ T. That is, the joint distribution of (X_{t_1}, X_{t_2}, 
, X_{t_n}) is an n-dimensional normal distribution.

Term: Although the term Gaussian process may suggest temporal or sequential processes, no such restriction exists. In a generalized sense, the process can be understood as a continuum.

Notation

In analogy to the one- and multidimensional Gaussian distribution, a Gaussian process is completely and uniquely determined by its first two moments. For the multidimensional Gaussian distribution, these are the mean (or expected value) vector ÎŒ and the covariance matrix Σ. For the description of a Gaussian process, these are replaced by a mean (or expected value) function

m(t) := 𝔌(X_t),  t ∈ T

and a covariance function

k(t,t′) := Cov(X_t, X_{t′}) := 𝔌[(X_t − m(t))(X_{t′} − m(t′))],  t, t′ ∈ T.

In the simplest one-dimensional case, these functions can be understood as a vector with a continuous row index and as a matrix with continuous row and column indices, respectively. The following table compares one-dimensional and multidimensional Gaussian distributions with Gaussian processes. The tilde symbol ∌ can be read as "is distributed as".

Distribution type | Notation | Variables | Probability density function
Univariate normal distribution | X ∌ 𝒩(ÎŒ, σ^2) | X, ÎŒ, σ ∈ ℝ | p(x) = 1/(σ√(2π)) · exp{−œ (x−Ό)^2/σ^2}
Multivariate normal distribution | 𝐗 ∌ 𝒩_n(𝛍, Σ) | 𝐗, 𝛍 ∈ ℝ^n; Σ ∈ ℝ^{n×n} | p(𝐱) = (2π)^{−n/2} |Σ|^{−1/2} · exp{−œ (𝐱−𝛍)^T Σ^{−1} (𝐱−𝛍)}
Gaussian process distribution | (X_t)_{t∈T} ∌ 𝒒𝒫(m, k) | m: T → ℝ;  k: T×T → ℝ | (no analytical representation)

The probability density function of a Gaussian process cannot be represented analytically because there is no corresponding notation for operations with continuous matrices. This gives the impression that one cannot perform computations with Gaussian processes in the same way as with finite-dimensional normal distributions. However, the essential property of the Gaussian process is not the infinity of the dimensions, but rather the assignment of the dimensions to the coordinates of a function. In practical applications, one always has to deal with a finite number of interpolation points and can therefore perform all calculations as in the finite-dimensional case. The limit to infinitely many dimensions is only needed in an intermediate step, namely if values are to be read out at new interpolated grid points. In this intermediate step, the Gaussian process, i.e. the mean function and covariance function, is represented or approximated by suitable analytical expressions. In this case the assignment to the grid points is done via the parameterized coordinates t in the analytical expression. In the finite-dimensional case with discrete grid points the associated coordinates ti are assigned to the dimensions by their indices.

Example of a Gaussian process

As a simple real world example, consider a Gaussian process

(X_t)_{t∈T} ∌ 𝒒𝒫(m(t), k(t,t′))

with a scalar variable t (time), given by the mean function

m(t) = 5 V

and covariance function

k(t,t′) = { (1 V)^2 for t = t′;  0 for t ≠ t′ }

This Gaussian process describes an endless temporal electrical signal with Gaussian white noise with a standard deviation of one volt centered around a mean voltage of 5 volts.
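
As a minimal sketch (not part of the original article), the following Python snippet draws one realization of this white-noise process on a discrete time grid; the grid size and variable names are illustrative only.

```python
import numpy as np

# Discrete time grid; the process itself is defined for every t.
t = np.linspace(0.0, 1.0, 200)

# Mean function m(t) = 5 V and white-noise covariance k(t, t') = (1 V)^2 * delta(t, t').
mean = np.full_like(t, 5.0)            # volts
cov = 1.0**2 * np.eye(len(t))          # volts^2 on the diagonal, zero elsewhere

# One realization: a noisy 5 V signal with about 1 V standard deviation.
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(mean, cov)
print(round(sample.mean(), 2), round(sample.std(), 2))
```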

Definitions of special properties

A Gaussian process is called centered if its expected value or mean is constantly 0, that is, if m(t) := 𝔌(X_t) = 0 for all t ∈ T.

A covariance function k(t,t′) := Cov(X_t, X_{t′}) is called stationary when it is translation invariant, that is, when it can be written as a function of the difference alone: k(t,t′) = k(t − t′).[2]

A Gaussian process is called stationary (or translation invariant) if its covariance function is stationary and its mean is constant.[3]

A covariance function is called radial when it is radially symmetric, i.e. when it can be written as k(t,t′) = k(|t − t′|) with a one-dimensional argument given by the Euclidean norm |·|. It is used to describe systems with isotropic model properties.

List of Common Gaussian Processes and Covariance Functions

  ‱ Constant: m(t) = 0 and k(t,t′) = σ^2
Corresponds to a constant value from a Gaussian distribution with standard deviation σ.
  ‱ Offset: m(t) = c and k(t,t′) = 0
Corresponds to a constant value given by c.
  ‱ Gaussian white noise: k(t,t′) = σ^2 ÎŽ_{t,t′}
(σ: standard deviation, δ: Kronecker delta)
  ‱ Rational quadratic: k(r) = (1 + r^2)^{−α},  α ≄ 0
  ‱ Gamma-exponential: k(r) = exp(−(r/ℓ)^γ)
  ‱ Ornstein-Uhlenbeck:[4] k(r) = exp(−r/ℓ)
Corresponds to a simple Gauss-Markov process and describes continuous, non-differentiable functions, as well as white noise after passing through an RC low-pass filter.
  ‱ Squared exponential: k(r) = exp(−r^2/(2ℓ^2))
Describes infinitely differentiable (smooth) functions.
kν=p+1/2(r)=exp(2νr)Γ(p+1)Γ(2p+1)i=0p(p+i)!i!(pi)!(8νr)pi
A highly versatile Gaussian process used to describe most typical measurement curves. The functions of the Gaussian process are n times continuously differentiable if Îœ > n. Covariance functions with Îœ = 1/2, 3/2, 5/2, etc. correspond to white noise that has passed through 1, 2, or 3 RC low-pass filters or that has been convolved with the function exp(−|x|). Common special cases include:
kν=3/2(r)=(1+3r)exp(3r)
kν=5/2(r)=(1+5r+5r232)exp(5r)
kν=1/2(r) corresponds to the Ornstein-Uhlenbeck covariance function, and kν(r) corresponds to the squared exponential function.
  ‱ Periodic: k(r) = exp(−2·sin^2(π·r/T)/ℓ^2)
Functions from this Gaussian process are both periodic with period T and smooth (squared exponential). If the square around the sine is replaced by the absolute value, non-smooth periodic functions result.
  ‱ Polynomial: k(t,t′) = (t^T t′ + σ_0^2)^p
Grows rapidly outward and is usually a poor choice for regression problems, but can be useful in high-dimensional classification problems. It is positive semidefinite and does not necessarily generate invertible covariance matrices.[6]
  ‱ Wiener process: k(t,t′) = min(t, t′)
Corresponds to Brownian motion, i.e. the integral over Gaussian white noise.
  ‱ Itƍ process: If T = ℝ_+, f and g are two integrable real-valued functions, and (W_t) is a Wiener process, then the Itƍ process
X_t = ∫_0^t f(s) ds + ∫_0^t g(s) dW_s
is a Gaussian process with m(t) = ∫_0^t f(s) ds and k(t,t′) = ∫_0^{min(t,t′)} g^2(s) ds.

Remarks:

  ‱ r := |t − t′| is the distance argument for stationary and radial covariance functions k(t,t′) = k(r).
  ‱ ℓ is the characteristic length scale of the covariance function, at which the correlation has decayed to about e^{−1}.
  • Most stationary covariance functions k(r) are normalized to k(0)=1 and are therefore equivalent to correlation functions. For use as covariance functions, they are multiplied by a variance σ2, which assigns the variables a scaling and/or physical unit.
  ‱ Covariance functions cannot be arbitrary functions k(r) or k(t,t′), as it must be ensured that they are positive definite.[7] Positive semidefinite functions are also valid covariance functions, but they do not necessarily produce invertible covariance matrices and are therefore usually combined with a positive definite function. A small numerical check of some of the covariance functions listed above is sketched below.
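
The following Python sketch (not part of the original article) evaluates three of the covariance functions listed above on a grid and verifies positive definiteness via a Cholesky factorization; the length scales and grid are illustrative assumptions.

```python
import numpy as np

def squared_exponential(r, ell=1.0):
    return np.exp(-r**2 / (2 * ell**2))

def matern_3_2(r, ell=1.0):
    a = np.sqrt(3.0) * r / ell
    return (1.0 + a) * np.exp(-a)

def periodic(r, T=2.0, ell=1.0):
    return np.exp(-2.0 * np.sin(np.pi * r / T)**2 / ell**2)

def covariance_matrix(t, k, sigma2=1.0):
    """Evaluate a stationary covariance function k(r) on a grid and scale it by a variance."""
    r = np.abs(t[:, None] - t[None, :])
    return sigma2 * k(r)

t = np.linspace(0.0, 10.0, 100)
for name, k in [("squared exponential", squared_exponential),
                ("Matern 3/2", matern_3_2),
                ("periodic", periodic)]:
    K = covariance_matrix(t, k)
    # A Cholesky factorization succeeds only for (numerically) positive definite matrices;
    # a tiny diagonal "jitter" absorbs round-off for nearly semidefinite cases.
    np.linalg.cholesky(K + 1e-8 * np.eye(len(t)))
    print(name, "gives a positive definite covariance matrix on this grid")
```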

Mathematical operations with Gaussian processes

Gaussian processes (or normal distributions) can be used to perform various stochastic operations that allow different functions with normally distributed errors to be joined or extracted from each other. If there are cross-correlations between the functions, it is assumed that they follow a joint normal distribution. In signal processing, for example, the operations are used to handle temporal signals and their measurement uncertainties. The distributions of these functions are described in the following operations in vector and matrix notation for finitely many interpolation points y ∌ 𝒩(ÎŒ, Σ), which applies analogously to arbitrary mean functions m(t) and covariance functions k(t,t′). The normally distributed vectors (y_1, y_2, etc.) are accordingly understood as functions.

Linear transformation

Addition: uncorrelated functions

If the sum of two independent (and in particular uncorrelated) functions is formed, then their mean functions and their covariance functions add up:

y1+y2𝒩(μ1,Σ1)+𝒩(μ2,Σ2)=𝒩(μ1+μ2,Σ1+Σ2).

The associated probability density functions thereby undergo a convolution.

Addition: correlated functions

Correlated functions can, in the extreme case, be identical or differ only by constant factors. The sum then corresponds to a multiplication by the added factors. If both functions are identical, the result is y + y = 2y ∌ 𝒩(2Ό, 4Σ).

Difference: uncorrelated functions

If the difference of two independent, uncorrelated functions is formed, then their mean functions are subtracted while their covariance functions still add up:

y1y2𝒩(μ1,Σ1)𝒩(μ2,Σ2)=𝒩(μ1μ2,Σ1+Σ2).

Subtraction of a Correlated Component

If the function y_2 of a Gaussian process describes an additive component of the function y_1 of another Gaussian process, then subtracting this component results in the subtraction of both the mean functions and the covariance functions:

y1y2𝒩(μ1,Σ1)𝒩(μ2,Σ2)=𝒩(μ1μ2,Σ1Σ2)

The backslash operator was symbolically used here in the sense of "without the contained component".

Multiplication

The following multiplication with an arbitrary matrix F also includes the special cases of the product with a function (diagonal matrix F) or with a scalar (F=c𝕀):

FyF𝒩(μ,Σ)=𝒩(Fμ,FΣF)

It should be noted here that the product of the functions of two Gaussian processes with each other would not result in another Gaussian process, since the resulting probability distribution would have lost the property of being Gaussian or normal.

General linear transformation

All previously shown operations are special cases of the general linear transformation:

A𝒩(μ1,Σ1)+B𝒩(μ2,Σ2)=𝒩(Aμ1+Bμ2,AΣ1A+BΣ2B+AΣ12B+BΣ12A)

This relation[8] describes the sum A·y_1 + B·y_2 with constant matrices A and B and the support point vectors y_1 and y_2 of the functions of two Gaussian processes with y_1 ∌ 𝒩(ÎŒ_1, Σ_1) and y_2 ∌ 𝒩(ÎŒ_2, Σ_2). For partially correlated functions y_1 and y_2, the cross-covariance matrix Σ_12 must be given, and all variables must be jointly normal (i.e. they must follow a common multivariate normal distribution) as a precondition. In that case the sum A·y_1 + B·y_2 is correlated with y_1 by the cross-covariance matrix A·Σ_1 + B·Σ_12^T and with y_2 by A·Σ_12 + B·Σ_2.[9] A cross-covariance matrix Σ_XY between two functions X and Y can be converted into a cross-correlation matrix C_XY using their covariance matrices Σ_X and Σ_Y through the relation [C_XY]_ij = [Σ_XY]_ij / √([Σ_X]_ii·[Σ_Y]_jj). In the case of two partially correlated Gaussian processes, it should be noted that special dependencies may exist for which the sum does not result in a normal distribution, and the equation accordingly loses its validity even though both input quantities are normally distributed.
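
As an illustrative sketch (not from the original), the following Python snippet evaluates the general linear transformation rule for two small jointly normal vectors and cross-checks it against a Monte Carlo estimate; the dimensions, seeds and values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two jointly normal 3-dimensional vectors y1, y2 with cross-covariance S12.
mu1 = np.array([1.0, 0.0, -1.0])
mu2 = np.array([0.5, 0.5, 0.5])
L = rng.normal(size=(6, 6))
S = L @ L.T + 6 * np.eye(6)                     # joint covariance of (y1, y2)
S1, S2, S12 = S[:3, :3], S[3:, 3:], S[:3, 3:]

A = rng.normal(size=(2, 3))
B = rng.normal(size=(2, 3))

# Rule from the text:
# A y1 + B y2 ~ N(A mu1 + B mu2, A S1 A^T + B S2 B^T + A S12 B^T + B S12^T A^T)
mean_rule = A @ mu1 + B @ mu2
cov_rule = A @ S1 @ A.T + B @ S2 @ B.T + A @ S12 @ B.T + B @ S12.T @ A.T

# Monte Carlo cross-check
joint = rng.multivariate_normal(np.concatenate([mu1, mu2]), S, size=200_000)
z = joint[:, :3] @ A.T + joint[:, 3:] @ B.T
print("max mean deviation:", np.abs(mean_rule - z.mean(axis=0)).max())
print("max covariance deviation:", np.abs(cov_rule - np.cov(z.T)).max())
```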

Fusion

If the same unknown function is described by two different Gaussian processes whose errors are mutually uncorrelated, then a union or fusion (also sensor fusion) of the two pieces of partial information can be formed in order to reduce the error or variance. For example, in signal processing, the same waveform may be measured by two different sensors (such as the trajectory of an aircraft by an inertial sensor and, independently, by GNSS positioning), each adding its own independent noise or error signal. The fused distribution

Σ_Fusion = (Σ_1^{-1} + Σ_2^{-1})^{-1}
Ό_Fusion = Σ_Fusion·Σ_1^{-1}·Ό_1 + Σ_Fusion·Σ_2^{-1}·Ό_2

corresponds to the overlap or the normalized product of the two probability density functions and describes the most likely Gaussian process taking into account both parts of information (see also Inverse-variance weighting). The expressions can also be rearranged,[10] such that only one matrix inversion needs to be performed:

μFusion=μ1Σ1(Σ1+Σ2)1(μ1μ2)=Σ2(Σ1+Σ2)1μ1+Σ1(Σ1+Σ2)1μ2
ΣFusion=Σ1Σ1(Σ1+Σ2)1Σ1=Σ1(Σ1+Σ2)1Σ2

The validity of the formula requires function pairs with entirely uncorrelated errors. However, if there is partial correlation with cross-covariance Σ_12, then the extended and generalized formula, the so-called Bar-Shalom-Campo fusion, applies,[11] where the correlated part is temporarily subtracted and then added back after fusion:

μFusion=μ1(Σ1Σ12)(Σ1+Σ2Σ12Σ21)1(μ1μ2)
ΣFusion=Σ1(Σ1Σ12)(Σ1+Σ2Σ12Σ21)1(Σ1Σ21)

Decomposition

A given function y_sum can be approximately decomposed into its additive components when the prior distributions of the entire function and of the components are given. According to the addition rule, the Gaussian process of the entire function

μsum=μ1++μn
Σsum=Σ1++Σn

is composed of the prior distributions of the components. The individual components yi can then be estimated by the posterior Gaussian processes

μpost,i=μi+ΣiΣsum1(ysumμsum)
Σpost,i=ΣiΣiΣsum1Σi

which are correlated to each other by the cross covariances

Σ_post,i,j = −Σ_i·Σ_sum^{-1}·Σ_j.

Apart from very specific cases, this decomposition is ambiguous. The components are therefore coupled probability distributions of possible solutions around the most likely components (see also Example: Signal Decomposition).

The decomposition is based on the equations for fusion in the previous section, which are applied to the specific distributions 𝒩(Ό_sum, Σ_sum) and 𝒩(Ό_i, Σ_i). The density product or overlap extracts the corresponding component in this case.[12]
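
The following Python sketch (not from the original) decomposes a simulated sum of a smooth component and white noise with the posterior formulas above; the covariance choices and grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 10.0, 200)
r = np.abs(t[:, None] - t[None, :])

# Prior covariances of the two components: a smooth part and white noise (zero mean functions).
S_smooth = np.exp(-r**2 / 2.0)
S_noise = 0.1 * np.eye(len(t))
S_sum = S_smooth + S_noise

# Simulated sum signal drawn from the prior of the total.
y_sum = rng.multivariate_normal(np.zeros(len(t)), S_sum)

# Posterior means of the components (mu_i = 0, mu_sum = 0):
alpha = np.linalg.solve(S_sum, y_sum)
y_smooth = S_smooth @ alpha
y_noise = S_noise @ alpha

# The estimated components add up to the observed sum again.
print(np.allclose(y_smooth + y_noise, y_sum))
```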

Gaussian process regression

Introduction

Gaussian processes can be used to interpolate, extrapolate, or smooth discrete measurement data of a mapping ℝ^n → ℝ. This application of Gaussian processes is called Gaussian process regression. For historical reasons, the method is often called kriging, especially in the spatial domain. It is particularly suitable for problems for which no specific model function is known. Its character as a machine learning method allows automatic model building based on observations. In this application, a Gaussian process captures the typical behavior of the system, from which the optimal interpolation for the problem can be derived. The result is a probability distribution of possible interpolation functions together with the solution of highest probability.

Overview of the individual steps

The calculation of a Gaussian process regression can be performed by the following steps:

  1. Prior mean function: If there is a consistent trend in the measured values, a prior mean function is constructed to equalize the trend.
  2. Prior covariance function: The covariance function is selected according to certain qualitative properties of the system or composed from covariance functions of different properties according to certain rules.
  3. Fine-tuning of parameters: To obtain quantitatively correct covariances, the selected covariance function is adjusted to the available measured values, either in a targeted manner or by an optimization procedure, until it reflects the empirical covariances.
  4. Conditional distribution: By considering known measured values, the conditional posterior Gaussian process is calculated from the prior Gaussian process for new support points with still unknown values.
  5. Interpretation: Finally, from the posterior Gaussian process, the mean function is taken as the best possible interpolation and, if required, the diagonal of the covariance function is taken as the location-dependent variance.

Step 2: Prior covariance function

In practical applications, a Gaussian process must be determined from finitely many discrete measured values or finitely many sample curves. In analogy to the one-dimensional Gaussian distribution, which is completely determined by the mean and standard deviation of discrete measured values, one would expect to need several individual but complete functions f_i(t) in order to calculate the mean function

m(t) = (1/N) ÎŁ_{i=1}^{N} f_i(t)

and the (empirical) covariance function

k(t,t′) = 1/(N−1) · ÎŁ_{i=1}^{N} [f_i(t) − m(t)]·[f_i(t′) − m(t′)].

Regression problem and stationary covariance

Often, however, no such ensemble of example functions is available. In the regression problem, instead, only discrete interpolation points of a single function are known, which are to be interpolated or smoothed. A Gaussian process can be determined in such a case as well. For this purpose, instead of this single function, a set of many copies of the function shifted relative to each other is considered. This distribution can now be described with the help of a covariance function. Usually it can be expressed as a function of the shift alone, k(t,t′) = k(t − t′). It is then called a stationary covariance function; it applies equally to all locations of the function and describes the everywhere identical (thus stationary) correlation of each point with its neighborhood, as well as the correlation of neighboring points with each other.

The covariance function is represented analytically and determined heuristically or looked up in the literature. The free parameters of the analytical covariance functions are fitted to the measured values. Many physical systems have a similar form of the stationary covariance function, so that with a few tabulated analytical covariance functions most applications can be described. For example, there are covariance functions for abstract properties such as smoothness, roughness (lack of differentiability), periodicity or noise, which can be combined and fitted according to certain rules to reproduce the properties of the measured values.

Examples of stationary covariance

The following table shows examples of covariance functions with such abstract properties. The example curves are random samples of the respective Gaussian process and represent typical function shapes. They were generated with the corresponding covariance matrix Σ_ij = k(t_i, t_j) and a random generator for multidimensional normal distributions as a correlated random vector; a short sampling sketch follows after the table. The stationary covariance functions k(t,t′) are abbreviated here as one-dimensional functions k(r) with r := |t − t′|.

Properties | Example stationary covariance function k(r)
Constant | k(r) = 1
Smooth | k(r) = exp(−r^2/5)
Rough | k(r) = exp(−r/15)
Periodic | k(r) = exp(−|sin(0.4·π·r)|/2.5)
Noise | k(r) = 0.2 for r = 0;  0 for r ≠ 0
Mixed (periodic, smooth and noisy) | k(r) = exp(−sin^2(π·r/2)/4 − r^2/40) + (0.005 for r = 0; 0 for r ≠ 0)

(The random sample functions f(t) belonging to each row are shown as figures.)
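
The sampling procedure described above can be sketched in a few lines of Python (not part of the original article); the grid, jitter and seed are illustrative assumptions.

```python
import numpy as np

def sample_gp(t, k, n_samples=3, seed=0):
    """Draw random functions from a zero-mean GP with stationary covariance k(r)."""
    r = np.abs(t[:, None] - t[None, :])
    K = k(r) + 1e-6 * np.eye(len(t))          # small jitter for numerical stability
    L = np.linalg.cholesky(K)                 # K = L L^T
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_samples, len(t))) @ L.T

t = np.linspace(0.0, 50.0, 300)
smooth   = sample_gp(t, lambda r: np.exp(-r**2 / 5))
rough    = sample_gp(t, lambda r: np.exp(-r / 15))
periodic = sample_gp(t, lambda r: np.exp(-np.abs(np.sin(0.4 * np.pi * r)) / 2.5))
print(smooth.shape, rough.shape, periodic.shape)   # (3, 300) each
```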

Construction of new covariance functions

The properties can be combined according to certain computational rules. The basic goal in constructing a covariance function is to reproduce the true covariances as precisely as possible, while at the same time satisfying the condition of positive definiteness. The examples shown, except for the constant, have the latter property, and the additions and multiplications of such functions also remain positive definite. The constant covariance function is only positive semidefinite and must be combined with at least one positive definite function. The lowest covariance function in the table shows a possible mixture of different properties. The functions in this example are periodic over a certain distance, have a relatively smooth behavior and are overlaid with a certain measurement noise.

For mixed properties, the following rules apply:[13]

  • In the case of additive effects, the covariances are added, as for example in the superposition of measurement noise.
  • For reinforcing or mitigating effects to each other, the covariances are multiplied, such as in case of the slow decay of periodicity.

Multidimensional functions

What is shown here with one-dimensional functions can be transferred analogously also to multi-dimensional systems, by simply replacing the distance r by a corresponding n-dimensional distance norm. The support points in the higher dimensions are unrolled in an arbitrary order and represented by vectors, so that they can be processed in the same way as in the one-dimensional case. The following two figures show two examples with two-dimensional Gaussian processes and different stationary and radial covariance functions. In the respective right figure a random draw of the Gaussian process is shown.

Random sample of a 2D Gaussian process with an absolute-exponential radial covariance function. Random sample of a 2D Gaussian process with a squared-exponential radial covariance function.

Non-stationary covariance functions

Gaussian processes can also have non-stationary properties of the covariance function, that is, relative covariance functions that change as a function of location. The literature describes how nonstationary covariance functions can be constructed so that positive definiteness is ensured here as well. A simple possibility is, for example, an interpolation of different covariance functions over the location with the inverse distance weighting.

Step 3: Fine tuning of parameters

The qualitatively constructed covariance functions contain parameters, called hyperparameters, which must be tuned (or calibrated) to the system in order to obtain quantitatively correct results. This can be done by direct knowledge about the system, e.g., the known value of the standard deviation of the measurement noise or the prior standard deviation of the overall system (sigma prior, the square corresponds to the diagonal elements of the covariance matrix).

However, the parameters can also be adjusted automatically. For this purpose, one uses the marginal likelihood, i.e., the probability density for a given measured curve as a metric for the agreement between the assumed Gaussian process and the existing measured curve. The parameters are then optimized to maximize this agreement. Since the exponential function is strictly monotone, it is sufficient to maximize the exponent of the probability density function, the so-called log-marginal likelihood function[14]

log p(𝐲) = −œ·𝐲^T·Σ^{-1}·𝐲 − œ·log|Σ| − (n/2)·log(2π)

with the measurement vector 𝐲 of length n and the hyperparameter-dependent covariance matrix Σ. Mathematically, maximizing the marginal likelihood yields an optimal tradeoff between accuracy (minimizing the residuals) and simplicity of the theory. A simple theory is characterized by large off-diagonal elements, describing a high correlation in the system. This means that there are few degrees of freedom in the system and that, in some sense, the theory gets by with few rules to explain all correlations. If these rules are chosen too simple, the measurements are not reproduced sufficiently well and the residual errors grow too large. At the maximum of the marginal likelihood, the equilibrium of an optimal theory is found, provided that sufficiently many measurement data are available for good conditioning. This implicit property of maximum likelihood estimation can also be understood as Ockham's principle of parsimony.
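
As a hedged sketch of this step (not from the original), the following Python snippet maximizes the log-marginal likelihood of a squared-exponential covariance with noise over its hyperparameters using scipy; the toy data, kernel choice and optimizer settings are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
t = np.linspace(0.0, 10.0, 60)
y = np.sin(t) + 0.1 * rng.standard_normal(len(t))   # toy measurements, zero prior mean assumed

def neg_log_marginal_likelihood(log_params):
    ell, sigma_f, sigma_n = np.exp(log_params)       # length scale, signal std, noise std
    r = t[:, None] - t[None, :]
    K = sigma_f**2 * np.exp(-r**2 / (2 * ell**2)) + (sigma_n**2 + 1e-8) * np.eye(len(t))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log p(y) = -1/2 y^T K^-1 y - 1/2 log|K| - n/2 log(2*pi)
    logp = -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(t) * np.log(2 * np.pi)
    return -logp

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), method="Nelder-Mead")
print("fitted length scale, signal std, noise std:", np.exp(res.x))
```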

Step 4: Conditional Gaussian process with known support points

If the Gaussian process of a system has been determined as described above, i.e. if the prior mean function and covariance function are known, a prediction of arbitrary interpolated intermediate values can be computed with the Gaussian process once a few support points of the desired function are known from measurements. The prediction is obtained from the conditional probability of a multidimensional Gaussian distribution given partial information. The dimensions of the multidimensional Gaussian distribution

X = (X_U, X_K)^T ∌ 𝒩( (Ό_U, Ό_K)^T, Σ ),   Σ = [[Σ_UU, Σ_UK], [Σ_KU, Σ_KK]]

are divided into the unknown values to be predicted (index U for unknown) and the known measured values (index K for known). Vectors thereby decompose into two parts. The covariance matrix is accordingly divided into four blocks: covariances within the unknown values (UU), within the known measured values (KK), and covariances between the unknown and known values (UK and KU). The values of the covariance matrix are taken at discrete points of the covariance function and the mean vector at corresponding points of the mean function: Σ_ij = k(t_i, t_j) and ÎŒ_i = m(t_i).

By considering the known measured values XK, the distribution changes to the conditional or posterior normal distribution

XUXK𝒩(μU+ΣUKΣKK1(XKμK),ΣUUΣUKΣKK1ΣKU),

where X_U are the unknown variables to be determined. The notation ∣ X_K reads as "given X_K", i.e. under the condition that X_K is known.

The first parameter of the resulting Gaussian distribution describes the new mean vector we are looking for, which now corresponds to the most likely function values of the interpolation. In addition, the entire predicted new covariance matrix is given in the second parameter. In particular, this contains the confidence intervals of the predicted mean values, given by the root of the main diagonal elements.
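
A minimal Python sketch of this step (not from the original) evaluates the conditional distribution for a grid of unknown points given a few known support points; the squared-exponential prior, the toy data and all names are illustrative assumptions.

```python
import numpy as np

def k(a, b, ell=1.0, sigma_f=1.0):
    """Squared-exponential prior covariance function (an illustrative choice)."""
    return sigma_f**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

# Known measurements (index K) and points to be predicted (index U); prior mean 0.
t_K = np.array([1.0, 2.5, 4.0, 6.0])
x_K = np.sin(t_K)
t_U = np.linspace(0.0, 7.0, 141)

S_KK = k(t_K, t_K)
S_UK = k(t_U, t_K)
S_UU = k(t_U, t_U)

# Conditional (posterior) distribution:
# mu_post = mu_U + S_UK S_KK^-1 (x_K - mu_K),   S_post = S_UU - S_UK S_KK^-1 S_KU
mu_post = S_UK @ np.linalg.solve(S_KK, x_K)
S_post = S_UU - S_UK @ np.linalg.solve(S_KK, S_UK.T)

std = np.sqrt(np.clip(np.diag(S_post), 0.0, None))   # pointwise confidence band
print(mu_post[:5])
print(std[:5])
```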

Measurement noise and other interfering signals

White measurement noise of variance σ_noise^2 can be modeled as part of the prior covariance model by adding appropriate terms to the diagonal of Σ_KK. If the same covariance function were also used to form the matrix Σ_UU, the predicted distribution would likewise describe white noise of variance σ_noise^2. To obtain a prediction of the noise-free signal, in the posterior distribution

XUXK𝒩(μU+ΣUK[ΣKK+𝕀σnoise2]1(XKμK),ΣUUΣUK[ΣKK+𝕀σnoise2]1ΣKU)

the corresponding terms are omitted in Σ_UU and, if applicable, in Σ_UK and Σ_KU. This averages out the measurement noise as well as possible, which is also correctly reflected in the predicted confidence interval. In the same way, any unwanted additive interfering signal can be removed from the measurement data (see also the arithmetic operation decomposition), provided that it can be described by a covariance function and is sufficiently well distinguishable from the desired signal component. For this purpose, instead of the diagonal matrix 𝕀·σ_noise^2, the corresponding covariance matrix of the interference Σ_noise is used. Measurements with noisy signals thus require two covariance models: k(t,t′) for the desired signal component to be estimated and k(t,t′) + k_noise(t,t′) for the raw signal.
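
The following variant of the previous sketch (again not from the original) shows this modification: the noise variance is added only to Σ_KK, so the prediction estimates the noise-free signal; all values are illustrative.

```python
import numpy as np

def k(a, b, ell=1.0, sigma_f=1.0):
    return sigma_f**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

sigma_noise = 0.1
rng = np.random.default_rng(4)
t_K = np.array([1.0, 2.5, 4.0, 6.0])
x_K = np.sin(t_K) + sigma_noise * rng.standard_normal(len(t_K))   # noisy measurements
t_U = np.linspace(0.0, 7.0, 141)

# The noise term enters only the covariance of the raw measurements (S_KK);
# S_UU and S_UK describe the noise-free signal that is to be predicted.
S_KK = k(t_K, t_K) + sigma_noise**2 * np.eye(len(t_K))
S_UK = k(t_U, t_K)
S_UU = k(t_U, t_U)

mu_post = S_UK @ np.linalg.solve(S_KK, x_K)
S_post = S_UU - S_UK @ np.linalg.solve(S_KK, S_UK.T)
print(mu_post[:3], np.sqrt(np.clip(np.diag(S_post), 0.0, None))[:3])
```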

Derivation of the conditional distribution

The derivation can be done via the Bayes formula by substituting the two probability densities for known and unknown support points and the composite probability density. The resulting conditional posterior normal distribution corresponds to the overlap or intersection of the Gaussian distribution with the subvector space spanned by the known values.

For noisy measurements that are themselves a multidimensional normal distribution, the overlap to the prior distribution is obtained by multiplying the two probability densities. The product of the probability densities of two multidimensional normal distributions corresponds to the arithmetic operation Fusion, which can be used to derive the distribution where the noise is suppressed.

Posterior Gaussian process

In the full notation as a Gaussian process, the prior Gaussian process

(X_t) ∌ 𝒒𝒫(m, k)

and the n known measurements 𝐱 = (x_1, x_2, 
, x_n) at the coordinates 𝐭 = (t_1, t_2, 
, t_n) yield a new distribution, given by the conditional posterior Gaussian process

(X_t ∣ 𝐭, 𝐱) ∌ 𝒒𝒫(m_post, k_post)

with

m_post(t) = m(t) + 𝐀(t, 𝐭)·K(𝐭, 𝐭)^{-1}·(𝐱 − m(𝐭))
k_post(t, t′) = k(t, t′) − 𝐀(t, 𝐭)·K(𝐭, 𝐭)^{-1}·𝐀(𝐭, t′).

Here, K(𝐭, 𝐭) is the covariance matrix obtained by evaluating the covariance function k at the discrete rows t_i and columns t_j. Accordingly, 𝐀(t, 𝐭) and 𝐀(𝐭, t′) are vectors of functions formed by evaluating k at discrete points in only one of its two arguments.

In practical numerical calculations with finite numbers of support points, only the equation of the conditional multivariate normal distribution is used. The notation of the posterior Gaussian process serves here only the theoretical understanding, in order to describe the limit towards the continuum in the form of functions and thus to depict the assignment of the values to the coordinates.

Step 5: Interpretation

From the prior Gaussian process and the measured values, a posterior Gaussian process is obtained which takes the known partial information into account. This result of the Gaussian process regression represents not just one solution, but the entirety of all possible interpolation functions, weighted with different probabilities. The indecision expressed in this way is not a weakness of the method: it does full justice to the problem, since for a theory that is not completely known, or for noisy measurements, the solution cannot in principle be determined unambiguously. Usually, however, one is at least interested in the solution with the highest probability. This is given by the mean function m_post(t) in the first parameter of the posterior Gaussian process. From the conditional covariance function in the second parameter, the scatter around this solution can be obtained. The diagonal k_post(t,t) of the covariance function gives a function with the variances of the predicted most likely function. The confidence interval is then given by the bounds m_post(t) ± √(k_post(t,t)).

The Python code for the examples can be found on the respective image description page.

Special cases

Underdetermined measurements

In some cases of conditional Gaussian processes, groups of linearly related measured values are completely undetermined. This is the case, for example, for indirect measurements arising from underdetermined systems of equations, such as with a non-invertible positive semidefinite matrix of the form A^T·Σ^{-1}·A. The grid points then cannot simply be partitioned into known and unknown values, and the associated covariance matrix would be singular due to infinite uncertainties. This would correspond to a normal distribution that is infinitely stretched in certain directions transverse to the coordinate axes. To account for the relationships between the undetermined variables, the inverse matrix Σ_2^{-1}, called the precision matrix, must be used in such a case. It can describe completely undetermined measurements, which is expressed by zeros on its diagonal. For such a singular distribution 𝒩(ÎŒ_2, Σ_2) with partially unknown measurements ÎŒ_2 and singular measurement uncertainties Σ_2, the desired posterior distribution is obtained as the overlap with the prior Gaussian process model 𝒩(ÎŒ_1, Σ_1), calculated by multiplying the probability densities. The union of the two normal distributions

Σ_Fusion = (𝕀 + Σ_1·Σ_2^{-1})^{-1}·Σ_1
Ό_Fusion = (𝕀 + Σ_1·Σ_2^{-1})^{-1}·Ό_1 + Σ_Fusion·Σ_2^{-1}·Ό_2

is obtained from the fusion operation after an appropriate transformation, such that the singular one of the two matrices appears only in inverted form (as a precision matrix). The result is always a finite distribution, since the finite matrix dominates. If both matrices are finite, the equation can be brought into the form of the posterior Gaussian process as in the section on the conditional distribution.
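
A minimal numerical sketch of these formulas (not from the original) fuses a finite prior with a measurement whose precision matrix is singular, i.e. only a linear combination of the values is measured; the toy numbers are arbitrary assumptions.

```python
import numpy as np

# Prior model over three values (finite covariance S1, mean mu1).
mu1 = np.zeros(3)
S1 = np.array([[1.0, 0.8, 0.5],
               [0.8, 1.0, 0.8],
               [0.5, 0.8, 1.0]])

# "Measurement": only the sum of the last two values is known (value 4, variance 0.01),
# so the individual components stay undetermined; expressed directly as a precision matrix.
a = np.array([0.0, 1.0, 1.0])
P2 = np.outer(a, a) / 0.01            # precision matrix S2^-1 (singular, zeros allowed)
mu2 = np.array([0.0, 2.0, 2.0])       # any vector consistent with the measured sum

I = np.eye(3)
S_fused = np.linalg.solve(I + S1 @ P2, S1)              # (I + S1 S2^-1)^-1 S1
mu_fused = np.linalg.solve(I + S1 @ P2, mu1) + S_fused @ P2 @ mu2
print(mu_fused, mu_fused[1] + mu_fused[2])              # last two roughly sum to 4
```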

Linear combination to a Gaussian process

From given basis functions φ_j(t), a linear combination is to be formed that has maximum overlap with the distribution 𝒩(ÎŒ, Σ) of an associated Gaussian process 𝒒𝒫(m, k). Alternatively, measured values ÎŒ are to be approximated while the interfering signal 𝒩(0, Σ) contained in them is ignored as far as possible. In both cases, the desired coefficients can be calculated using generalized least squares estimation:

c=(AΣ1A)1AΣ1μ
Σc=(AΣ1A)1

The matrix A_ij = φ_j(t_i) contains the function values of the basis functions φ_j(t) at the interpolation points t_i. The resulting coefficients c with the associated covariance matrix Σ_c describe the linear combination with the largest possible probability density in the distribution 𝒩(ÎŒ, Σ). The linear combination thereby approximates the mean function or the measured values ÎŒ in such a way that the residuals are best described by the covariance matrix Σ. The method is used, for example, in the program library Scikit-learn to empirically estimate a polynomial mean function of a Gaussian process.
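
A hedged sketch of this estimator (not from the original) fits a quadratic polynomial basis to values containing a correlated interfering signal by generalized least squares; the basis, covariance and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0.0, 5.0, 50)

# Basis functions phi_j(t): a quadratic polynomial (illustrative choice).
A = np.vander(t, N=3, increasing=True)          # columns: 1, t, t^2

# Correlated interfering signal with covariance Sigma (squared exponential + tiny noise).
r = t[:, None] - t[None, :]
Sigma = 0.3**2 * np.exp(-r**2 / 0.5) + 1e-4 * np.eye(len(t))

true_c = np.array([1.0, -0.5, 0.2])
mu = A @ true_c + rng.multivariate_normal(np.zeros(len(t)), Sigma)

# Generalized least squares:
# c = (A^T Sigma^-1 A)^-1 A^T Sigma^-1 mu,   Sigma_c = (A^T Sigma^-1 A)^-1
Si_A = np.linalg.solve(Sigma, A)
Sigma_c = np.linalg.inv(A.T @ Si_A)
c = Sigma_c @ (Si_A.T @ mu)
print("estimated coefficients:", c)
print("their standard deviations:", np.sqrt(np.diag(Sigma_c)))
```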

Approximation of an empirical Gaussian process

An empirically determined Gaussian process

m(t) = (1/N) ÎŁ_{p=1}^{N} f_p(t)
k(t,t′) = 1/(N−1) · ÎŁ_{p=1}^{N} [f_p(t) − m(t)]·[f_p(t′) − m(t′)]

from example functions fp(t) with few distinct degrees of freedom can be approximated and simplified by means of the eigenvalue decomposition or singular value decomposition

Σ = V·S·V^T

of the covariance matrix Σ_ij = k(t_i, t_j). This is done by choosing the n largest eigenvalues or singular values λ_p = σ_p^2 from the diagonal matrix S. The corresponding columns v_p of V are the principal components of the Gaussian process (see principal component analysis). If the columns are represented as functions v_p(t), then the original Gaussian process is approximated by the mean function m(t) and the covariance function

k(t,t′) ≈ ÎŁ_{p=1}^{n} σ_p^2·v_p(t)·v_p(t′)

This Gaussian process describes exclusively functions of the linear combination

f(t) = m(t) + ÎŁ_p c_p·v_p(t),

where each coefficient c_p is scattered around zero mean as an independent random variable of variance σ_p^2 = λ_p.

Such a simplification is positive semidefinite, and it usually lacks the properties needed to describe small-scale variations. These properties can be added to the covariance function in the form of a stationary covariance function fitted to the residuals:

k(t,t′) ≈ ÎŁ_{p=1}^{n} σ_p^2·v_p(t)·v_p(t′) + k_stat(t − t′)
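
The following Python sketch (not from the original) illustrates this approximation on synthetic example curves with two underlying degrees of freedom; the simulated data and the number of retained components are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0.0, 1.0, 100)

# Example curves with two underlying degrees of freedom plus small-scale noise.
N = 200
c1 = rng.normal(size=(N, 1))
c2 = rng.normal(size=(N, 1))
curves = 1.0 + c1 * t + c2 * np.sin(2 * np.pi * t) + 0.02 * rng.standard_normal((N, len(t)))

# Empirical mean function and covariance matrix.
m = curves.mean(axis=0)
Sigma = np.cov(curves.T)

# Eigenvalue decomposition; keep the n largest eigenvalues / principal components.
eigval, eigvec = np.linalg.eigh(Sigma)
order = np.argsort(eigval)[::-1]
n = 2
lam, V = eigval[order[:n]], eigvec[:, order[:n]]

# Low-rank approximation  k(t,t') ~= sum_p sigma_p^2 v_p(t) v_p(t')
Sigma_approx = V @ np.diag(lam) @ V.T
print("largest eigenvalues:", lam)
print("max approximation error:", np.abs(Sigma - Sigma_approx).max())
```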

Application examples

Example: Trend prediction

In a hypothetical application example from market research, the future demand for the topic "snowboard" is to be predicted. For this purpose, an extrapolation of the number of Google searches[15] on this term is to be calculated.

In the past data, one can see a periodic, but non-sinusoidal seasonal dependence, which can be explained by the winter in the northern hemisphere. Moreover, the trend decreased continuously over the last decade. In addition, one recognizes a recurring increase in search queries during the Olympic Games every four years. The covariance function was therefore modeled with a slow trend and a one- and four-year period:

k(r) = 0.8·exp(−|sin(π·r)|/2 − |r/25|^2/2.5) + (0.2 − 0.01)·exp(−|sin(π·r/4)|/0.2) + 0.01·exp(−r/45)

The trend also appears to have a significant asymmetry. This can be the case if the underlying random effects do not add up but reinforce each other, resulting in a Log-Normal Distribution. However, the logarithm of such values describes a normal distribution, to which Gaussian process regression can be applied.

Gaussian process regression for Google trend statistics for the search term "snowboard"

The figure shows an extrapolation of the curve (to the right of the dashed line). Since the results here were transformed back from the logarithmic plot using an exponential function, the predicted confidence intervals are correspondingly asymmetrical (gray area). The extrapolation plausibly shows the seasonal patterns and also the increase in searches for the Olympic Games every four years. The example with mixed properties demonstrates very well the versatile modeling possibilities of the Gaussian process regression, which are unified in a single interpolation method.

Python source code of the example calculation

Example: Sensor calibration

In an application example from industry, sensors are to be calibrated using Gaussian processes. Due to tolerances during manufacturing, the characteristic curves f(x) of the sensors show large individual differences. This causes high costs in calibration, since a complete characteristic curve would have to be measured for each sensor. However, the effort can be minimized by learning the exact behavior of the scattering by a Gaussian process. For this purpose, the complete characteristic curves fi(x) of N randomly selected representative sensors are measured and thus the Gaussian process 𝒒𝒫(m,k) of the scattering is calculated by

m(x) = (1/N) ÎŁ_{i=1}^{N} f_i(x)
k(x,x′) = 1/(N−1) · ÎŁ_{i=1}^{N} [f_i(x) − m(x)]·[f_i(x′) − m(x′)]

In the example shown, 15 representative characteristic curves are given. The resulting Gaussian process is represented by the mean function m(x) and the confidence interval m(x) ± √(k(x,x)).

15 complete representative characteristic curves, randomly selected to calculate a Gaussian process. Prior Gaussian process: mean function and confidence interval of the characteristic curves.

With the conditional Gaussian process 𝒒𝒫(m_post, k_post) with

m_post(x) = m(x) + 𝐀(x, 𝐱)·K(𝐱, 𝐱)^{-1}·(𝐲 − m(𝐱))
k_post(x, x′) = k(x, x′) − 𝐀(x, 𝐱)·K(𝐱, 𝐱)^{-1}·𝐀(𝐱, x′)

the complete characteristic curve can now be reconstructed for each new sensor from a few individual measured values 𝐲 at the coordinates 𝐱. The number of measured values must at least equal the number of degrees of freedom of the tolerances that have an independent linear influence on the shape of the characteristic curve.

In the example shown, a single measured value is not yet sufficient to determine the characteristic curve unambiguously and precisely. The confidence interval shows the region of the curve which is not yet sufficiently accurate. With another measured value in this range, the remaining uncertainty can finally be completely eliminated. The exemplary fluctuations of the very differently acting sensors in this example thus seem to be caused by the tolerances of only two relevant inner degrees of freedom.

Calibration of a new sensor: a single measuring point is not yet sufficient for a reconstruction of the characteristic curve. With two measuring points, no degrees of freedom remain and the characteristic curve is unambiguously reconstructed.

Python source code of the example calculation

Example: Signal decomposition

In a signal processing application example, a temporal signal is to be decomposed into its components. Let it be known about the system that the signal consists of three components following the three covariance functions

k_1(r) = 2.7^2·exp(−r^2)
k_2(r) = 2.7^2·exp(−0.4·|sin(r/2.5)|)
k_3(r) = 0.6^2·Ύ_r

The sum signal then follows the addition rule of the covariance function

k_sum(r) = k_1(r) + k_2(r) + k_3(r).

The following two figures show three random signals which were generated and added for demonstration with these covariance functions. In the sum of the signals one can hardly recognize the periodic signal hidden in it with the naked eye, since its spectral range overlaps with that of the two other components.

With the help of the operation decomposition the sum ysum can be decomposed again into the three components

y_1 = Σ_1·Σ_sum^{-1}·y_sum + 3
y_2 = Σ_2·Σ_sum^{-1}·y_sum − 3
y_3 = Σ_3·Σ_sum^{-1}·y_sum

where (Σ_x)_ij = k_x(|t_j − t_i|). The estimate of the most likely decomposition shows how well the separation is possible in this case and how close the estimated signals are to the original signals. The estimated uncertainties, taking into account the cross-correlations, are shown in the animation by random fluctuations.

The example shows how this method can be used to separate very different signals in one step. In contrast, other filtering methods such as moving averaging, Fourier filtering, polynomial regression, or spline approximation are optimized for specific signal characteristics and provide neither accurate error estimates nor cross-correlations.

If the Gaussian processes of the individual components are not precisely known for a given signal, hypothesis testing can in some cases be performed using the log-marginal likelihood function, provided sufficient data are available for it to be well conditioned. By maximizing it, the parameters of the conjectured covariance functions can be fitted to the measured data.

Python source code of the example calculation

Literature

  ‱ C. E. Rasmussen: Gaussian Processes in Machine Learning. In: Olivier Bousquet et al. (eds.): Advanced Lectures on Machine Learning, ML 2003 (Lecture Notes in Computer Science, vol. 3176). Springer, Berlin/Heidelberg 2004.
  • C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning. MIT Press, 2006, ISBN 0-262-18253-X. (gaussianprocess.org, pdf)
  • R. M. Dudley: Real Analysis and Probability. Wadsworth and Brooks/Cole, 1989.
  • B. Simon: Functional Integration and Quantum Physics. Academic Press, 1979.
  • M. L. Stein: Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.


References

  1. ↑ Template:Citation
  2. ↑ C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning, Chapter 4.1 Preliminaries
  3. ↑ Topics in Probability: Gaussian Analysis, Math 7880-1, Spring 2015, University of Utah, Chapter 6 "Gaussian Processes", see Definition 1.7 for stationarity and Lemma 1.8 for translation invariance.
  4. ↑ C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning. MIT Press, 2006, ISBN 0-262-18253-X, chapter 4.2 Examples of Covariance Functions, page 85
  5. ↑ C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning. MIT Press, 2006, ISBN 0-262-18253-X, chapter 4.2 Examples of Covariance Functions, page 84
  6. ↑ C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning MIT Press, 2006, ISBN 0-262-18253-X, Chapter 4.2.2 Dot Product Covariance Functions. p. 89 and Table 4.1, p. 94.
  7. ↑ C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning. MIT Press, 2006, ISBN 0-262-18253-X, Chapter 4 "Covariance Functions", valid covariance functions are listed as "ND" in Table 4.1 on page 94.
  8. ↑ The derivation of the general linear transformation is based on the equation F·𝒩(Ό, Σ) = 𝒩(F·Ό, F·Σ·F^T), by choosing the matrix F as [A B], Ό as the stacked vector (Ό_1; Ό_2), and Σ from the corresponding four blocks.
  9. ↑ The derivation is based on the covariance rule for multiplication, cov(Ax, By) = A·cov(x, y)·B^T, and on additivity, cov(x, y+z) = cov(x, y) + cov(x, z).
  10. ↑ The transformation involves, for example, multiplying by 1 = Σ_1·Σ_1^{-1} or adding 0 = Σ_1 − Σ_1 and then canceling the inverse matrices accordingly.
  11. ↑ Template:Cite journal
  12. ↑ The strategy corresponds to the a posteriori Gaussian process with measurement uncertainties; see the chapter on Gaussian process regression in the textbook C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning, Chapter 2 Regression. The Kalman filter also uses data fusion to separate signals and measurement uncertainties.
  13. ↑ C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning, Chapter 4.2.4 Making New Kernels from Old. page 94.
  14. ↑ C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning, Chapter 5.2 Bayesian Model Selection. page 108.
  15. ↑ The data is available at Google trends for the search term "snowboard".
