# gpytorch.variational¶

There are many possible variants of variational/approximate GPs. GPyTorch uses three composable objects that make it possible to implement most GP approximations:

• VariationalDistribution, which defines the form of the approximate inducing value posterior $$q(\mathbf u)$$.
• VariationalStrategy, which defines how to compute $$q(\mathbf f(\mathbf X))$$ from $$q(\mathbf u)$$.
• _ApproximateMarginalLogLikelihood, which defines the objective function to learn the approximate posterior (e.g. variational ELBO).

All three of these objects should be used in conjunction with a gpytorch.models.ApproximateGP model.

## Variational Strategies¶

VariationalStrategy objects control how certain aspects of variational inference should be performed. In particular, they define two methods that get used during variational inference:

• The prior_distribution() method determines how to compute the GP prior distribution of the inducing points, e.g. $$p(u) \sim N(\mu(X_u), K(X_u, X_u))$$. Most commonly, this is done simply by calling the user defined GP prior on the inducing point data directly.
• The forward() method determines how to marginalize out the inducing point function values. Specifically, forward defines how to transform a variational distribution over the inducing point values, $$q(u)$$, into a variational distribution over the function values at specified locations x, $$q(f|x)$$, by integrating $$\int p(f|x, u)q(u)du$$.

In GPyTorch, we currently support two categories of this latter functionality. In scenarios where the inducing points are learned (or set to be exactly the training data), we apply the derivation in Hensman et al., 2015 to exactly marginalize out the variational distribution. When the inducing points are constrained to a grid, we apply the derivation in Wilson et al., 2016 and exploit a deterministic relationship between $$\mathbf f$$ and $$\mathbf u$$.

### _VariationalStrategy¶

class gpytorch.variational._VariationalStrategy(model, inducing_points, variational_distribution, learn_inducing_locations=True)[source]

Abstract base class for all Variational Strategies.

forward(x, inducing_points, inducing_values, variational_inducing_covar=None)[source]

The forward() method determines how to marginalize out the inducing point function values. Specifically, forward defines how to transform a variational distribution over the inducing point values, $$q(u)$$, into a variational distribution over the function values at specified locations x, $$q(f|x)$$, by integrating $$\int p(f|x, u)q(u)du$$.

Parameters:
• x (torch.Tensor) – Locations $$\mathbf X$$ at which to compute the variational posterior of the function values.
• inducing_points (torch.Tensor) – Locations $$\mathbf Z$$ of the inducing points.
• inducing_values (torch.Tensor) – Samples of the inducing function values $$\mathbf u$$ (or the mean of the distribution $$q(\mathbf u)$$ if q is Gaussian).
• variational_inducing_covar (LazyTensor) – If the distribution $$q(\mathbf u)$$ is Gaussian, the covariance matrix of that Gaussian. Otherwise, None.

Return type: MultivariateNormal
Returns: The distribution $$q( \mathbf f(\mathbf X))$$

kl_divergence()[source]

Compute the KL divergence between the variational inducing distribution $$q(\mathbf u)$$ and the prior inducing distribution $$p(\mathbf u)$$.

Return type: torch.Tensor
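
When both $$q(\mathbf u) = N(\mathbf m, \mathbf S)$$ and $$p(\mathbf u) = N(\boldsymbol\mu, \mathbf K)$$ are Gaussian over $$k$$ inducing values, this KL divergence has the closed form $$\tfrac12 [ \mathrm{tr}(\mathbf K^{-1} \mathbf S) + (\boldsymbol\mu - \mathbf m)^\top \mathbf K^{-1} (\boldsymbol\mu - \mathbf m) - k + \log\det \mathbf K - \log\det \mathbf S ]$$. The NumPy sketch below evaluates that formula directly. It is not GPyTorch's implementation, and the function name `gaussian_kl` is hypothetical.

```python
import numpy as np


def gaussian_kl(m, S, mu, K):
    """KL( N(m, S) || N(mu, K) ) for full-rank covariance matrices S and K."""
    k = len(m)
    diff = mu - m
    trace_term = np.trace(np.linalg.solve(K, S))   # tr(K^{-1} S)
    quad_term = diff @ np.linalg.solve(K, diff)    # (mu - m)^T K^{-1} (mu - m)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (trace_term + quad_term - k + logdet_K - logdet_S)


# Identical distributions have zero divergence.
kl_same = gaussian_kl(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2))

# Shifting only the mean by (1, 1) gives 0.5 * ||mu - m||^2.
kl_shift = gaussian_kl(np.zeros(2), np.eye(2), np.ones(2), np.eye(2))
```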
prior_distribution

The prior_distribution() method determines how to compute the GP prior distribution of the inducing points, e.g. $$p(u) \sim N(\mu(X_u), K(X_u, X_u))$$. Most commonly, this is done simply by calling the user defined GP prior on the inducing point data directly.

Return type: MultivariateNormal
Returns: The distribution $$p( \mathbf u)$$

### VariationalStrategy¶

class gpytorch.variational.VariationalStrategy(model, inducing_points, variational_distribution, learn_inducing_locations=True)[source]

The standard variational strategy, as defined by Hensman et al. (2015). This strategy takes a set of $$m \ll n$$ inducing points $$\mathbf Z$$ and applies an approximate distribution $$q( \mathbf u)$$ over their function values. (Here, we use the common notation $$\mathbf u = f(\mathbf Z)$$.) The approximate function distribution for any arbitrary input $$\mathbf X$$ is given by:

$q( f(\mathbf X) ) = \int p( f(\mathbf X) \mid \mathbf u) q(\mathbf u) \: d\mathbf u$
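
When $$q(\mathbf u) = N(\mathbf m, \mathbf S)$$ is Gaussian and the GP prior has zero mean, the integral above has the closed form $$q(f(\mathbf X)) = N(\mathbf A \mathbf m, \; \mathbf K_{XX} + \mathbf A (\mathbf S - \mathbf K_{ZZ}) \mathbf A^\top)$$ with $$\mathbf A = \mathbf K_{XZ} \mathbf K_{ZZ}^{-1}$$. The NumPy sketch below evaluates this expression on toy data. The RBF kernel, its lengthscale, and the helper names `rbf_kernel` and `marginalize` are assumptions made for illustration; this is not how GPyTorch implements the strategy internally.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=0.2):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)


def marginalize(x, z, m, S, jitter=1e-8):
    """Mean and covariance of q(f(x)) for q(u) = N(m, S), zero prior mean."""
    K_zz = rbf_kernel(z, z) + jitter * np.eye(len(z))
    K_xz = rbf_kernel(x, z)
    K_xx = rbf_kernel(x, x)
    A = np.linalg.solve(K_zz, K_xz.T).T  # A = K_xz K_zz^{-1} (K_zz is symmetric)
    mean = A @ m
    covar = K_xx + A @ (S - K_zz) @ A.T
    return mean, covar


z = np.linspace(0.0, 1.0, 5)   # inducing locations Z
x = np.linspace(0.0, 1.0, 20)  # query locations X
m = np.sin(6 * z)              # variational mean of q(u)
S = 0.01 * np.eye(len(z))      # variational covariance of q(u)

mean, covar = marginalize(x, z, m, S)
```

Two sanity checks follow from the formula: setting $$\mathbf S = \mathbf K_{ZZ}$$ and $$\mathbf m = \mathbf 0$$ recovers the prior, and querying at $$\mathbf X = \mathbf Z$$ returns $$q(\mathbf u)$$ itself (up to the jitter term).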

This variational strategy uses “whitening” to accelerate the optimization of the variational parameters. See Matthews (2017) for more info.
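
One common reading of whitening, assumed here, is the reparameterization $$\mathbf u = \mathbf L \mathbf v$$ with $$\mathbf K_{ZZ} = \mathbf L \mathbf L^\top$$, so that the prior on $$\mathbf v$$ is a standard normal and the variational parameters describe $$q(\mathbf v) = N(\mathbf m_v, \mathbf S_v)$$ instead of $$q(\mathbf u)$$. The NumPy sketch below checks numerically that the whitened and unwhitened parameterizations give the same $$q(f(\mathbf X))$$ under a zero-mean prior. The kernel and all variable names are illustrative assumptions.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=0.2):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)


rng = np.random.default_rng(0)
z = np.linspace(0.0, 1.0, 5)   # inducing locations Z
x = np.linspace(0.0, 1.0, 12)  # query locations X
K_zz = rbf_kernel(z, z) + 1e-8 * np.eye(len(z))
K_xz = rbf_kernel(x, z)
K_xx = rbf_kernel(x, x)
L = np.linalg.cholesky(K_zz)   # K_zz = L L^T

# Variational parameters of the whitened distribution q(v) = N(m_v, S_v).
m_v = rng.normal(size=len(z))
B = rng.normal(size=(len(z), len(z)))
S_v = B @ B.T + 0.1 * np.eye(len(z))

# Whitened form: only triangular solves against L are needed.
K_xz_Linv_t = np.linalg.solve(L, K_xz.T).T  # K_xz L^{-T}
mean_w = K_xz_Linv_t @ m_v
covar_w = K_xx + K_xz_Linv_t @ (S_v - np.eye(len(z))) @ K_xz_Linv_t.T

# Equivalent unwhitened form with m = L m_v and S = L S_v L^T.
m_u = L @ m_v
S_u = L @ S_v @ L.T
A = np.linalg.solve(K_zz, K_xz.T).T         # K_xz K_zz^{-1}
mean_u = A @ m_u
covar_u = K_xx + A @ (S_u - K_zz) @ A.T
```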

Parameters:
• model (ApproximateGP) – Model this strategy is applied to. Typically passed in when the VariationalStrategy is created in the __init__ method of the user-defined model.
• inducing_points (torch.Tensor) – Tensor containing a set of inducing points to use for variational inference.
• variational_distribution (VariationalDistribution) – A VariationalDistribution object that represents the form of the variational distribution $$q(\mathbf u)$$.
• learn_inducing_locations (bool) – (optional, default True) Whether or not the inducing point locations $$\mathbf Z$$ should be learned (i.e. whether they are parameters of the model).

### MultitaskVariationalStrategy¶

class gpytorch.variational.MultitaskVariationalStrategy(base_variational_strategy, num_tasks, task_dim=-1)[source]

MultitaskVariationalStrategy wraps an existing VariationalStrategy to produce a MultitaskMultivariateNormal distribution. This is useful for multi-output variational models.

The base variational strategy is assumed to operate on a batch of GPs. One of the batch dimensions corresponds to the multiple tasks.

Parameters:
• base_variational_strategy (VariationalStrategy) – Base variational strategy.
• task_dim (int) – (default=-1) Which batch dimension is the task dimension.

### OrthogonallyDecoupledVariationalStrategy¶

class gpytorch.variational.OrthogonallyDecoupledVariationalStrategy(model, inducing_points, variational_distribution)[source]

Implements orthogonally decoupled VGPs as defined in Salimbeni et al. (2018). This variational strategy uses a different set of inducing points for the mean and covariance functions. The idea is to use more inducing points for the (computationally efficient) mean and fewer inducing points for the (computationally expensive) covariance.

This variational strategy defines the inducing points/_VariationalDistribution for the mean function. It then wraps a different _VariationalStrategy which defines the covariance inducing points.

Example:
>>> import torch
>>> import gpytorch
>>>
>>> # Assumes `model` (an ApproximateGP) and `train_x` (a torch.Tensor) are already defined.
>>> mean_inducing_points = torch.randn(1000, train_x.size(-1), dtype=train_x.dtype, device=train_x.device)
>>> covar_inducing_points = torch.randn(100, train_x.size(-1), dtype=train_x.dtype, device=train_x.device)
>>>
>>> covar_variational_strategy = gpytorch.variational.VariationalStrategy(
...     model, covar_inducing_points,
...     gpytorch.variational.CholeskyVariationalDistribution(covar_inducing_points.size(-2)),
...     learn_inducing_locations=True
... )
>>>
>>> variational_strategy = gpytorch.variational.OrthogonallyDecoupledVariationalStrategy(
...     covar_variational_strategy, mean_inducing_points,
...     gpytorch.variational.DeltaVariationalDistribution(mean_inducing_points.size(-2)),
... )


### UnwhitenedVariationalStrategy¶

class gpytorch.variational.UnwhitenedVariationalStrategy(model, inducing_points, variational_distribution, learn_inducing_locations=True)[source]

Similar to VariationalStrategy, but does not perform the whitening operation. In almost all cases VariationalStrategy is preferable, with a few exceptions:

• When the inducing points are exactly equal to the training points (i.e. $$\mathbf Z = \mathbf X$$). Unwhitened models are faster in this case.
• When the number of inducing points is very large (e.g. >2000). Unwhitened models can use CG for faster computation.

Parameters:
• model (ApproximateGP) – Model this strategy is applied to. Typically passed in when the VariationalStrategy is created in the __init__ method of the user-defined model.
• inducing_points (torch.Tensor) – Tensor containing a set of inducing points to use for variational inference.
• variational_distribution (VariationalDistribution) – A VariationalDistribution object that represents the form of the variational distribution $$q(\mathbf u)$$.
• learn_inducing_locations (bool) – (optional, default True) Whether or not the inducing point locations $$\mathbf Z$$ should be learned (i.e. whether they are parameters of the model).

### GridInterpolationVariationalStrategy¶

class gpytorch.variational.GridInterpolationVariationalStrategy(model, grid_size, grid_bounds, variational_distribution)[source]

This strategy constrains the inducing points to a grid and applies a deterministic relationship between $$\mathbf f$$ and $$\mathbf u$$. It was introduced by Wilson et al. (2016).

Here, the inducing points are not learned. Instead, the strategy automatically creates inducing points based on a set of grid sizes and grid bounds.
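
To illustrate what a deterministic relationship between $$\mathbf f$$ and $$\mathbf u$$ can look like, the NumPy sketch below assumes it takes the form of an interpolation $$\mathbf f \approx \mathbf W \mathbf u$$, where each row of $$\mathbf W$$ holds the weights of the grid points nearest to one input. Linear interpolation on a 1-D grid is used here for brevity and may differ from the interpolation scheme GPyTorch actually applies. The helper name `linear_interp_matrix` and the toy values are hypothetical.

```python
import numpy as np


def linear_interp_matrix(x, grid):
    """Rows hold linear-interpolation weights from an evenly spaced grid to x."""
    spacing = grid[1] - grid[0]
    # Index of the left grid neighbor, clipped so a right neighbor always exists.
    left = np.clip(np.floor((x - grid[0]) / spacing).astype(int), 0, len(grid) - 2)
    frac = (x - grid[left]) / spacing
    W = np.zeros((len(x), len(grid)))
    rows = np.arange(len(x))
    W[rows, left] = 1.0 - frac
    W[rows, left + 1] = frac
    return W


grid_size, grid_bounds = 11, (0.0, 1.0)
grid = np.linspace(*grid_bounds, grid_size)  # inducing points fixed on a grid
x = np.array([0.0, 0.13, 0.5, 0.87, 1.0])    # query locations inside the bounds

m = np.sin(6 * grid)            # variational mean of q(u)
S = 0.01 * np.eye(grid_size)    # variational covariance of q(u)

W = linear_interp_matrix(x, grid)
mean = W @ m                    # mean of q(f(x)) under f = W u
covar = W @ S @ W.T             # covariance of q(f(x)) under f = W u
```

Because each row of `W` has only two nonzero entries in this sketch, the products with `W` can be computed with sparse operations, which is what makes a grid-constrained strategy attractive for large grids.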

Parameters:
• model (ApproximateGP) – Model this strategy is applied to. Typically passed in when the VariationalStrategy is created in the __init__ method of the user-defined model.
• grid_size (int) – Size of the grid.
• grid_bounds (list) – Bounds of each dimension of the grid (should be a list of (float, float) tuples).
• variational_distribution (VariationalDistribution) – A VariationalDistribution object that represents the form of the variational distribution $$q(\mathbf u)$$.

## Variational Distributions¶

VariationalDistribution objects represent the variational distribution $$q(\mathbf u)$$ over a set of inducing points for GPs. Typically the distributions are some sort of parameterization of a multivariate normal distribution.

### _VariationalDistribution¶

class gpytorch.variational._VariationalDistribution(num_inducing_points, batch_shape=torch.Size([]), mean_init_std=0.001)[source]

Abstract base class for all Variational Distributions.

forward()[source]

Constructs and returns the variational distribution

Return type: MultivariateNormal
Returns: The distribution $$q(\mathbf u)$$

initialize_variational_distribution(prior_dist)[source]

Method for initializing the variational distribution, based on the prior distribution.

Parameters: prior_dist (Distribution) – The prior distribution $$p(\mathbf u)$$.

### CholeskyVariationalDistribution¶

class gpytorch.variational.CholeskyVariationalDistribution(num_inducing_points, batch_shape=torch.Size([]), mean_init_std=0.001, **kwargs)[source]

A _VariationalDistribution that is defined to be a multivariate normal distribution with a full covariance matrix.

The most common way this distribution is defined is to parameterize it in terms of a mean vector and a covariance matrix. To ensure that the covariance matrix remains positive definite, it is represented through a lower-triangular factor, so only the lower triangle is considered.

Parameters:
• num_inducing_points (int) – Size of the variational distribution. This implies that the variational mean should be this size, and the variational covariance matrix should have this many rows and columns.
• batch_shape (torch.Size) – (Optional.) Specifies an optional batch size for the variational parameters. This is useful, for example, when doing additive variational inference.
• mean_init_std (float) – (default=1e-3) Standard deviation of Gaussian noise to add to the mean initialization.
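
The lower-triangle idea can be illustrated with a short NumPy sketch: any unconstrained square matrix is masked to its lower triangle $$\mathbf L$$, and the covariance is formed as $$\mathbf L \mathbf L^\top$$, which is symmetric and positive semi-definite by construction (and positive definite whenever the diagonal of $$\mathbf L$$ is nonzero). The variable names below are illustrative and do not correspond to GPyTorch attributes.

```python
import numpy as np

rng = np.random.default_rng(0)
num_inducing_points = 4

# Unconstrained values, e.g. what an optimizer would update freely.
raw = rng.normal(size=(num_inducing_points, num_inducing_points))

chol = np.tril(raw)     # keep only the lower triangle
covar = chol @ chol.T   # covariance of q(u), valid regardless of the values in raw

eigvals = np.linalg.eigvalsh(covar)  # all non-negative up to round-off
```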

### DeltaVariationalDistribution¶

class gpytorch.variational.DeltaVariationalDistribution(num_inducing_points, batch_shape=torch.Size([]), mean_init_std=0.001, **kwargs)[source]

This _VariationalDistribution object replaces a variational distribution with a single particle. It is equivalent to doing MAP inference.

Parameters:
• num_inducing_points (int) – Size of the variational distribution. This implies that the variational mean should be this size.
• batch_shape (torch.Size) – (Optional.) Specifies an optional batch size for the variational parameters. This is useful, for example, when doing additive variational inference.
• mean_init_std (float) – (default=1e-3) Standard deviation of Gaussian noise to add to the mean initialization.

### MeanFieldVariationalDistribution¶

class gpytorch.variational.MeanFieldVariationalDistribution(num_inducing_points, batch_shape=torch.Size([]), mean_init_std=0.001, **kwargs)[source]

A _VariationalDistribution that is defined to be a multivariate normal distribution with a diagonal covariance matrix. This will not be as flexible/expressive as a CholeskyVariationalDistribution.

Parameters:
• num_inducing_points (int) – Size of the variational distribution. This implies that the variational mean should be this size, and the variational covariance matrix should have this many rows and columns.
• batch_shape (torch.Size) – (Optional.) Specifies an optional batch size for the variational parameters. This is useful, for example, when doing additive variational inference.
• mean_init_std (float) – (default=1e-3) Standard deviation of Gaussian noise to add to the mean initialization.