# GPyTorch’s documentation¶

## GPyTorch Regression Tutorial¶

### Introduction¶

In this notebook, we demonstrate many of the design features of GPyTorch using the simplest example, training an RBF kernel Gaussian process on a simple function. We’ll be modeling the function

\begin{align} y &= \sin(2\pi x) + \epsilon \\ \epsilon &\sim \mathcal{N}(0, 0.04) \end{align}

with 100 training examples, and testing on 51 test examples.
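Note that a noise variance of 0.04 corresponds to a noise standard deviation of 0.2. As a minimal pure-Python sketch (stdlib only, in place of the torch version used below), the training data could be generated like this:

```python
import math
import random

random.seed(0)

# Noise variance 0.04 -> standard deviation sqrt(0.04) = 0.2
noise_std = math.sqrt(0.04)

# 100 regularly spaced points on [0, 1]
train_x = [i / 99 for i in range(100)]
# y = sin(2*pi*x) + eps, with eps ~ N(0, 0.04)
train_y = [math.sin(2 * math.pi * x) + random.gauss(0, noise_std) for x in train_x]
```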

Note: this notebook is not necessarily intended to teach the mathematical background of Gaussian processes, but rather how to train a simple one and make predictions in GPyTorch. For a mathematical treatment, Chapter 2 of Gaussian Processes for Machine Learning provides a very thorough introduction to GP regression (this entire text is highly recommended): http://www.gaussianprocess.org/gpml/chapters/RW2.pdf

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline


#### Set up training data¶

In the next cell, we set up the training data for this example. We’ll be using 100 regularly spaced points on [0,1], at which we evaluate the function and add Gaussian noise to get the training labels.

:

# Training data is 100 points in [0,1] inclusive regularly spaced
train_x = torch.linspace(0, 1, 100)
# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * math.sqrt(0.04)


### Setting up the model¶

The next cell demonstrates the most critical features of a user-defined Gaussian process model in GPyTorch. Building a GP model in GPyTorch differs from other packages in a number of ways.

First, in contrast to many existing GP packages, we do not provide full GP models for the user. Rather, we provide the tools necessary to quickly construct one. This is because we believe that, analogous to building a neural network in standard PyTorch, it is important to have the flexibility to include whatever components are necessary. As the more complicated examples show, this design gives the user great flexibility in building custom models.

For most GP regression models, you will need to construct the following GPyTorch objects:

1. A GP Model (gpytorch.models.ExactGP) - This handles most of the inference.
2. A Likelihood (gpytorch.likelihoods.GaussianLikelihood) - This is the most common likelihood used for GP regression.
3. A Mean - This defines the prior mean of the GP. (If you don’t know which mean to use, a gpytorch.means.ConstantMean() is a good place to start.)
4. A Kernel - This defines the prior covariance of the GP. (If you don’t know which kernel to use, a gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel()) is a good place to start.)
5. A MultivariateNormal Distribution (gpytorch.distributions.MultivariateNormal) - This is the object used to represent multivariate normal distributions.

#### The GP Model¶

The components of a user built (Exact, i.e. non-variational) GP model in GPyTorch are, broadly speaking:

1. An __init__ method that takes the training data and a likelihood, and constructs whatever objects are necessary for the model’s forward method. This will most commonly include things like a mean module and a kernel module.
2. A forward method that takes in some $n \times d$ data x and returns a MultivariateNormal with the prior mean and covariance evaluated at x. In other words, we return the vector $\mu(x)$ and the $n \times n$ matrix $K_{xx}$ representing the prior mean and covariance matrix of the GP.

This specification leaves a large amount of flexibility when defining a model. For example, to compose two kernels via addition, you can either add the kernel modules directly:

self.covar_module = ScaleKernel(RBFKernel() + WhiteNoiseKernel())


Or you can add the outputs of the kernel in the forward method:

covar_x = self.rbf_kernel_module(x) + self.white_noise_module(x)

:

# We will use the simplest form of GP model, exact inference
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)


#### Model modes¶

Like most PyTorch modules, the ExactGP has .train() and .eval() modes.

- .train() mode is for optimizing model hyperparameters.
- .eval() mode is for computing predictions through the model posterior.

### Training the model¶

In the next cell, we handle using Type-II MLE to train the hyperparameters of the Gaussian process.

The most obvious difference here compared to many other GP implementations is that, as in standard PyTorch, the core training loop is written by the user. In GPyTorch, we make use of the standard PyTorch optimizers as from torch.optim, and all trainable parameters of the model should be of type torch.nn.Parameter. Because GP models directly extend torch.nn.Module, calls to methods like model.parameters() or model.named_parameters() function as you might expect coming from PyTorch.

In most cases, the boilerplate code below will work well. It has the same basic components as the standard PyTorch training loop:

1. Zero all parameter gradients
2. Call the model and compute the loss
3. Call backward on the loss to fill in gradients
4. Take a step on the optimizer

However, defining custom training loops allows for greater flexibility. For example, it is easy to save the parameters at each step of training, or use different learning rates for different parameters (which may be useful in deep kernel learning for example).
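As a toy illustration of the second point (plain Python rather than torch.optim, with made-up parameter names), a hand-written loop makes it trivial to give each parameter its own learning rate and to record the parameter trajectory at every step:

```python
# Minimize f(a, b) = (a - 3)**2 + (b + 1)**2 by gradient descent,
# with a different learning rate for each parameter.
def grads(a, b):
    return 2 * (a - 3), 2 * (b + 1)

a, b = 0.0, 0.0
lr_a, lr_b = 0.2, 0.05           # per-parameter learning rates
history = []                      # save parameters at each step
for _ in range(200):
    ga, gb = grads(a, b)
    a -= lr_a * ga
    b -= lr_b * gb
    history.append((a, b))
# a converges to 3, b converges to -1
```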

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iter = 2 if smoke_test else 50

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()

Iter 1/50 - Loss: 0.939   lengthscale: 0.693   noise: 0.693
Iter 2/50 - Loss: 0.908   lengthscale: 0.644   noise: 0.644
Iter 3/50 - Loss: 0.874   lengthscale: 0.598   noise: 0.598
Iter 4/50 - Loss: 0.837   lengthscale: 0.555   noise: 0.554
Iter 5/50 - Loss: 0.795   lengthscale: 0.514   noise: 0.513
Iter 6/50 - Loss: 0.749   lengthscale: 0.476   noise: 0.474
Iter 7/50 - Loss: 0.699   lengthscale: 0.440   noise: 0.437
Iter 8/50 - Loss: 0.649   lengthscale: 0.405   noise: 0.402
Iter 9/50 - Loss: 0.600   lengthscale: 0.372   noise: 0.369
Iter 10/50 - Loss: 0.556   lengthscale: 0.342   noise: 0.339
Iter 11/50 - Loss: 0.516   lengthscale: 0.315   noise: 0.310
Iter 12/50 - Loss: 0.480   lengthscale: 0.291   noise: 0.284
Iter 13/50 - Loss: 0.448   lengthscale: 0.270   noise: 0.259
Iter 14/50 - Loss: 0.413   lengthscale: 0.254   noise: 0.237
Iter 15/50 - Loss: 0.380   lengthscale: 0.241   noise: 0.216
Iter 16/50 - Loss: 0.355   lengthscale: 0.231   noise: 0.197
Iter 17/50 - Loss: 0.314   lengthscale: 0.223   noise: 0.179
Iter 18/50 - Loss: 0.292   lengthscale: 0.218   noise: 0.163
Iter 19/50 - Loss: 0.262   lengthscale: 0.214   noise: 0.148
Iter 20/50 - Loss: 0.236   lengthscale: 0.214   noise: 0.135
Iter 21/50 - Loss: 0.201   lengthscale: 0.216   noise: 0.122
Iter 22/50 - Loss: 0.176   lengthscale: 0.220   noise: 0.111
Iter 23/50 - Loss: 0.158   lengthscale: 0.224   noise: 0.102
Iter 24/50 - Loss: 0.125   lengthscale: 0.231   noise: 0.093
Iter 25/50 - Loss: 0.101   lengthscale: 0.239   noise: 0.085
Iter 26/50 - Loss: 0.078   lengthscale: 0.247   noise: 0.077
Iter 27/50 - Loss: 0.066   lengthscale: 0.256   noise: 0.071
Iter 28/50 - Loss: 0.052   lengthscale: 0.265   noise: 0.065
Iter 29/50 - Loss: 0.036   lengthscale: 0.276   noise: 0.060
Iter 30/50 - Loss: 0.036   lengthscale: 0.286   noise: 0.056
Iter 31/50 - Loss: 0.031   lengthscale: 0.297   noise: 0.052
Iter 32/50 - Loss: 0.028   lengthscale: 0.306   noise: 0.048
Iter 33/50 - Loss: 0.030   lengthscale: 0.315   noise: 0.045
Iter 34/50 - Loss: 0.035   lengthscale: 0.322   noise: 0.043
Iter 35/50 - Loss: 0.039   lengthscale: 0.326   noise: 0.041
Iter 36/50 - Loss: 0.043   lengthscale: 0.329   noise: 0.039
Iter 37/50 - Loss: 0.047   lengthscale: 0.327   noise: 0.038
Iter 38/50 - Loss: 0.052   lengthscale: 0.323   noise: 0.037
Iter 39/50 - Loss: 0.048   lengthscale: 0.317   noise: 0.036
Iter 40/50 - Loss: 0.051   lengthscale: 0.309   noise: 0.036
Iter 41/50 - Loss: 0.051   lengthscale: 0.302   noise: 0.036
Iter 42/50 - Loss: 0.047   lengthscale: 0.295   noise: 0.036
Iter 43/50 - Loss: 0.048   lengthscale: 0.288   noise: 0.036
Iter 44/50 - Loss: 0.047   lengthscale: 0.281   noise: 0.037
Iter 45/50 - Loss: 0.047   lengthscale: 0.276   noise: 0.037
Iter 46/50 - Loss: 0.040   lengthscale: 0.273   noise: 0.038
Iter 47/50 - Loss: 0.037   lengthscale: 0.271   noise: 0.039
Iter 48/50 - Loss: 0.040   lengthscale: 0.270   noise: 0.040
Iter 49/50 - Loss: 0.033   lengthscale: 0.269   noise: 0.042
Iter 50/50 - Loss: 0.032   lengthscale: 0.269   noise: 0.043


### Make predictions with the model¶

In the next cell, we make predictions with the model. To do this, we simply put the model and likelihood in eval mode, and call both modules on the test data.

Just as a user defined GP model returns a MultivariateNormal containing the prior mean and covariance from forward, a trained GP model in eval mode returns a MultivariateNormal containing the posterior mean and covariance. Thus, getting the predictive mean and variance, and then sampling functions from the GP at the given test points could be accomplished with calls like:

f_preds = model(test_x)
y_preds = likelihood(model(test_x))

f_mean = f_preds.mean
f_var = f_preds.variance
f_covar = f_preds.covariance_matrix
f_samples = f_preds.sample(sample_shape=torch.Size([1000]))


The gpytorch.settings.fast_pred_var context is not needed, but here we are giving a preview of using one of our cool features, getting faster predictive distributions using LOVE.

:

# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

# Test points are regularly spaced along [0,1]
# Make predictions by feeding model through likelihood
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    test_x = torch.linspace(0, 1, 51)
    observed_pred = likelihood(model(test_x))


### Plot the model fit¶

In the next cell, we plot the mean and confidence region of the Gaussian process model. The confidence_region method is a helper method that returns 2 standard deviations above and below the mean.
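A pure-Python sketch of what such a helper computes, assuming the two-standard-deviation convention stated above:

```python
import math

def confidence_region(mean, variance):
    """Return (lower, upper) bounds at mean +/- 2 standard deviations."""
    lower = [m - 2 * math.sqrt(v) for m, v in zip(mean, variance)]
    upper = [m + 2 * math.sqrt(v) for m, v in zip(mean, variance)]
    return lower, upper

lower, upper = confidence_region([0.0, 1.0], [0.04, 0.25])
# lower = [-0.4, 0.0], upper = [0.4, 2.0]
```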

:

with torch.no_grad():
    # Initialize plot
    f, ax = plt.subplots(1, 1, figsize=(4, 3))

    # Get upper and lower confidence bounds
    lower, upper = observed_pred.confidence_region()
    # Plot training data as black stars
    ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
    # Plot predictive means as blue line
    ax.plot(test_x.numpy(), observed_pred.mean.numpy(), 'b')
    # Shade between the lower and upper confidence bounds
    ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
    ax.set_ylim([-3, 3])
    ax.legend(['Observed Data', 'Mean', 'Confidence'])




## Basic Usage¶

This folder contains notebooks for basic usage of the package, e.g. things like dealing with hyperparameters, parameter constraints and priors, and saving and loading models.

Before checking these out, you may want to check out our simple GP regression tutorial that details the anatomy of a GPyTorch model.

### Hyperparameters in GPyTorch¶

The purpose of this notebook is to explain how GP hyperparameters in GPyTorch work, how they are handled, what options are available for constraints and priors, and how things may differ from other packages.

Note: This is a basic introduction to hyperparameters in GPyTorch. If you want to use GPyTorch hyperparameters with things like Pyro distributions, that will be covered in a less “basic usage” tutorial.

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))


#### Defining an example model¶

In the next cell, we define our simple exact GP from the Simple GP Regression tutorial. We’ll be using this model to demonstrate certain aspects of hyperparameter creation.

:

train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2

# We will use the simplest form of GP model, exact inference
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)


#### Viewing model hyperparameters¶

Let’s take a look at the model parameters. By “parameters”, we mean explicitly objects of type torch.nn.Parameter that will have gradients filled in by autograd. There are two ways to access these in torch. One is model.state_dict(), which we demonstrate for saving models here.

In the next cell we demonstrate another way to do this, by looping over the model.named_parameters() generator:

:

for param_name, param in model.named_parameters():
    print(f'Parameter name: {param_name:42} value = {param.item()}')

Parameter name: likelihood.noise_covar.raw_noise           value = 0.0
Parameter name: mean_module.constant                       value = 0.0
Parameter name: covar_module.raw_outputscale               value = 0.0
Parameter name: covar_module.base_kernel.raw_lengthscale   value = 0.0

##### Raw vs Actual Parameters¶

The most important thing to note here is that the actual learned parameters of the model are things like raw_noise, raw_outputscale, raw_lengthscale, etc. The reason for this is that these parameters must be positive. This brings us to our next topic for parameters: constraints, and the difference between raw parameters and actual parameters.

To enforce positivity and other constraints on hyperparameters, GPyTorch has raw parameters (e.g., model.covar_module.raw_outputscale) that are transformed to actual values via some constraint. Let’s take a look at the raw outputscale, its constraint, and the final value:

:

raw_outputscale = model.covar_module.raw_outputscale
print('raw_outputscale, ', raw_outputscale)

# Three ways of accessing the raw outputscale constraint
print('\nraw_outputscale_constraint1', model.covar_module.raw_outputscale_constraint)

printmd('\n\n**Printing all model constraints...**\n')
for constraint_name, constraint in model.named_constraints():
    print(f'Constraint name: {constraint_name:55} constraint = {constraint}')

printmd('\n**Getting raw outputscale constraint from model...**')
print(model.constraint_for_parameter_name("covar_module.raw_outputscale"))

printmd('\n**Getting raw outputscale constraint from model.covar_module...**')
print(model.covar_module.constraint_for_parameter_name("raw_outputscale"))

raw_outputscale,  Parameter containing:

raw_outputscale_constraint1 Positive()

Printing all model constraints…
Constraint name: likelihood.noise_covar.raw_noise_constraint             constraint = GreaterThan(1.000E-04)
Constraint name: covar_module.raw_outputscale_constraint                 constraint = Positive()
Constraint name: covar_module.base_kernel.raw_lengthscale_constraint     constraint = Positive()

Getting raw outputscale constraint from model…
Positive()

Getting raw outputscale constraint from model.covar_module…
Positive()

##### How do constraints work?¶

Constraints define transform and inverse_transform methods that turn raw parameters into actual parameter values. For a positive constraint, we expect the transformed values to always be positive. Let’s see:

:

raw_outputscale = model.covar_module.raw_outputscale
constraint = model.covar_module.raw_outputscale_constraint

print('Transformed outputscale', constraint.transform(raw_outputscale))
print(constraint.inverse_transform(constraint.transform(raw_outputscale)))
print(torch.equal(constraint.inverse_transform(constraint.transform(raw_outputscale)), raw_outputscale))

print('Transform a bunch of negative tensors: ', constraint.transform(torch.tensor([-1., -2., -3.])))

Transformed outputscale tensor(0.6931, grad_fn=<SoftplusBackward>)
True
Transform a bunch of negative tensors:  tensor([0.3133, 0.1269, 0.0486])
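The Positive() constraint uses a softplus transform (as the SoftplusBackward grad_fn in the output above suggests); a minimal pure-Python sketch of the pair of transforms:

```python
import math

def transform(raw):
    # softplus: log(1 + exp(raw)) is always positive
    return math.log1p(math.exp(raw))

def inverse_transform(value):
    # inverse softplus: log(exp(value) - 1)
    return math.log(math.expm1(value))

# A raw value of 0.0 maps to log(2) ~= 0.6931 -- the default outputscale above
print(transform(0.0))
# Even negative raw values map to positive actual values
print(transform(-3.0))  # ~0.0486, matching the last entry in the output above
```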

##### Convenience Getters/Setters for Transformed Values¶

Because dealing with raw parameter values is annoying (e.g., we might know what a noise variance of 0.01 means, but maybe not a raw_noise of -2.791), virtually all built-in GPyTorch modules that define raw parameters also define convenience getters and setters for dealing with transformed values directly.

In the next cells, we demonstrate the “inconvenient” way and the “convenient” way of getting and setting the outputscale.

:

# Recreate model to reset outputscale
model = ExactGPModel(train_x, train_y, likelihood)

# Inconvenient way of getting true outputscale
raw_outputscale = model.covar_module.raw_outputscale
constraint = model.covar_module.raw_outputscale_constraint
outputscale = constraint.transform(raw_outputscale)
print(f'Actual outputscale: {outputscale.item()}')

# Inconvenient way of setting true outputscale
model.covar_module.raw_outputscale.data.fill_(constraint.inverse_transform(torch.tensor(2.)))
raw_outputscale = model.covar_module.raw_outputscale
outputscale = constraint.transform(raw_outputscale)
print(f'Actual outputscale after setting: {outputscale.item()}')

Actual outputscale: 0.6931471824645996
Actual outputscale after setting: 2.0


Ouch, that is ugly! Fortunately, there is a better way:

:

# Recreate model to reset outputscale
model = ExactGPModel(train_x, train_y, likelihood)

# Convenient way of getting true outputscale
print(f'Actual outputscale: {model.covar_module.outputscale}')

# Convenient way of setting true outputscale
model.covar_module.outputscale = 2.
print(f'Actual outputscale after setting: {model.covar_module.outputscale}')

Actual outputscale: 0.6931471824645996
Actual outputscale after setting: 2.0
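These convenience accessors follow the standard Python property pattern. A hedged sketch of how such a getter/setter pair might look (hypothetical class, stdlib math in place of torch):

```python
import math

class ScaleKernelSketch:
    """Toy stand-in for a module with a raw parameter plus a convenience property."""
    def __init__(self):
        self.raw_outputscale = 0.0  # the actual learned parameter

    @property
    def outputscale(self):
        # transform raw -> actual via softplus
        return math.log1p(math.exp(self.raw_outputscale))

    @outputscale.setter
    def outputscale(self, value):
        # inverse-softplus the desired value back into raw space
        self.raw_outputscale = math.log(math.expm1(value))

kernel = ScaleKernelSketch()
print(kernel.outputscale)   # ~0.6931, same default as above
kernel.outputscale = 2.0
print(kernel.outputscale)   # 2.0
```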


#### Changing Parameter Constraints¶

If we look at the actual noise of the model, GPyTorch defines a default lower bound of 1e-4 for the noise variance:

:

print(f'Actual noise value: {likelihood.noise}')
print(f'Noise constraint: {likelihood.noise_covar.raw_noise_constraint}')

Actual noise value: tensor([0.6932], grad_fn=<AddBackward0>)
Noise constraint: GreaterThan(1.000E-04)
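One common way to implement such a lower bound (and, as an assumption about GPyTorch's internals, roughly what GreaterThan does) is to shift a softplus by the bound:

```python
import math

def greater_than_transform(raw, lower_bound=1e-4):
    # softplus keeps the offset positive; adding the bound enforces value > lower_bound
    return lower_bound + math.log1p(math.exp(raw))

# Even a very negative raw value cannot violate the bound
print(greater_than_transform(-20.0))  # slightly above 1e-4
```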


We can change the noise constraint either on the fly or when the likelihood is created:

:

likelihood = gpytorch.likelihoods.GaussianLikelihood(noise_constraint=gpytorch.constraints.GreaterThan(1e-3))
print(f'Noise constraint: {likelihood.noise_covar.raw_noise_constraint}')

## Changing the constraint after the module has been created
likelihood.noise_covar.register_constraint("raw_noise", gpytorch.constraints.Positive())
print(f'Noise constraint: {likelihood.noise_covar.raw_noise_constraint}')

Noise constraint: GreaterThan(1.000E-03)
Noise constraint: Positive()


#### Priors¶

In GPyTorch, priors are objects you register with the model that act on an arbitrary function of any parameter. Like constraints, they can usually be defined either when you create an object (like a Kernel or Likelihood) or set afterwards on the fly.

Here are some examples:

:

# Registers a prior on the sqrt of the noise parameter
# (e.g., a prior for the noise standard deviation instead of variance)
likelihood.noise_covar.register_prior(
    "noise_std_prior",
    gpytorch.priors.NormalPrior(0, 1),
    lambda: likelihood.noise.sqrt()
)

# Create a GaussianLikelihood with a normal prior for the noise
likelihood = gpytorch.likelihoods.GaussianLikelihood(
    noise_constraint=gpytorch.constraints.GreaterThan(1e-3),
    noise_prior=gpytorch.priors.NormalPrior(0, 1)
)


#### Putting it Together¶

In the next cell, we augment our ExactGP definition to place several priors over hyperparameters and tighter constraints when creating the model.

:

# We will use the simplest form of GP model, exact inference
class FancyGPWithPriors(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(FancyGPWithPriors, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()

        lengthscale_prior = gpytorch.priors.GammaPrior(3.0, 6.0)
        outputscale_prior = gpytorch.priors.GammaPrior(2.0, 0.15)

        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(
                lengthscale_prior=lengthscale_prior,
            ),
            outputscale_prior=outputscale_prior
        )

        # Initialize lengthscale and outputscale to mean of priors
        self.covar_module.base_kernel.lengthscale = lengthscale_prior.mean
        self.covar_module.outputscale = outputscale_prior.mean

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood(
    noise_constraint=gpytorch.constraints.GreaterThan(1e-2),
)

model = FancyGPWithPriors(train_x, train_y, likelihood)
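The prior means used for initialization above are easy to check by hand: a Gamma(concentration, rate) distribution has mean concentration / rate:

```python
# GammaPrior(3.0, 6.0) -> mean 0.5 (initial lengthscale)
lengthscale_mean = 3.0 / 6.0
# GammaPrior(2.0, 0.15) -> mean ~13.33 (initial outputscale)
outputscale_mean = 2.0 / 0.15
print(lengthscale_mean, outputscale_mean)
```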


#### Initializing hyperparameters in One Call¶

For convenience, GPyTorch modules also define an initialize method that allows you to update a full dictionary of parameters on submodules. For example:

:

hypers = {
    'likelihood.noise_covar.noise': torch.tensor(1.),
    'covar_module.base_kernel.lengthscale': torch.tensor(0.5),
    'covar_module.outputscale': torch.tensor(2.),
}

model.initialize(**hypers)
print(
    model.likelihood.noise_covar.noise.item(),
    model.covar_module.base_kernel.lengthscale.item(),
    model.covar_module.outputscale.item()
)

1.0000001192092896 0.4999999701976776 2.0


### Saving and Loading Models¶

In this bite-sized notebook, we’ll go over how to save and load models. In general, the process is the same as for any PyTorch module.

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt


#### Saving a Simple Model¶

First, we define a GP Model that we’d like to save. The model used below is the same as the model from our Simple GP Regression tutorial.

:

train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2

:

# We will use the simplest form of GP model, exact inference
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

##### Change Model State¶

To demonstrate model saving, we change the hyperparameters from the default values below. For more information on what is happening here, see our tutorial notebook on Initializing Hyperparameters.

:

model.covar_module.outputscale = 1.2
model.covar_module.base_kernel.lengthscale = 2.2

##### Getting Model State¶

To get the full state of a GPyTorch model, simply call state_dict as you would on any PyTorch model. Note that the state dict contains raw parameter values. This is because these are the actual torch.nn.Parameters that are learned in GPyTorch. Again, see our notebook on hyperparameters for more information on this.

:

model.state_dict()

:

OrderedDict([('likelihood.noise_covar.raw_noise', tensor([0.])),
('mean_module.constant', tensor([0.])),
('covar_module.raw_outputscale', tensor(0.8416)),
('covar_module.base_kernel.raw_lengthscale', tensor([[2.0826]]))])
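The raw values above are consistent with the values we set earlier: applying the inverse softplus to 1.2 and 2.2 recovers 0.8416 and 2.0826 respectively (a quick check in plain Python):

```python
import math

def inverse_softplus(value):
    # inverse of softplus(x) = log(1 + exp(x))
    return math.log(math.expm1(value))

print(round(inverse_softplus(1.2), 4))   # 0.8416 -> raw_outputscale
print(round(inverse_softplus(2.2), 4))   # 2.0826 -> raw_lengthscale
```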

##### Saving Model State¶

The state dictionary above represents all trainable parameters for the model. Therefore, we can save it to a file as follows:

:

torch.save(model.state_dict(), 'model_state.pth')


Next, we load this state in to a new model and demonstrate that the parameters were updated correctly.

:

state_dict = torch.load('model_state.pth')
model = ExactGPModel(train_x, train_y, likelihood)  # Create a new GP model
model.load_state_dict(state_dict)

:

IncompatibleKeys(missing_keys=[], unexpected_keys=[])

:

model.state_dict()

:

OrderedDict([('likelihood.noise_covar.raw_noise', tensor([0.])),
('mean_module.constant', tensor([0.])),
('covar_module.raw_outputscale', tensor(0.8416)),
('covar_module.base_kernel.raw_lengthscale', tensor([[2.0826]]))])


#### A More Complex Example¶

Next we demonstrate the same principle on a more complex exact GP, where we have a simple feed-forward neural network feature extractor as part of the model.

:

class GPWithNNFeatureExtractor(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPWithNNFeatureExtractor, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

        self.feature_extractor = torch.nn.Sequential(
            torch.nn.Linear(1, 2),
            torch.nn.BatchNorm1d(2),
            torch.nn.ReLU(),
            torch.nn.Linear(2, 2),
            torch.nn.BatchNorm1d(2),
            torch.nn.ReLU(),
        )

    def forward(self, x):
        x = self.feature_extractor(x)
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPWithNNFeatureExtractor(train_x, train_y, likelihood)

##### Getting Model State¶

In the next cell, we once again print the model state via model.state_dict(). As you can see, the state is substantially more complex, as the model now includes our neural network parameters. Nevertheless, saving and loading is straightforward.

:

model.state_dict()

:

OrderedDict([('likelihood.noise_covar.raw_noise', tensor([0.])),
('mean_module.constant', tensor([0.])),
('covar_module.raw_outputscale', tensor(0.)),
('covar_module.base_kernel.raw_lengthscale', tensor([[0.]])),
('feature_extractor.0.weight', tensor([[-0.9135],
[-0.5942]])),
('feature_extractor.0.bias', tensor([ 0.9119, -0.0663])),
('feature_extractor.1.weight', tensor([0.2263, 0.2209])),
('feature_extractor.1.bias', tensor([0., 0.])),
('feature_extractor.1.running_mean', tensor([0., 0.])),
('feature_extractor.1.running_var', tensor([1., 1.])),
('feature_extractor.1.num_batches_tracked', tensor(0)),
('feature_extractor.3.weight', tensor([[-0.6375, -0.6466],
[-0.0563, -0.4695]])),
('feature_extractor.3.bias', tensor([-0.1247,  0.0803])),
('feature_extractor.4.weight', tensor([0.0466, 0.7248])),
('feature_extractor.4.bias', tensor([0., 0.])),
('feature_extractor.4.running_mean', tensor([0., 0.])),
('feature_extractor.4.running_var', tensor([1., 1.])),
('feature_extractor.4.num_batches_tracked', tensor(0))])

:

torch.save(model.state_dict(), 'my_gp_with_nn_model.pth')
state_dict = torch.load('my_gp_with_nn_model.pth')
model = GPWithNNFeatureExtractor(train_x, train_y, likelihood)
model.load_state_dict(state_dict)

:

IncompatibleKeys(missing_keys=[], unexpected_keys=[])

:

model.state_dict()

:

OrderedDict([('likelihood.noise_covar.raw_noise', tensor([0.])),
('mean_module.constant', tensor([0.])),
('covar_module.raw_outputscale', tensor(0.)),
('covar_module.base_kernel.raw_lengthscale', tensor([[0.]])),
('feature_extractor.0.weight', tensor([[-0.9135],
[-0.5942]])),
('feature_extractor.0.bias', tensor([ 0.9119, -0.0663])),
('feature_extractor.1.weight', tensor([0.2263, 0.2209])),
('feature_extractor.1.bias', tensor([0., 0.])),
('feature_extractor.1.running_mean', tensor([0., 0.])),
('feature_extractor.1.running_var', tensor([1., 1.])),
('feature_extractor.1.num_batches_tracked', tensor(0)),
('feature_extractor.3.weight', tensor([[-0.6375, -0.6466],
[-0.0563, -0.4695]])),
('feature_extractor.3.bias', tensor([-0.1247,  0.0803])),
('feature_extractor.4.weight', tensor([0.0466, 0.7248])),
('feature_extractor.4.bias', tensor([0., 0.])),
('feature_extractor.4.running_mean', tensor([0., 0.])),
('feature_extractor.4.running_var', tensor([1., 1.])),
('feature_extractor.4.num_batches_tracked', tensor(0))])





## Exact GPs (Regression)¶

Regression with a Gaussian noise model is the canonical example of Gaussian processes. These examples will work for small to medium sized datasets (~2,000 data points). All examples here use exact GP inference.

### GP Regression with a Spectral Mixture Kernel¶

#### Introduction¶

This example shows how to use a SpectralMixtureKernel module on an ExactGP model. This kernel is designed for cases where:

• You want to use exact inference (e.g. for regression)
• You want to use a more sophisticated kernel than RBF

The Spectral Mixture (SM) kernel was invented and discussed in Wilson et al., 2013.

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline


In the next cell, we set up the training data for this example. We’ll be using 15 regularly spaced points on [0,1], and the function values at these points serve as the training labels.

:

train_x = torch.linspace(0, 1, 15)
train_y = torch.sin(train_x * (2 * math.pi))


#### Set up the model¶

The model should be very similar to the ExactGP model in the simple regression example.

The only difference is that here we’re using a more complex kernel (the SpectralMixtureKernel). This kernel requires careful initialization to work properly. To that end, in the model __init__ function, we call

self.covar_module = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=4)
self.covar_module.initialize_from_data(train_x, train_y)


This ensures that, when we perform optimization to learn kernel hyperparameters, we will be starting from a reasonable initialization.

:

class SpectralMixtureGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(SpectralMixtureGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=4)
        self.covar_module.initialize_from_data(train_x, train_y)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = SpectralMixtureGPModel(train_x, train_y, likelihood)


In the next cell, we handle using Type-II MLE to train the hyperparameters of the Gaussian process. The spectral mixture kernel’s hyperparameters start from what was specified in initialize_from_data.

:

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

training_iter = 100
for i in range(training_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iter, loss.item()))
    optimizer.step()

Iter 1/100 - Loss: 1.339
Iter 2/100 - Loss: 1.303
Iter 3/100 - Loss: 1.248
Iter 4/100 - Loss: 1.219
Iter 5/100 - Loss: 1.170
Iter 6/100 - Loss: 1.109
Iter 7/100 - Loss: 1.088
Iter 8/100 - Loss: 1.059
Iter 9/100 - Loss: 1.008
Iter 10/100 - Loss: 0.953
Iter 11/100 - Loss: 0.906
Iter 12/100 - Loss: 0.887
Iter 13/100 - Loss: 0.811
Iter 14/100 - Loss: 0.762
Iter 15/100 - Loss: 0.715
Iter 16/100 - Loss: 0.683
Iter 17/100 - Loss: 0.639
Iter 18/100 - Loss: 0.585
Iter 19/100 - Loss: 0.542
Iter 20/100 - Loss: 0.498
Iter 21/100 - Loss: 0.467
Iter 22/100 - Loss: 0.402
Iter 23/100 - Loss: 0.349
Iter 24/100 - Loss: 0.313
Iter 25/100 - Loss: 0.251
Iter 26/100 - Loss: 0.214
Iter 27/100 - Loss: 0.172
Iter 28/100 - Loss: 0.148
Iter 29/100 - Loss: 0.122
Iter 30/100 - Loss: 0.036
Iter 31/100 - Loss: -0.020
Iter 32/100 - Loss: -0.073
Iter 33/100 - Loss: -0.102
Iter 34/100 - Loss: -0.163
Iter 35/100 - Loss: -0.146
Iter 36/100 - Loss: -0.174
Iter 37/100 - Loss: -0.216
Iter 38/100 - Loss: -0.289
Iter 39/100 - Loss: -0.393
Iter 40/100 - Loss: -0.430
Iter 41/100 - Loss: -0.331
Iter 42/100 - Loss: -0.388
Iter 43/100 - Loss: -0.504
Iter 44/100 - Loss: -0.629
Iter 45/100 - Loss: -0.570
Iter 46/100 - Loss: -0.578
Iter 47/100 - Loss: -0.728
Iter 48/100 - Loss: -0.787
Iter 49/100 - Loss: -0.186
Iter 50/100 - Loss: -0.532
Iter 51/100 - Loss: -0.850
Iter 52/100 - Loss: -0.914
Iter 53/100 - Loss: -0.879
Iter 54/100 - Loss: -0.815
Iter 55/100 - Loss: -0.804
Iter 56/100 - Loss: -0.808
Iter 57/100 - Loss: -0.850
Iter 58/100 - Loss: -0.939
Iter 59/100 - Loss: -1.047
Iter 60/100 - Loss: -1.128
Iter 61/100 - Loss: -1.181
Iter 62/100 - Loss: -1.210
Iter 63/100 - Loss: -1.181
Iter 64/100 - Loss: -1.044
Iter 65/100 - Loss: -0.988
Iter 66/100 - Loss: -1.113
Iter 67/100 - Loss: -1.085
Iter 68/100 - Loss: -1.284
Iter 69/100 - Loss: -1.252
Iter 70/100 - Loss: -1.305
Iter 71/100 - Loss: -1.318
Iter 72/100 - Loss: -1.300
Iter 73/100 - Loss: -1.312
Iter 74/100 - Loss: -1.365
Iter 75/100 - Loss: -1.418
Iter 76/100 - Loss: -1.484
Iter 77/100 - Loss: -1.564
Iter 78/100 - Loss: -1.599
Iter 79/100 - Loss: -1.678
Iter 80/100 - Loss: -1.731
Iter 81/100 - Loss: -1.793
Iter 82/100 - Loss: -1.790
Iter 83/100 - Loss: -1.801
Iter 84/100 - Loss: -1.832
Iter 85/100 - Loss: -1.916
Iter 86/100 - Loss: -1.974
Iter 87/100 - Loss: -2.030
Iter 88/100 - Loss: -2.108
Iter 89/100 - Loss: -2.166
Iter 90/100 - Loss: -2.152
Iter 91/100 - Loss: -2.119
Iter 92/100 - Loss: -2.088
Iter 93/100 - Loss: -2.101
Iter 94/100 - Loss: -2.174
Iter 95/100 - Loss: -2.247
Iter 96/100 - Loss: -2.223
Iter 97/100 - Loss: -1.789
Iter 98/100 - Loss: -2.284
Iter 99/100 - Loss: -2.083
Iter 100/100 - Loss: -2.164


Now that we’ve learned good hyperparameters, it’s time to use our model to make predictions. The spectral mixture kernel is especially good at extrapolation. To that end, we’ll see how well the model extrapolates past the interval [0, 1].

In the next cell, we plot the mean and confidence region of the Gaussian process model. The confidence_region method is a helper method that returns 2 standard deviations above and below the mean.
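As a minimal illustration of what confidence_region computes (using plain tensors rather than a GPyTorch distribution), the bounds are simply two standard deviations on either side of the mean:

```python
import torch

# Hypothetical posterior mean and standard deviation at three test points
mean = torch.tensor([0.0, 0.5, 1.0])
stddev = torch.tensor([0.2, 0.3, 0.4])

# confidence_region() is equivalent to mean minus/plus 2 standard deviations,
# i.e. roughly a 95% credible interval for a Gaussian posterior
lower = mean - 2 * stddev
upper = mean + 2 * stddev
```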

:

# Test points every 0.1 between 0 and 5
test_x = torch.linspace(0, 5, 51)

# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

# The gpytorch.settings.fast_pred_var flag activates LOVE (for fast variances)
# See https://arxiv.org/abs/1803.06058
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    # Make predictions
    observed_pred = likelihood(model(test_x))

    # Initialize plot
    f, ax = plt.subplots(1, 1, figsize=(4, 3))

    # Get upper and lower confidence bounds
    lower, upper = observed_pred.confidence_region()
    # Plot training data as black stars
    ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
    # Plot predictive means as blue line
    ax.plot(test_x.numpy(), observed_pred.mean.numpy(), 'b')
    # Shade between the lower and upper confidence bounds
    ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
    ax.set_ylim([-3, 3])
    ax.legend(['Observed Data', 'Mean', 'Confidence'])




### Fully Bayesian GPs - Sampling Hyperparameters with NUTS¶

In this notebook, we’ll demonstrate how to integrate GPyTorch and NUTS to sample GP hyperparameters and perform GP inference in a fully Bayesian way.

The high level overview of sampling in GPyTorch is as follows:

1. Define your model as normal, extending ExactGP and defining a forward method.
2. For each parameter your model defines, you’ll need to register a GPyTorch prior with that parameter, or some function of the parameter. If you use something other than a default closure (e.g., by specifying a parameter or transformed parameter name), you’ll need to also specify a setting_closure: see the docs for gpytorch.Module.register_prior.
3. Define a pyro model that has a sample site for each GP parameter, and then computes a loss. For your convenience, we define a pyro_sample_from_prior method on gpytorch.Module that does the former operation. For the latter operation, just call mll.pyro_factor(output, y) instead of mll(output, y) to get your loss.
4. Run NUTS (or HMC etc) on the pyro model you just defined to generate samples. Note this can take quite a while or no time at all depending on the priors you’ve defined.
5. Load the samples into the model, converting the model from a simple GP to a batch GP (see our example notebook on simple batch GPs), where each GP in the batch corresponds to a different hyperparameter sample.
6. Pass test data through the batch GP to get predictions for each hyperparameter sample.
:

import math
import torch
import gpytorch
import pyro
from pyro.infer.mcmc import NUTS, MCMC
from matplotlib import pyplot as plt

%matplotlib inline

:

# Training data is 6 points in [0,1] inclusive regularly spaced
train_x = torch.linspace(0, 1, 6)
# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2

:

# We will use the simplest form of GP model, exact inference
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.PeriodicKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


#### Running Sampling¶

The next cell is the first piece of code that differs substantially from other workflows. In it, we create the model and likelihood as usual, and then register priors on each of the model's parameters. Note that we can register priors directly on transformed parameters (e.g., "lengthscale") rather than raw ones (e.g., "raw_lengthscale"). This is convenient; however, you'll need to specify a prior whose support is fully contained in the domain of the parameter. For example, a lengthscale prior must have support only over the positive reals or a subset thereof.

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
num_samples = 2 if smoke_test else 100
warmup_steps = 2 if smoke_test else 200

from gpytorch.priors import LogNormalPrior, NormalPrior, UniformPrior
# Use a positive constraint instead of usual GreaterThan(1e-4) so that LogNormal has support over full range.
likelihood = gpytorch.likelihoods.GaussianLikelihood(noise_constraint=gpytorch.constraints.Positive())
model = ExactGPModel(train_x, train_y, likelihood)

model.mean_module.register_prior("mean_prior", UniformPrior(-1, 1), "constant")
model.covar_module.base_kernel.register_prior("lengthscale_prior", UniformPrior(0.01, 0.5), "lengthscale")
model.covar_module.base_kernel.register_prior("period_length_prior", UniformPrior(0.05, 2.5), "period_length")
model.covar_module.register_prior("outputscale_prior", UniformPrior(1, 2), "outputscale")
likelihood.register_prior("noise_prior", UniformPrior(0.05, 0.3), "noise")

mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

def pyro_model(x, y):
    model.pyro_sample_from_prior()
    output = model(x)
    loss = mll.pyro_factor(output, y)
    return y

nuts_kernel = NUTS(pyro_model)
mcmc_run = MCMC(nuts_kernel, num_samples=num_samples, warmup_steps=warmup_steps, disable_progbar=smoke_test)
mcmc_run.run(train_x, train_y)

Sample: 100%|█████████████████████████████████████████| 300/300 [00:17, 17.30it/s, step size=6.50e-01, acc. prob=0.895]


In the next cell, we load the samples generated by NUTS into the model. This converts the model from a single GP to a batch of num_samples GPs (in this case, 100).

:

model.pyro_load_from_samples(mcmc_run.get_samples())

:

model.eval()
test_x = torch.linspace(0, 1, 101).unsqueeze(-1)
test_y = torch.sin(test_x * (2 * math.pi))
expanded_test_x = test_x.unsqueeze(0).repeat(num_samples, 1, 1)
output = model(expanded_test_x)


#### Plot Mean Functions¶

In the next cell, we plot the first 25 mean functions on the same plot. This particular example has a fairly large amount of data for only 1 dimension, so the hyperparameter posterior is quite tight and there is relatively little variance.

:

with torch.no_grad():
    # Initialize plot
    f, ax = plt.subplots(1, 1, figsize=(4, 3))

    # Plot training data as black stars
    ax.plot(train_x.numpy(), train_y.numpy(), 'k*', zorder=10)

    for i in range(min(num_samples, 25)):
        # Plot predictive means as blue line
        ax.plot(test_x.numpy(), output.mean[i].detach().numpy(), 'b', linewidth=0.3)

    # Shade between the lower and upper confidence bounds
    # ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
    ax.set_ylim([-3, 3])
    ax.legend(['Observed Data', 'Sampled Means'])


Loading a fully Bayesian model from disk is slightly different from loading a standard model because the process of sampling changes the shapes of the model’s parameters. To account for this, you’ll need to call load_strict_shapes(False) on the model before loading the state dict. In the cell below, we demonstrate this by recreating the model and loading from the state dict.

Note that without the load_strict_shapes call, this would fail.

:

state_dict = model.state_dict()
model = ExactGPModel(train_x, train_y, likelihood)

# Load parameters without standard shape checking.
model.load_strict_shapes(False)
model.load_state_dict(state_dict)






### GP Regression with Uncertain Inputs¶

#### Introduction¶

In this notebook, we’re going to demonstrate one way of dealing with uncertainty in our training data. Let’s say that we’re collecting training data that models the following function.

\begin{align} y &= \sin(2\pi x) + \epsilon \\ \epsilon &\sim \mathcal{N}(0, 0.04) \end{align}

However, now assume that we’re a bit uncertain about our features. In particular, we’re going to assume that every x_i value is not a point but a distribution instead. E.g.

$x_i \sim \mathcal{N}(\mu_i, \sigma_i^2).$

##### Using a distributional kernel to deal with uncertain inputs¶

Rather than using a variational method (see the GP Regression with Uncertain Inputs tutorial in the variational examples), if we explicitly know the type of uncertainty in our inputs, we can pass that into our kernel.

More specifically, assuming Gaussian inputs, we will compute the symmetrized KL divergence between the Gaussian inputs.
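For intuition, here is a minimal sketch (plain Python, not the GPyTorch kernel internals) of the symmetrized KL divergence between two univariate Gaussians, the quantity the GaussianSymmetrizedKLKernel is built on:

```python
import math

def kl_gauss(mu1, var1, mu2, var2):
    # KL( N(mu1, var1) || N(mu2, var2) ) for univariate Gaussians
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def symmetrized_kl(mu1, var1, mu2, var2):
    # Symmetrize by summing the two directed divergences
    return kl_gauss(mu1, var1, mu2, var2) + kl_gauss(mu2, var2, mu1, var1)
```

The divergence is zero for identical input distributions and grows as the means or variances separate; roughly speaking, the kernel's similarity decays with this distance.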

:

import math
import torch
import tqdm
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline

:

# Training data is 20 points in [0,1] inclusive regularly spaced
train_x_mean = torch.linspace(0, 1, 20)
# We'll assume the variance shrinks the closer we get to 1
train_x_stdv = torch.linspace(0.03, 0.01, 20)

# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x_mean * (2 * math.pi)) + torch.randn(train_x_mean.size()) * 0.2


To pass in the distributional training data, we stack the means and the log variances into a single tensor.

:

train_x_distributional = torch.stack((train_x_mean, (train_x_stdv**2).log()), dim=1)

:

f, ax = plt.subplots(1, 1, figsize=(8, 3))
ax.errorbar(train_x_mean, train_y, xerr=(train_x_stdv * 2), fmt="k*", label="Train Data")
ax.legend()

:

We train the hyperparameters of the resulting distributional GP via type-II gradient descent, as is standard in many settings. We could also do fully Bayesian inference.

:

from gpytorch.models import ExactGP
from gpytorch.kernels import GaussianSymmetrizedKLKernel, ScaleKernel
from gpytorch.means import ConstantMean

class ExactGPModel(ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = ConstantMean()
        self.covar_module = ScaleKernel(GaussianSymmetrizedKLKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x_distributional, train_y, likelihood)

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iter = 2 if smoke_test else 500

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.25)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x_distributional)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()

Iter 1/500 - Loss: 1.274   lengthscale: 0.693   noise: 0.693
Iter 2/500 - Loss: 1.242   lengthscale: 0.826   noise: 0.576
Iter 3/500 - Loss: 1.173   lengthscale: 0.955   noise: 0.475
Iter 4/500 - Loss: 1.141   lengthscale: 1.102   noise: 0.389
Iter 5/500 - Loss: 1.121   lengthscale: 1.266   noise: 0.317
Iter 6/500 - Loss: 1.097   lengthscale: 1.442   noise: 0.261
Iter 7/500 - Loss: 1.093   lengthscale: 1.621   noise: 0.220
Iter 8/500 - Loss: 1.112   lengthscale: 1.801   noise: 0.192
Iter 9/500 - Loss: 1.112   lengthscale: 1.988   noise: 0.177
Iter 10/500 - Loss: 1.095   lengthscale: 2.188   noise: 0.172
Iter 11/500 - Loss: 1.081   lengthscale: 2.401   noise: 0.172
Iter 12/500 - Loss: 1.068   lengthscale: 2.624   noise: 0.177
Iter 13/500 - Loss: 1.049   lengthscale: 2.856   noise: 0.183
Iter 14/500 - Loss: 1.031   lengthscale: 3.095   noise: 0.188
Iter 15/500 - Loss: 1.020   lengthscale: 3.340   noise: 0.191
Iter 16/500 - Loss: 1.014   lengthscale: 3.589   noise: 0.189
Iter 17/500 - Loss: 1.007   lengthscale: 3.842   noise: 0.183
Iter 18/500 - Loss: 0.997   lengthscale: 4.098   noise: 0.173
Iter 19/500 - Loss: 0.983   lengthscale: 4.356   noise: 0.159
Iter 20/500 - Loss: 0.968   lengthscale: 4.616   noise: 0.144
Iter 21/500 - Loss: 0.951   lengthscale: 4.877   noise: 0.128
Iter 22/500 - Loss: 0.933   lengthscale: 5.139   noise: 0.112
Iter 23/500 - Loss: 0.913   lengthscale: 5.402   noise: 0.097
Iter 24/500 - Loss: 0.891   lengthscale: 5.667   noise: 0.083
Iter 25/500 - Loss: 0.870   lengthscale: 5.933   noise: 0.070
Iter 26/500 - Loss: 0.850   lengthscale: 6.202   noise: 0.059
Iter 27/500 - Loss: 0.832   lengthscale: 6.474   noise: 0.050
Iter 28/500 - Loss: 0.814   lengthscale: 6.748   noise: 0.042
Iter 29/500 - Loss: 0.794   lengthscale: 7.025   noise: 0.035
Iter 30/500 - Loss: 0.775   lengthscale: 7.304   noise: 0.029
Iter 31/500 - Loss: 0.756   lengthscale: 7.586   noise: 0.024
Iter 32/500 - Loss: 0.737   lengthscale: 7.868   noise: 0.020
Iter 33/500 - Loss: 0.719   lengthscale: 8.152   noise: 0.017
Iter 34/500 - Loss: 0.701   lengthscale: 8.437   noise: 0.014
Iter 35/500 - Loss: 0.684   lengthscale: 8.722   noise: 0.012
Iter 36/500 - Loss: 0.668   lengthscale: 9.007   noise: 0.010
Iter 37/500 - Loss: 0.653   lengthscale: 9.291   noise: 0.008
Iter 38/500 - Loss: 0.637   lengthscale: 9.575   noise: 0.007
Iter 39/500 - Loss: 0.621   lengthscale: 9.859   noise: 0.006
Iter 40/500 - Loss: 0.606   lengthscale: 10.142   noise: 0.005
Iter 41/500 - Loss: 0.591   lengthscale: 10.424   noise: 0.005
Iter 42/500 - Loss: 0.577   lengthscale: 10.705   noise: 0.004
Iter 43/500 - Loss: 0.564   lengthscale: 10.985   noise: 0.003
Iter 44/500 - Loss: 0.551   lengthscale: 11.263   noise: 0.003
Iter 45/500 - Loss: 0.539   lengthscale: 11.540   noise: 0.003
Iter 46/500 - Loss: 0.526   lengthscale: 11.815   noise: 0.002
Iter 47/500 - Loss: 0.514   lengthscale: 12.088   noise: 0.002
Iter 48/500 - Loss: 0.503   lengthscale: 12.359   noise: 0.002
Iter 49/500 - Loss: 0.492   lengthscale: 12.627   noise: 0.002
Iter 50/500 - Loss: 0.482   lengthscale: 12.892   noise: 0.001
Iter 51/500 - Loss: 0.472   lengthscale: 13.154   noise: 0.001
Iter 52/500 - Loss: 0.462   lengthscale: 13.412   noise: 0.001
Iter 53/500 - Loss: 0.453   lengthscale: 13.668   noise: 0.001
Iter 54/500 - Loss: 0.445   lengthscale: 13.919   noise: 0.001
Iter 55/500 - Loss: 0.437   lengthscale: 14.167   noise: 0.001
Iter 56/500 - Loss: 0.429   lengthscale: 14.410   noise: 0.001
Iter 57/500 - Loss: 0.422   lengthscale: 14.649   noise: 0.001
Iter 58/500 - Loss: 0.415   lengthscale: 14.883   noise: 0.001
Iter 59/500 - Loss: 0.409   lengthscale: 15.111   noise: 0.001
Iter 60/500 - Loss: 0.403   lengthscale: 15.335   noise: 0.001
Iter 61/500 - Loss: 0.398   lengthscale: 15.552   noise: 0.001
Iter 62/500 - Loss: 0.393   lengthscale: 15.764   noise: 0.001
Iter 63/500 - Loss: 0.389   lengthscale: 15.970   noise: 0.001
Iter 64/500 - Loss: 0.385   lengthscale: 16.170   noise: 0.000
Iter 65/500 - Loss: 0.381   lengthscale: 16.363   noise: 0.000
Iter 66/500 - Loss: 0.378   lengthscale: 16.551   noise: 0.000
Iter 67/500 - Loss: 0.375   lengthscale: 16.731   noise: 0.000
Iter 68/500 - Loss: 0.373   lengthscale: 16.905   noise: 0.000
Iter 69/500 - Loss: 0.371   lengthscale: 17.072   noise: 0.000
Iter 70/500 - Loss: 0.369   lengthscale: 17.231   noise: 0.000
Iter 71/500 - Loss: 0.367   lengthscale: 17.383   noise: 0.000
Iter 72/500 - Loss: 0.366   lengthscale: 17.527   noise: 0.000
Iter 73/500 - Loss: 0.365   lengthscale: 17.662   noise: 0.000
Iter 74/500 - Loss: 0.364   lengthscale: 17.791   noise: 0.000
Iter 75/500 - Loss: 0.363   lengthscale: 17.911   noise: 0.000
Iter 76/500 - Loss: 0.363   lengthscale: 18.024   noise: 0.000
Iter 77/500 - Loss: 0.362   lengthscale: 18.129   noise: 0.000
Iter 78/500 - Loss: 0.362   lengthscale: 18.227   noise: 0.000
Iter 79/500 - Loss: 0.362   lengthscale: 18.317   noise: 0.000
Iter 80/500 - Loss: 0.362   lengthscale: 18.400   noise: 0.000
Iter 81/500 - Loss: 0.362   lengthscale: 18.477   noise: 0.000
Iter 82/500 - Loss: 0.362   lengthscale: 18.546   noise: 0.000
Iter 83/500 - Loss: 0.362   lengthscale: 18.608   noise: 0.000
Iter 84/500 - Loss: 0.362   lengthscale: 18.664   noise: 0.000
Iter 85/500 - Loss: 0.362   lengthscale: 18.712   noise: 0.000
Iter 86/500 - Loss: 0.362   lengthscale: 18.755   noise: 0.000
Iter 87/500 - Loss: 0.362   lengthscale: 18.791   noise: 0.000
Iter 88/500 - Loss: 0.362   lengthscale: 18.822   noise: 0.000
Iter 89/500 - Loss: 0.362   lengthscale: 18.848   noise: 0.000
Iter 90/500 - Loss: 0.363   lengthscale: 18.868   noise: 0.000
Iter 91/500 - Loss: 0.363   lengthscale: 18.884   noise: 0.000
Iter 92/500 - Loss: 0.363   lengthscale: 18.896   noise: 0.000
Iter 93/500 - Loss: 0.363   lengthscale: 18.904   noise: 0.000
Iter 94/500 - Loss: 0.363   lengthscale: 18.909   noise: 0.000
Iter 95/500 - Loss: 0.363   lengthscale: 18.910   noise: 0.000
Iter 96/500 - Loss: 0.363   lengthscale: 18.907   noise: 0.000
Iter 97/500 - Loss: 0.363   lengthscale: 18.902   noise: 0.000
Iter 98/500 - Loss: 0.363   lengthscale: 18.894   noise: 0.000
Iter 99/500 - Loss: 0.362   lengthscale: 18.884   noise: 0.000
Iter 100/500 - Loss: 0.362   lengthscale: 18.873   noise: 0.000
Iter 101/500 - Loss: 0.362   lengthscale: 18.859   noise: 0.000
Iter 102/500 - Loss: 0.362   lengthscale: 18.844   noise: 0.000
Iter 103/500 - Loss: 0.362   lengthscale: 18.828   noise: 0.000
Iter 104/500 - Loss: 0.362   lengthscale: 18.811   noise: 0.000
Iter 105/500 - Loss: 0.362   lengthscale: 18.794   noise: 0.000
Iter 106/500 - Loss: 0.362   lengthscale: 18.776   noise: 0.000
Iter 107/500 - Loss: 0.362   lengthscale: 18.758   noise: 0.000
Iter 108/500 - Loss: 0.362   lengthscale: 18.740   noise: 0.000
Iter 109/500 - Loss: 0.362   lengthscale: 18.722   noise: 0.000
Iter 110/500 - Loss: 0.362   lengthscale: 18.704   noise: 0.000
Iter 111/500 - Loss: 0.362   lengthscale: 18.687   noise: 0.000
Iter 112/500 - Loss: 0.362   lengthscale: 18.670   noise: 0.000
Iter 113/500 - Loss: 0.362   lengthscale: 18.653   noise: 0.000
Iter 114/500 - Loss: 0.362   lengthscale: 18.638   noise: 0.000
Iter 115/500 - Loss: 0.362   lengthscale: 18.622   noise: 0.000
Iter 116/500 - Loss: 0.362   lengthscale: 18.608   noise: 0.000
Iter 117/500 - Loss: 0.362   lengthscale: 18.595   noise: 0.000
Iter 118/500 - Loss: 0.362   lengthscale: 18.583   noise: 0.000
Iter 119/500 - Loss: 0.362   lengthscale: 18.572   noise: 0.000
Iter 120/500 - Loss: 0.362   lengthscale: 18.561   noise: 0.000
Iter 121/500 - Loss: 0.362   lengthscale: 18.552   noise: 0.000
Iter 122/500 - Loss: 0.362   lengthscale: 18.543   noise: 0.000
Iter 123/500 - Loss: 0.362   lengthscale: 18.536   noise: 0.000
Iter 124/500 - Loss: 0.362   lengthscale: 18.529   noise: 0.000
Iter 125/500 - Loss: 0.362   lengthscale: 18.524   noise: 0.000
Iter 126/500 - Loss: 0.362   lengthscale: 18.519   noise: 0.000
Iter 127/500 - Loss: 0.362   lengthscale: 18.516   noise: 0.000
Iter 128/500 - Loss: 0.362   lengthscale: 18.513   noise: 0.000
Iter 129/500 - Loss: 0.362   lengthscale: 18.511   noise: 0.000
Iter 130/500 - Loss: 0.362   lengthscale: 18.510   noise: 0.000
Iter 131/500 - Loss: 0.362   lengthscale: 18.509   noise: 0.000
Iter 132/500 - Loss: 0.362   lengthscale: 18.510   noise: 0.000
Iter 133/500 - Loss: 0.362   lengthscale: 18.511   noise: 0.000
Iter 134/500 - Loss: 0.362   lengthscale: 18.512   noise: 0.000
Iter 135/500 - Loss: 0.362   lengthscale: 18.514   noise: 0.000
Iter 136/500 - Loss: 0.362   lengthscale: 18.517   noise: 0.000
Iter 137/500 - Loss: 0.362   lengthscale: 18.520   noise: 0.000
Iter 138/500 - Loss: 0.362   lengthscale: 18.524   noise: 0.000
Iter 139/500 - Loss: 0.362   lengthscale: 18.528   noise: 0.000
Iter 140/500 - Loss: 0.362   lengthscale: 18.532   noise: 0.000
Iter 141/500 - Loss: 0.362   lengthscale: 18.537   noise: 0.000
Iter 142/500 - Loss: 0.362   lengthscale: 18.542   noise: 0.000
Iter 143/500 - Loss: 0.362   lengthscale: 18.547   noise: 0.000
Iter 144/500 - Loss: 0.362   lengthscale: 18.552   noise: 0.000
Iter 145/500 - Loss: 0.362   lengthscale: 18.558   noise: 0.000
Iter 146/500 - Loss: 0.362   lengthscale: 18.563   noise: 0.000
Iter 147/500 - Loss: 0.362   lengthscale: 18.569   noise: 0.000
Iter 148/500 - Loss: 0.362   lengthscale: 18.574   noise: 0.000
Iter 149/500 - Loss: 0.362   lengthscale: 18.580   noise: 0.000
Iter 150/500 - Loss: 0.362   lengthscale: 18.586   noise: 0.000
Iter 151/500 - Loss: 0.362   lengthscale: 18.592   noise: 0.000
Iter 152/500 - Loss: 0.362   lengthscale: 18.597   noise: 0.000
Iter 153/500 - Loss: 0.362   lengthscale: 18.603   noise: 0.000
Iter 154/500 - Loss: 0.362   lengthscale: 18.608   noise: 0.000
Iter 155/500 - Loss: 0.362   lengthscale: 18.614   noise: 0.000
Iter 156/500 - Loss: 0.362   lengthscale: 18.619   noise: 0.000
Iter 157/500 - Loss: 0.362   lengthscale: 18.625   noise: 0.000
Iter 158/500 - Loss: 0.362   lengthscale: 18.630   noise: 0.000
Iter 159/500 - Loss: 0.362   lengthscale: 18.635   noise: 0.000
Iter 160/500 - Loss: 0.362   lengthscale: 18.640   noise: 0.000
Iter 161/500 - Loss: 0.362   lengthscale: 18.645   noise: 0.000
Iter 162/500 - Loss: 0.362   lengthscale: 18.650   noise: 0.000
Iter 163/500 - Loss: 0.362   lengthscale: 18.655   noise: 0.000
Iter 164/500 - Loss: 0.362   lengthscale: 18.659   noise: 0.000
Iter 165/500 - Loss: 0.361   lengthscale: 18.664   noise: 0.000
Iter 166/500 - Loss: 0.361   lengthscale: 18.669   noise: 0.000
Iter 167/500 - Loss: 0.361   lengthscale: 18.673   noise: 0.000
Iter 168/500 - Loss: 0.361   lengthscale: 18.678   noise: 0.000
Iter 169/500 - Loss: 0.361   lengthscale: 18.682   noise: 0.000
Iter 170/500 - Loss: 0.361   lengthscale: 18.686   noise: 0.000
Iter 171/500 - Loss: 0.361   lengthscale: 18.691   noise: 0.000
Iter 172/500 - Loss: 0.361   lengthscale: 18.695   noise: 0.000
Iter 173/500 - Loss: 0.361   lengthscale: 18.699   noise: 0.000
Iter 174/500 - Loss: 0.361   lengthscale: 18.704   noise: 0.000
Iter 175/500 - Loss: 0.361   lengthscale: 18.708   noise: 0.000
Iter 176/500 - Loss: 0.361   lengthscale: 18.713   noise: 0.000
Iter 177/500 - Loss: 0.361   lengthscale: 18.717   noise: 0.000
Iter 178/500 - Loss: 0.361   lengthscale: 18.721   noise: 0.000
Iter 179/500 - Loss: 0.361   lengthscale: 18.726   noise: 0.000
Iter 180/500 - Loss: 0.361   lengthscale: 18.731   noise: 0.000
Iter 181/500 - Loss: 0.361   lengthscale: 18.735   noise: 0.000
Iter 182/500 - Loss: 0.361   lengthscale: 18.740   noise: 0.000
Iter 183/500 - Loss: 0.361   lengthscale: 18.745   noise: 0.000
Iter 184/500 - Loss: 0.361   lengthscale: 18.750   noise: 0.000
Iter 185/500 - Loss: 0.361   lengthscale: 18.755   noise: 0.000
Iter 186/500 - Loss: 0.361   lengthscale: 18.760   noise: 0.000
Iter 187/500 - Loss: 0.361   lengthscale: 18.765   noise: 0.000
Iter 188/500 - Loss: 0.361   lengthscale: 18.771   noise: 0.000
Iter 189/500 - Loss: 0.361   lengthscale: 18.776   noise: 0.000
Iter 190/500 - Loss: 0.361   lengthscale: 18.782   noise: 0.000
Iter 191/500 - Loss: 0.361   lengthscale: 18.788   noise: 0.000
Iter 192/500 - Loss: 0.361   lengthscale: 18.794   noise: 0.000
Iter 193/500 - Loss: 0.361   lengthscale: 18.800   noise: 0.000
Iter 194/500 - Loss: 0.361   lengthscale: 18.806   noise: 0.000
Iter 195/500 - Loss: 0.361   lengthscale: 18.813   noise: 0.000
Iter 196/500 - Loss: 0.361   lengthscale: 18.820   noise: 0.000
Iter 197/500 - Loss: 0.361   lengthscale: 18.827   noise: 0.000
Iter 198/500 - Loss: 0.361   lengthscale: 18.834   noise: 0.000
Iter 199/500 - Loss: 0.361   lengthscale: 18.841   noise: 0.000
Iter 200/500 - Loss: 0.361   lengthscale: 18.848   noise: 0.000
Iter 201/500 - Loss: 0.361   lengthscale: 18.856   noise: 0.000
Iter 202/500 - Loss: 0.361   lengthscale: 18.863   noise: 0.000
Iter 203/500 - Loss: 0.361   lengthscale: 18.871   noise: 0.000
Iter 204/500 - Loss: 0.361   lengthscale: 18.880   noise: 0.000
Iter 205/500 - Loss: 0.361   lengthscale: 18.888   noise: 0.000
Iter 206/500 - Loss: 0.361   lengthscale: 18.897   noise: 0.001
Iter 207/500 - Loss: 0.361   lengthscale: 18.905   noise: 0.001
Iter 208/500 - Loss: 0.361   lengthscale: 18.914   noise: 0.001
Iter 209/500 - Loss: 0.361   lengthscale: 18.924   noise: 0.001
Iter 210/500 - Loss: 0.361   lengthscale: 18.933   noise: 0.001
Iter 211/500 - Loss: 0.361   lengthscale: 18.943   noise: 0.001
Iter 212/500 - Loss: 0.361   lengthscale: 18.953   noise: 0.001
Iter 213/500 - Loss: 0.361   lengthscale: 18.963   noise: 0.001
Iter 214/500 - Loss: 0.361   lengthscale: 18.974   noise: 0.001
Iter 215/500 - Loss: 0.361   lengthscale: 18.985   noise: 0.001
Iter 216/500 - Loss: 0.361   lengthscale: 18.996   noise: 0.001
Iter 217/500 - Loss: 0.361   lengthscale: 19.007   noise: 0.001
Iter 218/500 - Loss: 0.361   lengthscale: 19.019   noise: 0.001
Iter 219/500 - Loss: 0.361   lengthscale: 19.030   noise: 0.001
Iter 220/500 - Loss: 0.361   lengthscale: 19.043   noise: 0.001
Iter 221/500 - Loss: 0.360   lengthscale: 19.055   noise: 0.001
Iter 222/500 - Loss: 0.360   lengthscale: 19.068   noise: 0.001
Iter 223/500 - Loss: 0.360   lengthscale: 19.081   noise: 0.001
Iter 224/500 - Loss: 0.360   lengthscale: 19.095   noise: 0.001
Iter 225/500 - Loss: 0.360   lengthscale: 19.109   noise: 0.001
Iter 226/500 - Loss: 0.360   lengthscale: 19.123   noise: 0.001
Iter 227/500 - Loss: 0.360   lengthscale: 19.138   noise: 0.001
Iter 228/500 - Loss: 0.360   lengthscale: 19.153   noise: 0.001
Iter 229/500 - Loss: 0.360   lengthscale: 19.169   noise: 0.001
Iter 230/500 - Loss: 0.360   lengthscale: 19.185   noise: 0.001
Iter 231/500 - Loss: 0.360   lengthscale: 19.202   noise: 0.001
Iter 232/500 - Loss: 0.360   lengthscale: 19.219   noise: 0.001
Iter 233/500 - Loss: 0.360   lengthscale: 19.236   noise: 0.001
Iter 234/500 - Loss: 0.360   lengthscale: 19.254   noise: 0.001
Iter 235/500 - Loss: 0.360   lengthscale: 19.273   noise: 0.001
Iter 236/500 - Loss: 0.360   lengthscale: 19.292   noise: 0.001
Iter 237/500 - Loss: 0.360   lengthscale: 19.311   noise: 0.001
Iter 238/500 - Loss: 0.359   lengthscale: 19.332   noise: 0.001
Iter 239/500 - Loss: 0.359   lengthscale: 19.353   noise: 0.001
Iter 240/500 - Loss: 0.359   lengthscale: 19.374   noise: 0.001
Iter 241/500 - Loss: 0.359   lengthscale: 19.396   noise: 0.001
Iter 242/500 - Loss: 0.359   lengthscale: 19.419   noise: 0.001
Iter 243/500 - Loss: 0.359   lengthscale: 19.443   noise: 0.001
Iter 244/500 - Loss: 0.359   lengthscale: 19.467   noise: 0.001
Iter 245/500 - Loss: 0.359   lengthscale: 19.492   noise: 0.001
Iter 246/500 - Loss: 0.359   lengthscale: 19.518   noise: 0.001
Iter 247/500 - Loss: 0.358   lengthscale: 19.545   noise: 0.001
Iter 248/500 - Loss: 0.358   lengthscale: 19.572   noise: 0.001
Iter 249/500 - Loss: 0.358   lengthscale: 19.600   noise: 0.001
Iter 250/500 - Loss: 0.358   lengthscale: 19.630   noise: 0.001
Iter 251/500 - Loss: 0.358   lengthscale: 19.660   noise: 0.001
Iter 252/500 - Loss: 0.358   lengthscale: 19.690   noise: 0.001
Iter 253/500 - Loss: 0.358   lengthscale: 19.722   noise: 0.001
Iter 254/500 - Loss: 0.357   lengthscale: 19.755   noise: 0.001
Iter 255/500 - Loss: 0.357   lengthscale: 19.789   noise: 0.001
Iter 256/500 - Loss: 0.357   lengthscale: 19.823   noise: 0.001
Iter 257/500 - Loss: 0.357   lengthscale: 19.859   noise: 0.001
Iter 258/500 - Loss: 0.357   lengthscale: 19.896   noise: 0.001
Iter 259/500 - Loss: 0.357   lengthscale: 19.933   noise: 0.001
Iter 260/500 - Loss: 0.356   lengthscale: 19.972   noise: 0.001
...
Iter 498/500 - Loss: 0.273   lengthscale: 38.456   noise: 0.013
Iter 499/500 - Loss: 0.273   lengthscale: 38.564   noise: 0.013
Iter 500/500 - Loss: 0.272   lengthscale: 38.672   noise: 0.013


Now, we test predictions. For simplicity, we will assume a fixed input variance of 0.01.

:

# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

# Test points are regularly spaced along [0,1]
# Make predictions by feeding model through likelihood
test_x = torch.linspace(0, 1, 51)
test_x_distributional = torch.stack((test_x, (1e-2 * torch.ones_like(test_x)).log()), dim=1)
with torch.no_grad():
    observed_pred = likelihood(model(test_x_distributional))

# Initialize plot
f, ax = plt.subplots(1, 1, figsize=(8, 3))

# Get upper and lower confidence bounds
lower, upper = observed_pred.confidence_region()
# Plot training data as black stars
# (train_x_mean and train_x_stdv are defined earlier in this notebook)
ax.errorbar(train_x_mean.numpy(), train_y.numpy(), xerr=train_x_stdv, fmt='k*')
# Plot predictive means as blue line
ax.plot(test_x.numpy(), observed_pred.mean.numpy(), 'b')
# Shade between the lower and upper confidence bounds
ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
ax.set_ylim([-3, 3])
ax.legend(['Observed Data', 'Mean', 'Confidence'])

As a final note, we’ve made it very easy to extend the distributional kernel class by exposing a generic DistributionalInputKernel class that takes as input any distance function over probability distributions.





## Exact GPs with Scalable (GPU) Inference¶

In GPyTorch, exact GP inference is still our preferred approach to large regression datasets. By coupling GPU acceleration with BlackBox Matrix-Matrix Inference (BBMM) and Lanczos Variance Estimates (LOVE), GPyTorch can perform exact inference on datasets with over 1,000,000 data points while making very few approximations.

### BlackBox Matrix-Matrix Inference (BBMM)¶

BlackBox Matrix-Matrix Inference (introduced by Gardner et al., 2018) computes the GP marginal log likelihood using only matrix multiplication. It is stochastic, but can scale exact GPs to millions of data points.
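The heart of BBMM is replacing Cholesky-based solves with iterative methods that touch the kernel matrix only through matrix multiplies. As a rough illustration (plain NumPy, not GPyTorch's actual implementation, which uses a batched, preconditioned variant), here is a minimal conjugate-gradients solver for K x = b:

```python
import numpy as np

def conjugate_gradients(matmul, b, tol=1e-10, max_iter=1000):
    """Solve K x = b given only the ability to compute K @ v (K symmetric positive definite)."""
    x = np.zeros_like(b)
    r = b - matmul(x)          # residual
    p = r.copy()               # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Kp = matmul(p)
        alpha = rs_old / (p @ Kp)
        x += alpha * p
        r -= alpha * Kp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# A small RBF kernel matrix plus noise (symmetric positive definite)
rng = np.random.default_rng(0)
pts = rng.uniform(size=50)
K = np.exp(-0.5 * (pts[:, None] - pts[None, :]) ** 2 / 0.1 ** 2) + 0.1 * np.eye(50)
b = rng.normal(size=50)

x_cg = conjugate_gradients(lambda v: K @ v, b)
```

Because the solver only ever calls `matmul`, the same code works even when K is never formed explicitly; this matrix-multiply-only access pattern is what makes GPU-friendly, large-scale solves possible.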

### Lanczos Variance Estimates (LOVE)¶

Lanczos Variance Estimates (LOVE) (introduced by Pleiss et al., 2018) is a technique to rapidly speed up predictive variances and posterior sampling. Check out the GP Regression with Fast Variances and Sampling (LOVE) notebook to see how to use LOVE in GPyTorch, and how it compares to standard variance computations.
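The essence of LOVE is a low-rank root decomposition K ≈ R Rᵀ (with R of rank k ≪ n) obtained from Lanczos iterations, which can then be cached and reused for fast predictive variances. A hedged NumPy sketch, using a truncated eigendecomposition as a stand-in for the Lanczos factor:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(size=100))
# RBF kernel matrix with a small diagonal jitter
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.2 ** 2) + 1e-3 * np.eye(100)

# Rank-k root decomposition K ~= R @ R.T
# (GPyTorch builds R with Lanczos; here we cheat with a full eigendecomposition)
w, V = np.linalg.eigh(K)          # eigenvalues in ascending order
k = 20
R = V[:, -k:] * np.sqrt(w[-k:])   # keep the k largest eigenpairs

approx = R @ R.T
rel_err = np.abs(approx - K).mean() / np.abs(K).mean()
```

Because kernel matrices typically have rapidly decaying spectra, a small rank k already reconstructs K almost exactly, and all downstream variance computations can reuse the cached factor R.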

#### GP Regression with LOVE for Fast Predictive Variances and Sampling¶

##### Overview¶

In this notebook, we demonstrate that LOVE (the method for fast variances and sampling introduced in this paper https://arxiv.org/abs/1803.06058) can significantly reduce the cost of computing predictive distributions. This can be especially useful in settings like small-scale Bayesian optimization, where predictions need to be made at enormous numbers of candidate points.

In this notebook, we will train a KISS-GP model on the elevators UCI dataset, and then compare the time required to make predictions with and without LOVE.

NOTE: The timing results reported in the paper compare the time required to compute (co)variances only. Because excluding the mean computations from the timing results requires hacking the internals of GPyTorch, the timing results presented in this notebook include the time required to compute predictive means, which are not accelerated by LOVE. Nevertheless, as we will see, LOVE achieves impressive speed-ups.

:

import math
import torch
import gpytorch
import tqdm
from matplotlib import pyplot as plt

# Make plots inline
%matplotlib inline


For this example notebook, we’ll be using the elevators UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. For this notebook, we’ll simply be splitting the data using the first 80% of the data as training and the last 20% as testing.

Note: Running the next cell will attempt to download a small dataset file to the current directory.

:

import urllib.request
import os
from math import floor
from scipy.io import loadmat

# this is for running the notebook in our testing framework
smoke_test = ('CI' in os.environ)

if not smoke_test and not os.path.isfile('../elevators.mat'):
    # download '../elevators.mat' here (the download URL is omitted in this excerpt)
    urllib.request.urlretrieve(..., '../elevators.mat')

if smoke_test:  # this is for running the notebook in our testing framework
    X, y = torch.randn(100, 3), torch.randn(100)
else:
    data = torch.Tensor(loadmat('../elevators.mat')['data'])
    X = data[:, :-1]
    X = X - X.min(0)[0]
    X = 2 * (X / X.max(0)[0]) - 1
    y = data[:, -1]

train_n = int(floor(0.8 * len(X)))
train_x = X[:train_n, :].contiguous()
train_y = y[:train_n].contiguous()

test_x = X[train_n:, :].contiguous()
test_y = y[train_n:].contiguous()

if torch.cuda.is_available():
    train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()


LOVE can be used with any type of GP model, including exact GPs, multitask models and scalable approximations. Here we demonstrate LOVE in conjunction with KISS-GP, which has the amazing property of producing constant time variances.

##### The KISS-GP + LOVE GP Model¶

We now define the GP model. For more details on the use of GP models, see our simpler examples. This model uses a GridInterpolationKernel (SKI) with a deep RBF base kernel. The forward method passes the input data x through the neural network feature extractor defined below, scales the resulting features to be between -1 and 1, and then calls the kernel.

The Deep RBF kernel (DKL) uses a neural network as an initial feature extractor. In this case, we use a fully connected network with the architecture d -> 1000 -> 500 -> 50 -> 2, as described in the original DKL paper. All of the code below uses standard PyTorch implementations of neural network layers.

:

class LargeFeatureExtractor(torch.nn.Sequential):
    def __init__(self, input_dim):
        super(LargeFeatureExtractor, self).__init__()
        # Fully connected architecture d -> 1000 -> 500 -> 50 -> 2, as described above
        self.add_module('linear1', torch.nn.Linear(input_dim, 1000))
        self.add_module('relu1', torch.nn.ReLU())
        self.add_module('linear2', torch.nn.Linear(1000, 500))
        self.add_module('relu2', torch.nn.ReLU())
        self.add_module('linear3', torch.nn.Linear(500, 50))
        self.add_module('relu3', torch.nn.ReLU())
        self.add_module('linear4', torch.nn.Linear(50, 2))

class GPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)

        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.GridInterpolationKernel(
            gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel()),
            grid_size=100, num_dims=2,
        )

        # Also add the deep net
        self.feature_extractor = LargeFeatureExtractor(input_dim=train_x.size(-1))

    def forward(self, x):
        # We're first putting our data through a deep net (feature extractor)
        projected_x = self.feature_extractor(x)
        # We're also scaling the features so that they're nice values
        projected_x = projected_x - projected_x.min(0)[0]
        projected_x = 2 * (projected_x / projected_x.max(0)[0]) - 1

        # The rest of this looks like what we've seen
        mean_x = self.mean_module(projected_x)
        covar_x = self.covar_module(projected_x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPRegressionModel(train_x, train_y, likelihood)

if torch.cuda.is_available():
    model = model.cuda()
    likelihood = likelihood.cuda()

###### Training the model¶

The cell below trains the GP model, finding optimal hyperparameters using Type-II MLE. We run 20 iterations of training using the Adam optimizer built in to PyTorch. With a decent GPU, this should only take a few seconds.

:

training_iterations = 1 if smoke_test else 20

# Find optimal model hyperparameters
model.train()
likelihood.train()

optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

def train():
    iterator = tqdm.notebook.tqdm(range(training_iterations))
    for i in iterator:
        # Zero gradients from the previous iteration
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        loss.backward()
        iterator.set_postfix(loss=loss.item())
        optimizer.step()

%time train()


CPU times: user 2.1 s, sys: 136 ms, total: 2.24 s
Wall time: 2.23 s

##### Computing predictive variances (KISS-GP or Exact GPs)¶
###### Using standard computations (without LOVE)¶

The next cell gets the predictive covariance for the test set (and also technically gets the predictive mean, stored in preds.mean) using the standard SKI testing code, with no acceleration or precomputation.

Note: Full predictive covariance matrices (and the computations needed to get them) can be quite memory intensive. Depending on the memory available on your GPU, you may need to reduce the size of the test set for the code below to run. If you run out of memory, try replacing test_x below with something like test_x[:1000] to use the first 1000 test points only, and then restart the notebook.

:

import time

# Set into eval mode
model.eval()
likelihood.eval()

with torch.no_grad():
    start_time = time.time()
    preds = likelihood(model(test_x))
    exact_covar = preds.covariance_matrix
    exact_covar_time = time.time() - start_time

print(f"Time to compute exact mean + covariances: {exact_covar_time:.2f}s")

Time to compute exact mean + covariances: 1.81s

###### Using LOVE¶

Next we compute predictive covariances (and the predictive means) for LOVE, but starting from scratch. That is, we don’t yet have access to the precomputed cache discussed in the paper. This should still be faster than the full covariance computation code above.

To use LOVE, use the context manager with gpytorch.settings.fast_pred_var():

You can also set some of the LOVE settings with context managers as well. For example, gpytorch.settings.max_root_decomposition_size(100) affects the accuracy of the LOVE solves (larger is more accurate, but slower).

In this simple example, we allow a rank 10 root decomposition, although increasing this to rank 20-40 should not affect the timing results substantially.
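To build intuition for how the root-decomposition rank trades accuracy against cost, one can compare truncated-rank reconstructions of a kernel matrix directly. The following self-contained NumPy illustration (it assumes nothing about GPyTorch's internals) shows the reconstruction error shrinking as the rank grows:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(size=200))
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1 ** 2)

# Best rank-k root decomposition via eigendecomposition (eigh: ascending eigenvalues)
w, V = np.linalg.eigh(K)

def rank_k_error(k):
    # Clip tiny negative eigenvalues caused by floating-point round-off
    R = V[:, -k:] * np.sqrt(np.clip(w[-k:], 0.0, None))
    return np.linalg.norm(R @ R.T - K)   # Frobenius-norm error

errors = {k: rank_k_error(k) for k in (10, 20, 40)}
```

Because the spectrum of a smooth kernel decays quickly, the error drops by orders of magnitude between rank 10 and rank 40, which is why modest values of `max_root_decomposition_size` already give accurate variances.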

:

# Clear the cache from the previous computations
model.train()
likelihood.train()

# Set into eval mode
model.eval()
likelihood.eval()

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    start_time = time.time()
    preds = likelihood(model(test_x))
    fast_time_no_cache = time.time() - start_time


The above cell additionally computed the caches required to get fast predictions. From this point onwards, unless we put the model back in training mode, predictions should be extremely fast. The cell below re-runs the above code, but takes full advantage of both the mean cache and the LOVE cache for variances.

:

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    start_time = time.time()
    preds = likelihood(model(test_x))
    fast_covar = preds.covariance_matrix
    fast_time_with_cache = time.time() - start_time

:

print('Time to compute mean + covariances (no cache) {:.2f}s'.format(fast_time_no_cache))
print('Time to compute mean + variances (cache): {:.2f}s'.format(fast_time_with_cache))

Time to compute mean + covariances (no cache) 0.32s
Time to compute mean + variances (cache): 0.14s

###### Compute Error between Exact and Fast Variances¶

Finally, we compute the mean absolute error between the fast covariances computed by LOVE (stored in fast_covar) and the exact covariances computed previously, normalized by the magnitude of the exact values.

Note that these tests were run with a root decomposition of rank 10, which is about the minimum you would realistically ever run with. Despite this, the fast variance estimates are quite good. If more accuracy were needed, increasing max_root_decomposition_size would provide even better estimates.

:

mae = ((exact_covar - fast_covar).abs() / exact_covar.abs()).mean()
print(f"MAE between exact covar matrix and fast covar matrix: {mae:.6f}")

MAE between exact covar matrix and fast covar matrix: 0.000657

##### Computing posterior samples (KISS-GP only)¶

With KISS-GP models, LOVE can also be used to draw fast posterior samples. (The same does not apply to exact GP models.)

###### Drawing samples the standard way (without LOVE)¶

We now draw samples from the posterior distribution. Without LOVE, we accomplish this by computing a Cholesky decomposition of the posterior covariance matrix. This can be slow for large covariance matrices.
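The standard approach can be sketched in a few lines of NumPy: factor the covariance as K = L Lᵀ via Cholesky (an O(n³) step), then map i.i.d. standard normal draws through L. This is the textbook method, not GPyTorch's exact code path:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy posterior: zero mean, RBF covariance with a small jitter for stability
x = np.linspace(0, 1, 30)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3 ** 2) + 1e-6 * np.eye(30)

# O(n^3) Cholesky factorization, then samples = mean + L @ z with z ~ N(0, I)
L = np.linalg.cholesky(K)
z = rng.standard_normal((30, 50000))
samples = (L @ z).T               # 50000 draws from N(0, K)

empirical_covar = samples.T @ samples / samples.shape[0]
```

The empirical covariance of the draws converges to K, confirming the sampler is correct; the cost bottleneck is the Cholesky factorization, which is exactly what LOVE's cached root decomposition avoids.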

:

import time
num_samples = 20 if smoke_test else 20000

# Set into eval mode
model.eval()
likelihood.eval()

start_time = time.time()
exact_samples = model(test_x).rsample(torch.Size([num_samples]))
exact_sample_time = time.time() - start_time

print(f"Time to compute exact samples: {exact_sample_time:.2f}s")

Time to compute exact samples: 1.92s

###### Using LOVE¶

Next we compute posterior samples (and the predictive means) using LOVE. This requires the additional context manager with gpytorch.settings.fast_pred_samples():.

Note that we also need the with gpytorch.settings.fast_pred_var(): flag turned on. Both context managers respond to the gpytorch.settings.max_root_decomposition_size(100) setting.

:

# Clear the cache from the previous computations
model.train()
likelihood.train()

# Set into eval mode
model.eval()
likelihood.eval()

# NEW FLAG FOR SAMPLING
with gpytorch.settings.fast_pred_var(), gpytorch.settings.fast_pred_samples():
    start_time = time.time()
    _ = model(test_x).rsample(torch.Size([num_samples]))
    fast_sample_time_no_cache = time.time() - start_time

# Repeat the timing now that the cache is computed
with gpytorch.settings.fast_pred_var(), gpytorch.settings.fast_pred_samples():
    start_time = time.time()
    love_samples = model(test_x).rsample(torch.Size([num_samples]))
    fast_sample_time_cache = time.time() - start_time

print('Time to compute LOVE samples (no cache) {:.2f}s'.format(fast_sample_time_no_cache))
print('Time to compute LOVE samples (cache) {:.2f}s'.format(fast_sample_time_cache))

Time to compute LOVE samples (no cache) 0.74s
Time to compute LOVE samples (cache) 0.02s

###### Compute the empirical covariance matrices¶

Let’s see how well LOVE samples and exact samples recover the true covariance matrix.

:

# Compute exact posterior covar
start_time = time.time()
posterior = model(test_x)
mean, covar = posterior.mean, posterior.covariance_matrix

exact_empirical_covar = ((exact_samples - mean).t() @ (exact_samples - mean)) / num_samples
love_empirical_covar = ((love_samples - mean).t() @ (love_samples - mean)) / num_samples

exact_empirical_error = ((exact_empirical_covar - covar).abs()).mean()
love_empirical_error = ((love_empirical_covar - covar).abs()).mean()

print(f"Empirical covariance MAE (Exact samples): {exact_empirical_error}")
print(f"Empirical covariance MAE (LOVE samples): {love_empirical_error}")

Empirical covariance MAE (Exact samples): 0.0043566287495195866
Empirical covariance MAE (LOVE samples): 0.0061592841520905495





### Exact GPs with GPU Acceleration¶

Here are examples of Exact GPs using GPU acceleration.

#### GPyTorch Regression Tutorial (GPU)¶

(This notebook is the same as the simple GP regression tutorial notebook, but does all computations on a GPU for acceleration. Check out the multi-GPU tutorial if you have large datasets that need multiple GPUs!)

##### Introduction¶

In this notebook, we demonstrate many of the design features of GPyTorch using the simplest example, training an RBF kernel Gaussian process on a simple function. We’ll be modeling the function

\begin{align} y &= \sin(2\pi x) + \epsilon \\ \epsilon &\sim \mathcal{N}(0, 0.04) \end{align}

with 100 training examples, and testing on 51 test examples.

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline

###### Set up training data¶

In the next cell, we set up the training data for this example. We’ll be using 100 regularly spaced points on [0,1] which we evaluate the function on and add Gaussian noise to get the training labels.

:

# Training data is 100 points in [0,1] inclusive regularly spaced
train_x = torch.linspace(0, 1, 100)
# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2

###### Setting up the model¶

See the simple GP regression tutorial for a detailed explanation for all the terms.

:

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

##### Using the GPU¶

To do computations on the GPU, we need to put our data and model onto the GPU. (This requires PyTorch with CUDA).

:

train_x = train_x.cuda()
train_y = train_y.cuda()
model = model.cuda()
likelihood = likelihood.cuda()


That’s it! All the training code is the same as in the simple GP regression tutorial.

###### Training the model¶
:

# Find optimal model hyperparameters
model.train()
likelihood.train()

optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

training_iter = 50
for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()

Iter 1/50 - Loss: 0.944   lengthscale: 0.693   noise: 0.693
Iter 2/50 - Loss: 0.913   lengthscale: 0.644   noise: 0.644
Iter 3/50 - Loss: 0.879   lengthscale: 0.598   noise: 0.598
Iter 4/50 - Loss: 0.841   lengthscale: 0.555   noise: 0.554
Iter 5/50 - Loss: 0.798   lengthscale: 0.514   noise: 0.513
Iter 6/50 - Loss: 0.750   lengthscale: 0.475   noise: 0.474
Iter 7/50 - Loss: 0.698   lengthscale: 0.439   noise: 0.437
Iter 8/50 - Loss: 0.645   lengthscale: 0.405   noise: 0.402
Iter 9/50 - Loss: 0.595   lengthscale: 0.372   noise: 0.369
Iter 10/50 - Loss: 0.548   lengthscale: 0.342   noise: 0.339
Iter 11/50 - Loss: 0.507   lengthscale: 0.315   noise: 0.310
Iter 12/50 - Loss: 0.469   lengthscale: 0.292   noise: 0.284
Iter 13/50 - Loss: 0.432   lengthscale: 0.272   noise: 0.259
Iter 14/50 - Loss: 0.398   lengthscale: 0.255   noise: 0.236
Iter 15/50 - Loss: 0.363   lengthscale: 0.241   noise: 0.215
Iter 16/50 - Loss: 0.329   lengthscale: 0.230   noise: 0.196
Iter 17/50 - Loss: 0.296   lengthscale: 0.222   noise: 0.178
Iter 18/50 - Loss: 0.263   lengthscale: 0.215   noise: 0.162
Iter 19/50 - Loss: 0.230   lengthscale: 0.210   noise: 0.147
Iter 20/50 - Loss: 0.198   lengthscale: 0.207   noise: 0.134
Iter 21/50 - Loss: 0.167   lengthscale: 0.205   noise: 0.122
Iter 22/50 - Loss: 0.136   lengthscale: 0.205   noise: 0.110
Iter 23/50 - Loss: 0.107   lengthscale: 0.206   noise: 0.100
Iter 24/50 - Loss: 0.079   lengthscale: 0.208   noise: 0.091
Iter 25/50 - Loss: 0.053   lengthscale: 0.211   noise: 0.083
Iter 26/50 - Loss: 0.028   lengthscale: 0.215   noise: 0.076
Iter 27/50 - Loss: 0.006   lengthscale: 0.220   noise: 0.069
Iter 28/50 - Loss: -0.013   lengthscale: 0.225   noise: 0.063
Iter 29/50 - Loss: -0.029   lengthscale: 0.231   noise: 0.058
Iter 30/50 - Loss: -0.043   lengthscale: 0.237   noise: 0.053
Iter 31/50 - Loss: -0.053   lengthscale: 0.243   noise: 0.049
Iter 32/50 - Loss: -0.060   lengthscale: 0.249   noise: 0.045
Iter 33/50 - Loss: -0.065   lengthscale: 0.254   noise: 0.042
Iter 34/50 - Loss: -0.066   lengthscale: 0.259   noise: 0.039
Iter 35/50 - Loss: -0.066   lengthscale: 0.262   noise: 0.037
Iter 36/50 - Loss: -0.063   lengthscale: 0.265   noise: 0.035
Iter 37/50 - Loss: -0.060   lengthscale: 0.266   noise: 0.033
Iter 38/50 - Loss: -0.056   lengthscale: 0.266   noise: 0.032
Iter 39/50 - Loss: -0.052   lengthscale: 0.265   noise: 0.031
Iter 40/50 - Loss: -0.049   lengthscale: 0.262   noise: 0.030
Iter 41/50 - Loss: -0.047   lengthscale: 0.259   noise: 0.029
Iter 42/50 - Loss: -0.046   lengthscale: 0.254   noise: 0.029
Iter 43/50 - Loss: -0.046   lengthscale: 0.249   noise: 0.029
Iter 44/50 - Loss: -0.047   lengthscale: 0.243   noise: 0.029
Iter 45/50 - Loss: -0.049   lengthscale: 0.237   noise: 0.029
Iter 46/50 - Loss: -0.051   lengthscale: 0.231   noise: 0.029
Iter 47/50 - Loss: -0.054   lengthscale: 0.225   noise: 0.030
Iter 48/50 - Loss: -0.057   lengthscale: 0.219   noise: 0.030
Iter 49/50 - Loss: -0.059   lengthscale: 0.214   noise: 0.031
Iter 50/50 - Loss: -0.061   lengthscale: 0.210   noise: 0.032

###### Make predictions with the model¶

First, we need to make some test data, and then move it onto the GPU.

:

test_x = torch.linspace(0, 1, 51).cuda()


Now the rest of the code follows the simple GP regression tutorial.

:

# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

# Test points are regularly spaced along [0,1]
# Make predictions by feeding model through likelihood
observed_pred = likelihood(model(test_x))
mean = observed_pred.mean
lower, upper = observed_pred.confidence_region()


For plotting, we’re going to grab the data from the GPU and put it back on the CPU. We can accomplish this with the .cpu() method.

:

mean = mean.cpu()
lower = lower.cpu()
upper = upper.cpu()

train_x = train_x.cpu()
train_y = train_y.cpu()
test_x = test_x.cpu()

:

with torch.no_grad():
    # Initialize plot
    f, ax = plt.subplots(1, 1, figsize=(4, 3))

    # Plot training data as black stars
    ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
    # Plot predictive means as blue line
    ax.plot(test_x.numpy(), mean.numpy(), 'b')
    # Shade between the lower and upper confidence bounds
    ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
    ax.set_ylim([-3, 3])
    ax.legend(['Observed Data', 'Mean', 'Confidence'])

#### Exact GP Regression with Multiple GPUs and Kernel Partitioning¶

##### Introduction¶

In this notebook, we’ll demonstrate training exact GPs on large datasets using two key features from the paper https://arxiv.org/abs/1903.08114:

1. The ability to distribute the kernel matrix across multiple GPUs, for additional parallelism.
2. Partitioning the kernel into chunks computed on-the-fly when performing each MVM to reduce memory usage.

We’ll be using the protein dataset, which has about 37000 training examples. The techniques in this notebook can be applied to much larger datasets, but the training time required will depend on the computational resources you have available: both the number of GPUs available and the amount of memory they have (which determines the partition size) have a significant effect on training time.
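The row-partitioning idea behind point 2 can be sketched in a few lines of plain NumPy (an illustrative sketch only, not GPyTorch's actual implementation): each chunk of kernel rows is materialized on the fly during the matrix-vector multiply (MVM) and discarded afterwards, so peak memory is O(chunk × n) instead of O(n²).

```python
import numpy as np

def rbf_mvm_chunked(x, v, lengthscale=1.0, chunk=256):
    """Compute K @ v for an RBF kernel without storing the full n x n matrix.

    Each block of `chunk` rows of K is built on the fly, used, and freed,
    so peak memory is O(chunk * n) rather than O(n^2).
    """
    n = x.shape[0]
    out = np.empty(n)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # Pairwise squared distances for this block of rows only
        sq_dists = ((x[start:stop, None, :] - x[None, :, :]) ** 2).sum(-1)
        k_block = np.exp(-0.5 * sq_dists / lengthscale ** 2)
        out[start:stop] = k_block @ v
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))
v = rng.normal(size=1000)
chunked = rbf_mvm_chunked(x, v, chunk=128)
```

The result matches the dense computation to floating-point accuracy; GPyTorch's checkpoint_kernel beta feature applies this idea inside its conjugate-gradients routines, additionally checkpointing for the backward pass.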

:

import math
import torch
import gpytorch
import sys
from matplotlib import pyplot as plt
sys.path.append('../')
from LBFGS import FullBatchLBFGS

%matplotlib inline


We will be using the Protein UCI dataset, which contains a total of 40000+ data points. The next cell will download this dataset from Google Drive and load it.

:

import os
import urllib.request

dataset = 'protein'
if not os.path.isfile(f'../{dataset}.mat'):
    urllib.request.urlretrieve('...',  # download URL elided
                               f'../{dataset}.mat')


###### Normalization and train/test Splits¶

In the next cell, we split the data 80/20 as train and test, and do some basic z-score feature normalization.

:

import numpy as np
from scipy.io import loadmat

# Load the downloaded data (assumes the standard UCI .mat layout with a 'data' array)
data = torch.Tensor(loadmat(f'../{dataset}.mat')['data'])

N = data.shape[0]
# make train/test split
n_train = int(0.8 * N)
train_x, train_y = data[:n_train, :-1], data[:n_train, -1]
test_x, test_y = data[n_train:, :-1], data[n_train:, -1]

# normalize features
mean = train_x.mean(dim=-2, keepdim=True)
std = train_x.std(dim=-2, keepdim=True) + 1e-6 # prevent dividing by 0
train_x = (train_x - mean) / std
test_x = (test_x - mean) / std

# normalize labels
mean, std = train_y.mean(), train_y.std()
train_y = (train_y - mean) / std
test_y = (test_y - mean) / std

# make contiguous
train_x, train_y = train_x.contiguous(), train_y.contiguous()
test_x, test_y = test_x.contiguous(), test_y.contiguous()

output_device = torch.device('cuda:0')

train_x, train_y = train_x.to(output_device), train_y.to(output_device)
test_x, test_y = test_x.to(output_device), test_y.to(output_device)

##### How many GPUs do you want to use?¶

In the next cell, specify the n_devices variable to be the number of GPUs you’d like to use. By default, we will use all devices available to us.

:

n_devices = torch.cuda.device_count()
print('Planning to run on {} GPUs.'.format(n_devices))

Planning to run on 2 GPUs.


In the next cell we define our GP model and training code. For this notebook, the only thing different from the Simple GP tutorials is the use of the MultiDeviceKernel to wrap the base covariance module. This allows for the use of multiple GPUs behind the scenes.

:

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar_module, device_ids=range(n_devices),
            output_device=output_device
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

def train(train_x,
          train_y,
          n_devices,
          output_device,
          checkpoint_size,
          preconditioner_size,
          n_training_iter,
          ):
    likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
    model = ExactGPModel(train_x, train_y, likelihood, n_devices).to(output_device)
    model.train()
    likelihood.train()

    optimizer = FullBatchLBFGS(model.parameters(), lr=0.1)
    # "Loss" for GPs - the marginal log likelihood
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

    with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \
         gpytorch.settings.max_preconditioner_size(preconditioner_size):

        def closure():
            optimizer.zero_grad()
            output = model(train_x)
            loss = -mll(output, train_y)
            return loss

        loss = closure()
        loss.backward()

        for i in range(n_training_iter):
            options = {'closure': closure, 'current_loss': loss, 'max_ls': 10}
            loss, _, _, _, _, _, _, fail = optimizer.step(options)

            print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
                i + 1, n_training_iter, loss.item(),
                model.covar_module.module.base_kernel.lengthscale.item(),
                model.likelihood.noise.item()
            ))

            if fail:
                print('Convergence reached!')
                break

    print(f"Finished training on {train_x.size(0)} data points using {n_devices} GPUs.")
    return model, likelihood

##### Automatically determining GPU Settings¶

In the next cell, we automatically determine a roughly reasonable partition (checkpoint) size that will allow us to train without using more memory than the available GPUs have. Note that this is a coarse estimate of the largest possible checkpoint size, and it may be off by as much as a factor of 2. A smarter search here could yield up to a 2x performance improvement.

:

import gc

def find_best_gpu_setting(train_x,
                          train_y,
                          n_devices,
                          output_device,
                          preconditioner_size
                          ):
    N = train_x.size(0)

    # Find the optimum partition/checkpoint size by decreasing in powers of 2
    # (a checkpoint size of 0 means no partitioning)
    settings = [0] + [int(n) for n in np.ceil(N / 2**np.arange(1, np.floor(np.log2(N))))]

    for checkpoint_size in settings:
        print('Number of devices: {} -- Kernel partition size: {}'.format(n_devices, checkpoint_size))
        try:
            # Try a full forward and backward pass with this setting to check memory usage
            _, _ = train(train_x, train_y,
                         n_devices=n_devices, output_device=output_device,
                         checkpoint_size=checkpoint_size,
                         preconditioner_size=preconditioner_size, n_training_iter=1)

            # when successful, break out of for-loop and jump to finally block
            break
        except RuntimeError as e:
            print('RuntimeError: {}'.format(e))
        except AttributeError as e:
            print('AttributeError: {}'.format(e))
        finally:
            # handle CUDA OOM error
            gc.collect()
            torch.cuda.empty_cache()
    return checkpoint_size

# Set a large enough preconditioner size to reduce the number of CG iterations run
preconditioner_size = 100
checkpoint_size = find_best_gpu_setting(train_x, train_y,
                                        n_devices=n_devices,
                                        output_device=output_device,
                                        preconditioner_size=preconditioner_size)

Number of devices: 2 -- Kernel partition size: 0
RuntimeError: CUDA out of memory. Tried to allocate 2.49 GiB (GPU 1; 10.73 GiB total capacity; 7.48 GiB already allocated; 2.46 GiB free; 21.49 MiB cached)
Number of devices: 2 -- Kernel partition size: 18292
RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 10.73 GiB total capacity; 6.37 GiB already allocated; 448.94 MiB free; 1.30 GiB cached)
Number of devices: 2 -- Kernel partition size: 9146
Iter 1/1 - Loss: 0.893   lengthscale: 0.486   noise: 0.248
Finished training on 36584 data points using 2 GPUs.

###### Training¶
:

model, likelihood = train(train_x, train_y,
                          n_devices=n_devices, output_device=output_device,
                          checkpoint_size=10000,
                          preconditioner_size=100,
                          n_training_iter=20)

Iter 1/20 - Loss: 0.897   lengthscale: 0.486   noise: 0.253
Iter 2/20 - Loss: 0.892   lengthscale: 0.354   noise: 0.144
Iter 3/20 - Loss: 0.859   lengthscale: 0.305   noise: 0.125
Iter 4/20 - Loss: 0.859   lengthscale: 0.292   noise: 0.116
Iter 5/20 - Loss: 0.862   lengthscale: 0.292   noise: 0.116
Convergence reached!
Finished training on 36584 data points using 2 GPUs.

##### Computing test time caches¶
:

# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

# Make predictions on a small number of test points to get the test time caches computed
latent_pred = model(test_x[:2, :])
del latent_pred  # We don't care about these predictions, we really just want the caches.

###### Testing: Computing predictions¶
:

with torch.no_grad(), gpytorch.settings.fast_pred_var(), gpytorch.beta_features.checkpoint_kernel(1000):
    %time latent_pred = model(test_x)

test_rmse = torch.sqrt(torch.mean(torch.pow(latent_pred.mean - test_y, 2)))
print(f"Test RMSE: {test_rmse.item()}")

CPU times: user 1.1 s, sys: 728 ms, total: 1.83 s
Wall time: 1.88 s
Test RMSE: 0.551821768283844


#### GPyTorch Regression With KeOps¶

##### Introduction¶

KeOps is a recently released software package for fast kernel operations that integrates with PyTorch. We can use the ability of KeOps to perform efficient kernel matrix multiplies on the GPU to integrate with the rest of GPyTorch.

In this tutorial, we’ll demonstrate how to integrate the kernel matmuls of KeOps with all of the bells and whistles of GPyTorch, including things like our preconditioning for conjugate gradients.

In this notebook, we will train an exact GP on 3droad, which has hundreds of thousands of data points. Together, the highly optimized matmuls of KeOps combined with algorithmic speed improvements like preconditioning allow us to train on a dataset like this in a matter of minutes using only a single GPU.

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline


We will be using the 3droad UCI dataset which contains a total of 278,319 data points. The next cell will download this dataset from a Google drive and load it.

:

import urllib.request
import os.path
from math import floor


Downloading '3droad' UCI dataset...

:

import numpy as np
from scipy.io import loadmat

# Load the downloaded data (assumes the standard UCI .mat layout with a 'data' array)
data = torch.Tensor(loadmat('../3droad.mat')['data'])

N = data.shape[0]
# make train/test split
n_train = int(0.8 * N)
train_x, train_y = data[:n_train, :-1], data[:n_train, -1]
test_x, test_y = data[n_train:, :-1], data[n_train:, -1]

# normalize features
mean = train_x.mean(dim=-2, keepdim=True)
std = train_x.std(dim=-2, keepdim=True) + 1e-6 # prevent dividing by 0
train_x = (train_x - mean) / std
test_x = (test_x - mean) / std

# normalize labels
mean, std = train_y.mean(), train_y.std()
train_y = (train_y - mean) / std
test_y = (test_y - mean) / std

# make contiguous
train_x, train_y = train_x.contiguous(), train_y.contiguous()
test_x, test_y = test_x.contiguous(), test_y.contiguous()

output_device = torch.device('cuda:0')

train_x, train_y = train_x.to(output_device), train_y.to(output_device)
test_x, test_y = test_x.to(output_device), test_y.to(output_device)

##### Using KeOps with a GPyTorch Model¶

Using KeOps with one of our pre-built kernels is as straightforward as swapping the kernel out. For example, in the cell below, we copy the simple GP from our basic tutorial notebook, and swap out gpytorch.kernels.MaternKernel for gpytorch.kernels.keops.MaternKernel.

:

# We will use the simplest form of GP model, exact inference
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()

        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.keops.MaternKernel(nu=2.5))

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood().cuda()
model = ExactGPModel(train_x, train_y, likelihood).cuda()

[ ]:

# Find optimal model hyperparameters
model.train()
likelihood.train()

optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

import time
training_iter = 50
for i in range(training_iter):
    start_time = time.time()
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()
    print(time.time() - start_time)

:

# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

# Test points are regularly spaced along [0,1]
# Make predictions by feeding model through likelihood
observed_pred = likelihood(model(test_x))

Compiling libKeOpstorchd7ba409487 in /home/jake.gardner/.cache/pykeops-1.1.1//build-libKeOpstorchd7ba409487:
formula: Sum_Reduction(((((Var(0,1,2) * Sqrt(Sum(Square((Var(1,18,0) - Var(2,18,1)))))) + (IntCst(1) + (Var(3,1,2) * Square(Sqrt(Sum(Square((Var(1,18,0) - Var(2,18,1))))))))) * Exp((Var(4,1,2) * Sqrt(Sum(Square((Var(1,18,0) - Var(2,18,1)))))))) * Var(5,3320,1)),0)
aliases: Var(0,1,2); Var(1,18,0); Var(2,18,1); Var(3,1,2); Var(4,1,2); Var(5,3320,1);
dtype  : float32
... Done.
Compiling libKeOpstorch7385e76d34 in /home/jake.gardner/.cache/pykeops-1.1.1//build-libKeOpstorch7385e76d34:
formula: Sum_Reduction(((((Var(0,1,2) * Sqrt(Sum(Square((Var(1,18,0) - Var(2,18,1)))))) + (IntCst(1) + (Var(3,1,2) * Square(Sqrt(Sum(Square((Var(1,18,0) - Var(2,18,1))))))))) * Exp((Var(4,1,2) * Sqrt(Sum(Square((Var(1,18,0) - Var(2,18,1)))))))) * Var(5,1,1)),0)
aliases: Var(0,1,2); Var(1,18,0); Var(2,18,1); Var(3,1,2); Var(4,1,2); Var(5,1,1);
dtype  : float32
... Done.
Compiling libKeOpstorch97105370ea in /home/jake.gardner/.cache/pykeops-1.1.1//build-libKeOpstorch97105370ea:
formula: Sum_Reduction(((((Var(0,1,2) * Sqrt(Sum(Square((Var(1,18,0) - Var(2,18,1)))))) + (IntCst(1) + (Var(3,1,2) * Square(Sqrt(Sum(Square((Var(1,18,0) - Var(2,18,1))))))))) * Exp((Var(4,1,2) * Sqrt(Sum(Square((Var(1,18,0) - Var(2,18,1)))))))) * Var(5,100,1)),0)
aliases: Var(0,1,2); Var(1,18,0); Var(2,18,1); Var(3,1,2); Var(4,1,2); Var(5,100,1);
dtype  : float32
... Done.

###### Compute RMSE¶
:

torch.sqrt(torch.mean(torch.pow(observed_pred.mean - test_y, 2)))

:

tensor(0.1068, device='cuda:0')


### Scalable Posterior Sampling with CIQ¶

Here we provide a notebook demonstrating the use of Contour Integral Quadrature with msMINRES as described in the CIQ paper. For the most dramatic results, we recommend combining this technique with other techniques in this section like kernel checkpointing with KeOps, which would allow for posterior sampling on up to hundreds of thousands of test examples.

#### Scalable Exact GP Posterior Sampling using Contour Integral Quadrature¶

This notebook demonstrates the most simple usage of contour integral quadrature with msMINRES as described here to sample from the predictive distribution of an exact GP.

Note that to achieve results where Cholesky would run the GPU out of memory, you’ll either need to have KeOps installed (see our KeOps tutorial in this same folder), or use the checkpoint_kernel beta feature. Despite this, on this relatively simple example with 1000 training points but sampling at tens of thousands of test points in 1D, we will achieve significant speedups over Cholesky.

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

import warnings
warnings.simplefilter("ignore", gpytorch.utils.warnings.NumericalWarning)

%matplotlib inline

:

# Training data is 1000 points in [0,1] inclusive regularly spaced
train_x = torch.linspace(0, 1, 1000)
# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2

##### Are we running with KeOps?¶

If you have KeOps, change the below flag to True to run with a significantly larger test set.

:

HAVE_KEOPS = False

##### Define an Exact GP Model and train¶
:

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()

        if HAVE_KEOPS:
            self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.keops.RBFKernel())
        else:
            self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

:

train_x = train_x.cuda()
train_y = train_y.cuda()
model = model.cuda()
likelihood = likelihood.cuda()

:

# Find optimal model hyperparameters
model.train()
likelihood.train()

optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

training_iter = 50
for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()

Iter 1/50 - Loss: 0.860   lengthscale: 0.693   noise: 0.693
Iter 2/50 - Loss: 0.816   lengthscale: 0.644   noise: 0.644
Iter 3/50 - Loss: 0.766   lengthscale: 0.598   noise: 0.598
Iter 4/50 - Loss: 0.718   lengthscale: 0.554   noise: 0.554
Iter 5/50 - Loss: 0.669   lengthscale: 0.513   noise: 0.513
Iter 6/50 - Loss: 0.624   lengthscale: 0.474   noise: 0.474
Iter 7/50 - Loss: 0.583   lengthscale: 0.439   noise: 0.437
Iter 8/50 - Loss: 0.534   lengthscale: 0.408   noise: 0.402
Iter 9/50 - Loss: 0.493   lengthscale: 0.381   noise: 0.370
Iter 10/50 - Loss: 0.459   lengthscale: 0.357   noise: 0.339
Iter 11/50 - Loss: 0.411   lengthscale: 0.336   noise: 0.311
Iter 12/50 - Loss: 0.383   lengthscale: 0.318   noise: 0.285
Iter 13/50 - Loss: 0.338   lengthscale: 0.303   noise: 0.261
Iter 14/50 - Loss: 0.308   lengthscale: 0.290   noise: 0.238
Iter 15/50 - Loss: 0.266   lengthscale: 0.278   noise: 0.217
Iter 16/50 - Loss: 0.228   lengthscale: 0.268   noise: 0.198
Iter 17/50 - Loss: 0.195   lengthscale: 0.260   noise: 0.181
Iter 18/50 - Loss: 0.153   lengthscale: 0.252   noise: 0.165
Iter 19/50 - Loss: 0.120   lengthscale: 0.246   noise: 0.150
Iter 20/50 - Loss: 0.089   lengthscale: 0.241   noise: 0.137
Iter 21/50 - Loss: 0.055   lengthscale: 0.236   noise: 0.124
Iter 22/50 - Loss: 0.034   lengthscale: 0.233   noise: 0.113
Iter 23/50 - Loss: -0.011   lengthscale: 0.230   noise: 0.103
Iter 24/50 - Loss: -0.030   lengthscale: 0.228   noise: 0.094
Iter 25/50 - Loss: -0.052   lengthscale: 0.226   noise: 0.086
Iter 26/50 - Loss: -0.081   lengthscale: 0.225   noise: 0.078
Iter 27/50 - Loss: -0.107   lengthscale: 0.225   noise: 0.072
Iter 28/50 - Loss: -0.120   lengthscale: 0.224   noise: 0.066
Iter 29/50 - Loss: -0.137   lengthscale: 0.224   noise: 0.060
Iter 30/50 - Loss: -0.146   lengthscale: 0.224   noise: 0.055
Iter 31/50 - Loss: -0.157   lengthscale: 0.225   noise: 0.051
Iter 32/50 - Loss: -0.168   lengthscale: 0.225   noise: 0.047
Iter 33/50 - Loss: -0.157   lengthscale: 0.226   noise: 0.044
Iter 34/50 - Loss: -0.164   lengthscale: 0.227   noise: 0.041
Iter 35/50 - Loss: -0.163   lengthscale: 0.229   noise: 0.039
Iter 36/50 - Loss: -0.154   lengthscale: 0.231   noise: 0.037
Iter 37/50 - Loss: -0.170   lengthscale: 0.233   noise: 0.035
Iter 38/50 - Loss: -0.162   lengthscale: 0.235   noise: 0.033
Iter 39/50 - Loss: -0.158   lengthscale: 0.238   noise: 0.032
Iter 40/50 - Loss: -0.150   lengthscale: 0.241   noise: 0.031
Iter 41/50 - Loss: -0.146   lengthscale: 0.245   noise: 0.031
Iter 42/50 - Loss: -0.145   lengthscale: 0.248   noise: 0.030
Iter 43/50 - Loss: -0.146   lengthscale: 0.251   noise: 0.030
Iter 44/50 - Loss: -0.150   lengthscale: 0.255   noise: 0.030
Iter 45/50 - Loss: -0.151   lengthscale: 0.259   noise: 0.030
Iter 46/50 - Loss: -0.152   lengthscale: 0.262   noise: 0.030
Iter 47/50 - Loss: -0.143   lengthscale: 0.265   noise: 0.031
Iter 48/50 - Loss: -0.148   lengthscale: 0.269   noise: 0.032
Iter 49/50 - Loss: -0.164   lengthscale: 0.272   noise: 0.032
Iter 50/50 - Loss: -0.168   lengthscale: 0.275   noise: 0.033

##### Define test set¶

If we have KeOps installed, we’ll test on 50,000 points instead of 17,000.

:

if HAVE_KEOPS:
    test_n = 50000
else:
    test_n = 17000

test_x = torch.linspace(0, 1, test_n).cuda()
print(test_x.shape)

torch.Size([17000])

##### Draw a sample with CIQ¶

To do this, we just add the ciq_samples setting to the rsample call. We additionally demonstrate all relevant settings for controlling Contour Integral Quadrature:

• The ciq_samples setting determines whether or not to use CIQ
• The num_contour_quadrature setting controls the number of quadrature sites (Q in the paper).
• The minres_tolerance setting controls the error we tolerate from minres (here, <0.01%).

Note that, of these settings, increasing num_contour_quadrature is unlikely to improve performance. As Theorem 1 from the paper demonstrates, virtually all of the error in this method is controlled by minres_tolerance. Here, we use a quite tight tolerance for minres.

:

import time

model.train()
likelihood.train()

# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

# Test points are regularly spaced along [0,1]
# Make predictions by feeding model through likelihood

observed_pred = likelihood(model(test_x))

# All relevant settings for using CIQ.
#   ciq_samples(True) -- Use CIQ for sampling
#   minres_tolerance -- error tolerance from minres (here, <0.01%).
print("Running with CIQ")
with gpytorch.settings.ciq_samples(True), gpytorch.settings.minres_tolerance(1e-4):
    %time y_samples = observed_pred.rsample()

print("Running with Cholesky")
# Make sure we use Cholesky
with gpytorch.settings.fast_computations(covar_root_decomposition=False):
    %time y_samples = observed_pred.rsample()

Wall time: 798 ms
Wall time: 7.97 s



### Scalable Kernel Approximations¶

While exact computations are our preferred approach, GPyTorch offers approximate kernels to reduce the asymptotic complexity of inference.

#### Sparse Gaussian Process Regression (SGPR)¶

##### Overview¶

In this notebook, we’ll give an overview of how to use SGPR, in which the inducing point locations are learned.

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

# Make plots inline
%matplotlib inline


For this example notebook, we’ll be using the elevators UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. For this notebook, we’ll simply be splitting the data using the first 80% of the data as training and the last 20% as testing.

Note: Running the next cell will attempt to download a ~400 KB dataset file to the current directory.

:

import urllib.request
import os
from math import floor
from scipy.io import loadmat

# this is for running the notebook in our testing framework
smoke_test = ('CI' in os.environ)

if not smoke_test and not os.path.isfile('../elevators.mat'):
    urllib.request.urlretrieve('...',  # download URL elided
                               '../elevators.mat')

if smoke_test:
    X, y = torch.randn(1000, 3), torch.randn(1000)
else:
    # Load the downloaded data (assumes the standard UCI .mat layout with a 'data' array)
    data = torch.Tensor(loadmat('../elevators.mat')['data'])
    X = data[:, :-1]
    X = X - X.min(0)[0]
    X = 2 * (X / X.max(0)[0]) - 1
    y = data[:, -1]

train_n = int(floor(0.8 * len(X)))
train_x = X[:train_n, :].contiguous()
train_y = y[:train_n].contiguous()

test_x = X[train_n:, :].contiguous()
test_y = y[train_n:].contiguous()

if torch.cuda.is_available():
    train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()

:

X.size()

:

torch.Size([16599, 18])

##### Defining the SGPR Model¶

We now define the GP model. For more details on the use of GP models, see our simpler examples. This model constructs a base scaled RBF kernel, and then simply wraps it in an InducingPointKernel. Other than this, everything should look the same as in the simple GP models.
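To see the approximation that inducing points buy us, here is a small self-contained NumPy sketch (illustrative only, not GPyTorch's implementation; SGPR additionally learns the inducing locations and handles the noise term): the n x n kernel matrix is replaced by the low-rank Nyström form K_nm K_mm^-1 K_nm^T built from m << n inducing points.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    """Dense RBF kernel matrix between two sets of points."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

rng = np.random.default_rng(0)
n, m = 1000, 50
x = rng.random((n, 2))
z = x[:m]                               # inducing points (here: a subset of the data)

K_nm = rbf(x, z)                        # n x m cross-covariance
K_mm = rbf(z, z) + 1e-6 * np.eye(m)     # m x m with jitter for numerical stability

# Nystrom approximation to the full n x n kernel matrix
K_nystrom = K_nm @ np.linalg.solve(K_mm, K_nm.T)
```

Storing K_nm and K_mm costs O(nm + m²) memory instead of O(n²), and solves against the approximate kernel can be done in O(nm²) time via the Woodbury identity, which is the source of SGPR's speedup.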

:

from gpytorch.means import ConstantMean
from gpytorch.kernels import ScaleKernel, RBFKernel, InducingPointKernel
from gpytorch.distributions import MultivariateNormal

class GPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = ConstantMean()
        self.base_covar_module = ScaleKernel(RBFKernel())
        self.covar_module = InducingPointKernel(self.base_covar_module, inducing_points=train_x[:500, :], likelihood=likelihood)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)

:

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPRegressionModel(train_x, train_y, likelihood)

if torch.cuda.is_available():
    model = model.cuda()
    likelihood = likelihood.cuda()

###### Training the model¶
:

training_iterations = 2 if smoke_test else 50

# Find optimal model hyperparameters
model.train()
likelihood.train()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

def train():
    for i in range(training_iterations):
        # Zero gradients from previous iteration
        optimizer.zero_grad()
        # Get output from model
        output = model(train_x)
        # Calc loss and backprop derivatives
        loss = -mll(output, train_y)
        loss.backward()
        print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
        optimizer.step()
        torch.cuda.empty_cache()

# See dkl_mnist.ipynb for explanation of this flag
%time train()

Iter 1/50 - Loss: 0.794
Iter 2/50 - Loss: 0.782
Iter 3/50 - Loss: 0.770
Iter 4/50 - Loss: 0.758
Iter 5/50 - Loss: 0.746
Iter 6/50 - Loss: 0.734
Iter 7/50 - Loss: 0.721
Iter 8/50 - Loss: 0.708
Iter 9/50 - Loss: 0.695
Iter 10/50 - Loss: 0.681
Iter 11/50 - Loss: 0.667
Iter 12/50 - Loss: 0.654
Iter 13/50 - Loss: 0.641
Iter 14/50 - Loss: 0.626
Iter 15/50 - Loss: 0.613
Iter 16/50 - Loss: 0.598
Iter 17/50 - Loss: 0.584
Iter 18/50 - Loss: 0.571
Iter 19/50 - Loss: 0.555
Iter 20/50 - Loss: 0.541
Iter 21/50 - Loss: 0.526
Iter 22/50 - Loss: 0.510
Iter 23/50 - Loss: 0.495
Iter 24/50 - Loss: 0.481
Iter 25/50 - Loss: 0.465
Iter 26/50 - Loss: 0.449
Iter 27/50 - Loss: 0.435
Iter 28/50 - Loss: 0.417
Iter 29/50 - Loss: 0.401
Iter 30/50 - Loss: 0.384
Iter 31/50 - Loss: 0.369
Iter 32/50 - Loss: 0.351
Iter 33/50 - Loss: 0.336
Iter 34/50 - Loss: 0.319
Iter 35/50 - Loss: 0.303
Iter 36/50 - Loss: 0.286
Iter 37/50 - Loss: 0.269
Iter 38/50 - Loss: 0.253
Iter 39/50 - Loss: 0.236
Iter 40/50 - Loss: 0.217
Iter 41/50 - Loss: 0.200
Iter 42/50 - Loss: 0.181
Iter 43/50 - Loss: 0.167
Iter 44/50 - Loss: 0.149
Iter 45/50 - Loss: 0.132
Iter 46/50 - Loss: 0.112
Iter 47/50 - Loss: 0.096
Iter 48/50 - Loss: 0.078
Iter 49/50 - Loss: 0.061
Iter 50/50 - Loss: 0.044
CPU times: user 2min 47s, sys: 7.87 s, total: 2min 55s
Wall time: 34.6 s

###### Making Predictions¶

The next cell makes predictions with SGPR. We use the same max_root_decomposition_size, and we also demonstrate increasing the max preconditioner size. Increasing the preconditioner size on this dataset is not necessary, but it can make a big difference in final test performance, and is often preferable to increasing the number of CG iterations if you can afford the space.
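To see why a better preconditioner can substitute for extra CG iterations, here is a self-contained NumPy sketch (illustrative only; GPyTorch actually uses a partial pivoted-Cholesky preconditioner whose rank is set by max_preconditioner_size) comparing plain conjugate gradients with Jacobi-preconditioned CG on a badly conditioned diagonal system.

```python
import numpy as np

def pcg(A, b, precond=None, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradients; `precond` applies M^{-1} to a vector.

    Returns the solution and the number of iterations used.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r) if precond else r
    p = z.copy()
    rz = r @ z
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k
        z = precond(r) if precond else r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Ill-conditioned SPD system: eigenvalues spread over four orders of magnitude
diag = np.linspace(1.0, 1e4, 200)
A = np.diag(diag)
b = np.ones(200)

_, iters_plain = pcg(A, b)
_, iters_jacobi = pcg(A, b, precond=lambda r: r / diag)
```

On this system the Jacobi preconditioner is exact, so the preconditioned solve converges almost immediately while plain CG needs many iterations; a larger preconditioner in GPyTorch plays the analogous role for kernel matrices.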

:

model.eval()
likelihood.eval()
with gpytorch.settings.max_root_decomposition_size(30), gpytorch.settings.fast_pred_var():
    preds = model(test_x)

:

print('Test MAE: {}'.format(torch.mean(torch.abs(preds.mean - test_y))))

Test MAE: 0.07271435856819153



#### Structured Kernel Interpolation (SKI/KISS-GP)¶

##### Overview¶

SKI (or KISS-GP) is a great way to scale a GP up to very large datasets (100,000+ data points). Kernel interpolation for scalable structured Gaussian processes (KISS-GP) was introduced in this paper: http://proceedings.mlr.press/v37/wilson15.pdf

SKI is asymptotically very fast (nearly linear), very precise (error decays cubically), and easy to use in GPyTorch! As you will see in this tutorial, it’s really easy to apply SKI to an existing model. All you have to do is wrap your kernel module with a GridInterpolationKernel.
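Before diving into the GPyTorch model, it may help to see the core SKI idea in a few lines of self-contained NumPy (an illustrative sketch, not GPyTorch's implementation): the exact kernel matrix K(x, x) is approximated by W K(grid, grid) W^T, where each row of W holds interpolation weights mapping a data point onto its neighboring grid points. Here we use simple linear interpolation; GPyTorch uses higher-order interpolation for better accuracy.

```python
import numpy as np

def rbf(a, b, lengthscale=0.2):
    """Dense 1D RBF kernel matrix."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

rng = np.random.default_rng(0)
n, m = 500, 100
x = np.sort(rng.random(n))            # 1D inputs in [0, 1]
grid = np.linspace(0.0, 1.0, m)       # regular interpolation grid

# Interpolation matrix W: each row has (at most) two nonzeros, the linear
# interpolation weights of x[i] onto its left/right grid neighbors.
left = np.minimum((x * (m - 1)).astype(int), m - 2)
frac = x * (m - 1) - left
W = np.zeros((n, m))
W[np.arange(n), left] = 1.0 - frac
W[np.arange(n), left + 1] = frac

K_exact = rbf(x, x)
K_ski = W @ rbf(grid, grid) @ W.T     # SKI approximation
max_err = np.abs(K_ski - K_exact).max()
```

Because W is sparse and K(grid, grid) is Toeplitz on a regular grid, MVMs with K_ski cost roughly O(n + m log m) rather than O(n²), which is where the near-linear scaling comes from.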

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

# Make plots inline
%matplotlib inline

##### KISS-GP for 1D Data¶
###### Set up training data¶

We’ll learn a simple sinusoid, but with lots of training data points. At 1000 points, this is where scalable methods start to become useful.

:

train_x = torch.linspace(0, 1, 1000)
train_y = torch.sin(train_x * (4 * math.pi)) + torch.randn(train_x.size()) * 0.2

###### Set up the model¶

The model should be somewhat similar to the ExactGP model in the simple regression example.

The only difference: we’re wrapping our kernel module in a GridInterpolationKernel. This signals to GPyTorch that you want to approximate this kernel matrix with SKI.

SKI has only one hyperparameter that you need to worry about: the grid size. For 1D functions, a good starting place is to use as many grid points as training points. (Don’t worry - the grid points are really cheap to use!). You can use the gpytorch.utils.grid.choose_grid_size helper to get a good starting point.

If you want, you can also explicitly determine the grid bounds of the SKI approximation using the grid_bounds argument. However, it’s easier if you don’t use this argument - then GPyTorch automatically chooses the best bounds for you.

:

class GPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)

        # SKI requires a grid size hyperparameter. This util can help with that.
        # Here we use a grid with the same number of points as the training data
        # (a ratio of 1.0). Performance can be sensitive to this parameter, so
        # you may want to adjust it for your own problem on a validation set.
        grid_size = gpytorch.utils.grid.choose_grid_size(train_x, 1.0)

        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.GridInterpolationKernel(
                gpytorch.kernels.RBFKernel(), grid_size=grid_size, num_dims=1
            )
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPRegressionModel(train_x, train_y, likelihood)

###### Train the model hyperparameters¶

Even with 1000 points, this model still trains fast! SKI scales (essentially) linearly with the number of data points, whereas standard GP inference in GPyTorch scales quadratically.
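As a rough back-of-the-envelope comparison (a sketch with constants glossed over, not a measured benchmark): one matrix-vector multiply with a dense n x n kernel matrix touches n^2 entries, while a SKI matrix-vector multiply costs roughly n plus g log g work for g grid points.

```python
import math

n = 1000  # number of training points
g = 1000  # number of SKI grid points

# Rough operation counts (constants ignored) for one kernel
# matrix-vector multiply:
exact_mvm = n * n                    # dense kernel matrix has n^2 entries
ski_mvm = n + int(g * math.log2(g))  # sparse interpolation + structured grid work

print(exact_mvm, ski_mvm)
```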

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iterations = 1 if smoke_test else 30

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()

###### Make predictions¶

SKI is especially well-suited for predictions. It can compute predictive means in constant time, and with LOVE enabled (see this notebook), predictive variances are also constant time.

:

# Put model & likelihood into eval mode
model.eval()
likelihood.eval()

# Initialize plot
f, ax = plt.subplots(1, 1, figsize=(4, 3))

# The gpytorch.settings.fast_pred_var flag activates LOVE (for fast variances)
# See https://arxiv.org/abs/1803.06058
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    test_x = torch.linspace(0, 1, 51)
    prediction = likelihood(model(test_x))
    mean = prediction.mean
    # Get lower and upper predictive bounds
    lower, upper = prediction.confidence_region()

# Define a plotting function
def ax_plot():
    if smoke_test: return  # this is for running the notebook in our testing framework

    # Plot the training data as black stars
    ax.plot(train_x.detach().numpy(), train_y.detach().numpy(), 'k*')
    # Plot predictive means as blue line
    ax.plot(test_x.detach().numpy(), mean.detach().numpy(), 'b')
    # Plot confidence bounds as lightly shaded region
    ax.fill_between(test_x.detach().numpy(), lower.detach().numpy(), upper.detach().numpy(), alpha=0.5)
    ax.set_ylim([-3, 3])
    ax.legend(['Observed Data', 'Mean', 'Confidence'])

ax_plot()

##### KISS-GP for 2D-4D Data¶

For 2-4D functions, SKI (or KISS-GP) can work very well out-of-the-box on larger datasets (100,000+ data points). Kernel interpolation for scalable structured Gaussian processes (KISS-GP) was introduced in this paper: http://proceedings.mlr.press/v37/wilson15.pdf

One thing to watch out for with multidimensional SKI - you can't use as fine-grained a grid. If you have a high dimensional problem, you may want to try one of the other scalable regression methods instead.
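The reason the grid must be coarser in higher dimensions falls out of a quick calculation: a full (Kronecker) grid with g points per dimension contains g^d points in d dimensions, so a grid that is trivial in 1D explodes by d = 4.

```python
g = 100  # grid points per dimension
for d in [1, 2, 3, 4]:
    # total grid points grows exponentially in the dimension
    print('d = %d: %d total grid points' % (d, g ** d))
```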

###### Set up train data¶

Here we’re learning a simple sin function - but in 2 dimensions

:

# We make an nxn grid of training points spaced every 1/(n-1) on [0,1]x[0,1]
n = 40
train_x = torch.zeros(pow(n, 2), 2)
for i in range(n):
    for j in range(n):
        train_x[i * n + j][0] = float(i) / (n-1)
        train_x[i * n + j][1] = float(j) / (n-1)
# True function is sin( 2*pi*(x0+x1))
train_y = torch.sin((train_x[:, 0] + train_x[:, 1]) * (2 * math.pi)) + torch.randn_like(train_x[:, 0]).mul(0.01)

###### The model¶

As with the 1D case, applying SKI to a multidimensional kernel is as simple as wrapping that kernel with a GridInterpolationKernel. You’ll want to be sure to set num_dims though!

SKI has only one hyperparameter that you need to worry about: the grid size. In the multidimensional setting, this is the number of grid points per dimension. (Don't worry - the grid points are really cheap to use!). You can use the gpytorch.utils.grid.choose_grid_size helper to get a good starting point.

If you want, you can also explicitly determine the grid bounds of the SKI approximation using the grid_bounds argument. However, it’s easier if you don’t use this argument - then GPyTorch automatically chooses the best bounds for you.

:

class GPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)

        # SKI requires a grid size hyperparameter. This util can help with that
        grid_size = gpytorch.utils.grid.choose_grid_size(train_x)

        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.GridInterpolationKernel(
                gpytorch.kernels.RBFKernel(), grid_size=grid_size, num_dims=2
            )
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPRegressionModel(train_x, train_y, likelihood)

###### Train the model hyperparameters¶
:

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

def train():
    for i in range(training_iterations):
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        loss.backward()
        optimizer.step()

%time train()

CPU times: user 57.8 s, sys: 847 ms, total: 58.7 s
Wall time: 10.5 s

###### Make predictions with the model¶
:

# Set model and likelihood into evaluation mode
model.eval()
likelihood.eval()

# Generate nxn grid of test points spaced on a grid of size 1/(n-1) in [0,1]x[0,1]
n = 10
test_x = torch.zeros(int(pow(n, 2)), 2)
for i in range(n):
    for j in range(n):
        test_x[i * n + j][0] = float(i) / (n-1)
        test_x[i * n + j][1] = float(j) / (n-1)

observed_pred = likelihood(model(test_x))
pred_labels = observed_pred.mean.view(n, n)

# Calc absolute error
test_y_actual = torch.sin(((test_x[:, 0] + test_x[:, 1]) * (2 * math.pi))).view(n, n)
delta_y = torch.abs(pred_labels - test_y_actual).detach().numpy()

# Define a plotting function
def ax_plot(f, ax, y_labels, title):
    if smoke_test: return  # this is for running the notebook in our testing framework
    im = ax.imshow(y_labels)
    ax.set_title(title)
    f.colorbar(im)

# Plot our predictive means
f, observed_ax = plt.subplots(1, 1, figsize=(4, 3))
ax_plot(f, observed_ax, pred_labels, 'Predicted Values (Likelihood)')

# Plot the true values
f, observed_ax2 = plt.subplots(1, 1, figsize=(4, 3))
ax_plot(f, observed_ax2, test_y_actual, 'Actual Values (Likelihood)')

# Plot the absolute errors
f, observed_ax3 = plt.subplots(1, 1, figsize=(4, 3))
ax_plot(f, observed_ax3, delta_y, 'Absolute Error Surface')

##### KISS-GP for higher dimensional data w/ Additive Structure¶

The method above won't scale to data with much more than ~4 dimensions, since the cost of creating the grid grows exponentially in the number of dimensions. Therefore, we'll need to make some additional approximations.

If the function you are modeling has additive structure across its dimensions, then SKI can be one of the most efficient methods for your problem.

To set this up, we’ll wrap the GridInterpolationKernel used in the previous two models with one additional kernel: the AdditiveStructureKernel. The model will look something like this:

:

class GPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)

        # SKI requires a grid size hyperparameter. This util can help with that
        # We're setting Kronecker structure to False because we're using an additive structure decomposition
        grid_size = gpytorch.utils.grid.choose_grid_size(train_x, kronecker_structure=False)

        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.AdditiveStructureKernel(
            gpytorch.kernels.ScaleKernel(
                gpytorch.kernels.GridInterpolationKernel(
                    gpytorch.kernels.RBFKernel(), grid_size=128, num_dims=1
                )
            ), num_dims=2
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPRegressionModel(train_x, train_y, likelihood)


Essentially, the AdditiveStructureKernel makes the base kernel (in this case, the GridInterpolationKernel wrapping the RBFKernel) act as a 1D kernel on each data dimension. The final kernel matrix will be a sum of these 1D kernel matrices.
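A tiny pure-Python sketch of that statement (illustrative function names; a scalar RBF stands in for the wrapped SKI kernel): the additive kernel on 2D inputs is just a 1D kernel applied to each coordinate and summed.

```python
import math

def rbf_1d(a, b, lengthscale=0.5):
    """1D RBF kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def additive_kernel(x, z):
    """Sum of the same 1D base kernel applied dimension-by-dimension."""
    return sum(rbf_1d(xi, zi) for xi, zi in zip(x, z))

x, z = (0.1, 0.9), (0.4, 0.2)
k = additive_kernel(x, z)
print(k)  # equals rbf_1d(0.1, 0.4) + rbf_1d(0.9, 0.2)
```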

##### Scaling to more dimensions (without additive structure)¶

If you can’t exploit additive structure, then try one of these other two methods:

1. SKIP - or Scalable Kernel Interpolation for Products
2. KISS-GP with Deep Kernel Learning

#### Scalable Kernel Interpolation for Product Kernels (SKIP)¶

##### Overview¶

In this notebook, we’ll give an overview of how to use SKIP, a method that exploits product structure in some kernels to reduce the dependency of SKI on the data dimensionality from exponential to linear.

The most important practical consideration to note in this notebook is the use of gpytorch.settings.max_root_decomposition_size, which we explain the use of right before the training loop cell.

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

# Make plots inline
%matplotlib inline


For this example notebook, we’ll be using the elevators UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. For this notebook, we’ll simply be splitting the data using the first 80% of the data as training and the last 20% as testing.

Note: Running the next cell will attempt to download a ~400 KB dataset file to the current directory.

:

import urllib.request
import os
from math import floor
from scipy.io import loadmat

# this is for running the notebook in our testing framework
smoke_test = ('CI' in os.environ)

if not smoke_test and not os.path.isfile('../elevators.mat'):
    # The original notebook downloads a preprocessed ~400 KB copy of the
    # dataset to '../elevators.mat' here; the download URL is omitted in
    # this excerpt, so fetch the file manually if it is missing.
    pass

if smoke_test:  # this is for running the notebook in our testing framework
    X, y = torch.randn(1000, 3), torch.randn(1000)
else:
    data = torch.Tensor(loadmat('../elevators.mat')['data'])
    X = data[:, :-1]
    X = X - X.min(0)[0]
    X = 2 * (X / X.max(0)[0]) - 1
    y = data[:, -1]

train_n = int(floor(0.8 * len(X)))
train_x = X[:train_n, :].contiguous()
train_y = y[:train_n].contiguous()

test_x = X[train_n:, :].contiguous()
test_y = y[train_n:].contiguous()

if torch.cuda.is_available():
    train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()

:

X.size()

:

torch.Size([16599, 18])

##### Defining the SKIP GP Model¶

We now define the GP model. For more details on the use of GP models, see our simpler examples. This model uses a GridInterpolationKernel (SKI) with an RBF base kernel. To use SKIP, we make two changes:

• First, we use only a 1 dimensional GridInterpolationKernel (e.g., by passing num_dims=1). The idea of SKIP is to use a product of 1 dimensional GridInterpolationKernels instead of a single d dimensional one.
• Next, we create a ProductStructureKernel that wraps our 1D GridInterpolationKernel with num_dims=18. This specifies that we want to use product structure over 18 dimensions, using the 1D GridInterpolationKernel in each dimension.

Note: If you’ve explored the rest of the package, you may be wondering what the differences between AdditiveKernel, AdditiveStructureKernel, ProductKernel, and ProductStructureKernel are. The Structure kernels (1) assume that we want to apply a single base kernel over a fully decomposed dataset (e.g., every dimension is additive or has product structure), and (2) are significantly more efficient as a result, because they can exploit batch parallel operations instead of using for loops.
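The product decomposition that SKIP exploits is exact for an RBF base kernel, because the multidimensional RBF factors across dimensions. A small stdlib-only check (illustrative names, not the GPyTorch API):

```python
import math

def rbf_1d(a, b, lengthscale=0.5):
    """1D RBF kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

x = (0.1, 0.7, 0.3)
z = (0.5, 0.2, 0.9)

# Product of 1D RBF kernels, one per dimension...
prod = 1.0
for xi, zi in zip(x, z):
    prod *= rbf_1d(xi, zi)

# ...equals the full multidimensional RBF kernel on the same points
sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
full = math.exp(-0.5 * sq_dist / 0.5 ** 2)
print(abs(prod - full))
```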

:

from gpytorch.means import ConstantMean
from gpytorch.kernels import ScaleKernel, RBFKernel, ProductStructureKernel, GridInterpolationKernel
from gpytorch.distributions import MultivariateNormal

class GPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = ConstantMean()
        self.base_covar_module = RBFKernel()
        self.covar_module = ProductStructureKernel(
            ScaleKernel(
                GridInterpolationKernel(self.base_covar_module, grid_size=100, num_dims=1)
            ), num_dims=18
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)

:

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPRegressionModel(train_x, train_y, likelihood)

if torch.cuda.is_available():
    model = model.cuda()
    likelihood = likelihood.cuda()

###### Training the model¶

The training loop for SKIP has one main new feature we haven’t seen before: we specify max_root_decomposition_size. This controls how many iterations of Lanczos we want to use for SKIP, and trades off accuracy against time and, more importantly, space. Realistically, the goal should be to set this as high as possible without running out of memory.

In some sense, this parameter is the main trade-off of SKIP. Whereas many inducing point methods are sensitive to the number of inducing points, SKIP approximates one dimensional kernels and can therefore do very well with relatively few inducing points. The main source of approximation error comes instead from the Lanczos decompositions we perform.
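One way to see why more Lanczos iterations help: a rank-k root decomposition can capture at most the top k eigenvalues of the kernel matrix, so its best possible Frobenius error is the root-sum-square of the discarded tail of the spectrum. A schematic calculation (the eigenvalues below are made up purely for illustration):

```python
import math

# Hypothetical, fast-decaying eigenvalues of a kernel matrix
eigvals = [1.0, 0.25, 0.06, 0.015, 0.004]

def best_rank_k_error(k):
    """Frobenius-norm error of the best rank-k approximation:
    the root-sum-square of the discarded eigenvalues."""
    return math.sqrt(sum(v * v for v in eigvals[k:]))

for k in range(1, len(eigvals) + 1):
    print('rank %d: error %.4f' % (k, best_rank_k_error(k)))
```

The error shrinks monotonically as the rank (i.e. the number of Lanczos iterations) grows, which is why raising max_root_decomposition_size improves accuracy at the cost of memory.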

:

training_iterations = 2 if smoke_test else 50

# Find optimal model hyperparameters
model.train()
likelihood.train()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

def train():
    for i in range(training_iterations):
        # Zero backpropped gradients from previous iteration
        optimizer.zero_grad()
        with gpytorch.settings.use_toeplitz(False), gpytorch.settings.max_root_decomposition_size(30):
            # Get output from model
            output = model(train_x)
            # Calc loss and backprop derivatives
            loss = -mll(output, train_y)
            loss.backward()
        print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
        optimizer.step()
        torch.cuda.empty_cache()

# See dkl_mnist.ipynb for explanation of this flag
with gpytorch.settings.use_toeplitz(True):
    %time train()

Iter 1/25 - Loss: 0.942
Iter 2/25 - Loss: 0.919
Iter 3/25 - Loss: 0.888
Iter 4/25 - Loss: 0.864
Iter 5/25 - Loss: 0.840
Iter 6/25 - Loss: 0.816
Iter 7/25 - Loss: 0.792
Iter 8/25 - Loss: 0.767
Iter 9/25 - Loss: 0.743
Iter 10/25 - Loss: 0.719
Iter 11/25 - Loss: 0.698
Iter 12/25 - Loss: 0.671
Iter 13/25 - Loss: 0.651
Iter 14/25 - Loss: 0.624
Iter 15/25 - Loss: 0.600
Iter 16/25 - Loss: 0.576
Iter 17/25 - Loss: 0.553
Iter 18/25 - Loss: 0.529
Iter 19/25 - Loss: 0.506
Iter 20/25 - Loss: 0.483
Iter 21/25 - Loss: 0.460
Iter 22/25 - Loss: 0.441
Iter 23/25 - Loss: 0.413
Iter 24/25 - Loss: 0.391
Iter 25/25 - Loss: 0.375
CPU times: user 1min 6s, sys: 26.8 s, total: 1min 33s
Wall time: 1min 48s

###### Making Predictions¶

The next cell makes predictions with SKIP. We use the same max_root_decomposition size, and we also demonstrate increasing the max preconditioner size. Increasing the preconditioner size on this dataset is not necessary, but can make a big difference in final test performance, and is often preferable to increasing the number of CG iterations if you can afford the space.

:

model.eval()
likelihood.eval()
with gpytorch.settings.use_toeplitz(False), gpytorch.settings.max_root_decomposition_size(30), gpytorch.settings.fast_pred_var():
    preds = model(test_x)

:

print('Test MAE: {}'.format(torch.mean(torch.abs(preds.mean - test_y))))

Test MAE: 0.07745790481567383





### Structure-Exploiting Kernels¶

If your data lies on a Euclidean grid, and your GP uses a stationary kernel, the computations can be sped up dramatically. See the Grid Regression example for more info.

#### GP Regression with Grid Structured Training Data¶

In this notebook, we demonstrate how to perform GP regression when your training data lies on a regularly spaced grid. For this example, we’ll be modeling a 2D function where the training data is on an evenly spaced grid on (0,1)x(0, 2) with 100 grid points in each dimension.

In other words, we have 10000 training examples. However, the grid structure of the training data will allow us to perform inference very quickly anyways.

:

import gpytorch
import torch
import math

##### Make the grid and training data¶

In the next cell, we create the grid, along with the 10000 training examples and labels. After running this cell, we create three important tensors:

• grid is a tensor that is grid_size x 2 and contains the 1D grid for each dimension.
• train_x is a tensor containing the full 10000 training examples.
• train_y are the labels. For this, we’re just using a simple sine function.
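Conceptually, create_data_from_grid takes the Cartesian product of the per-dimension 1D grids to form the full set of training inputs. A stdlib-only sketch of the same idea (tiny 3-point grids for illustration):

```python
import itertools

# One 1D grid per dimension
grid_dim0 = [0.0, 0.5, 1.0]
grid_dim1 = [0.0, 1.0, 2.0]

# Full training inputs = Cartesian product of the 1D grids
train_points = list(itertools.product(grid_dim0, grid_dim1))
print(len(train_points))   # 3 * 3 = 9 points
print(train_points[:3])
```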
:

grid_bounds = [(0, 1), (0, 2)]
grid_size = 100
grid = torch.zeros(grid_size, len(grid_bounds))
for i in range(len(grid_bounds)):
    grid_diff = float(grid_bounds[i][1] - grid_bounds[i][0]) / (grid_size - 2)
    grid[:, i] = torch.linspace(grid_bounds[i][0] - grid_diff, grid_bounds[i][1] + grid_diff, grid_size)

train_x = gpytorch.utils.grid.create_data_from_grid(grid)
train_y = torch.sin((train_x[:, 0] + train_x[:, 1]) * (2 * math.pi)) + torch.randn_like(train_x[:, 0]).mul(0.01)

##### Creating the Grid GP Model¶

In the next cell we create our GP model. Like other scalable GP methods, we’ll use a scalable kernel that wraps a base kernel. In this case, we create a GridKernel that wraps an RBFKernel.

:

class GridGPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, grid, train_x, train_y, likelihood):
        super(GridGPRegressionModel, self).__init__(train_x, train_y, likelihood)
        num_dims = train_x.size(-1)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.GridKernel(gpytorch.kernels.RBFKernel(), grid=grid)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GridGPRegressionModel(grid, train_x, train_y, likelihood)

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iter = 2 if smoke_test else 50

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()

Iter 1/50 - Loss: 0.978   lengthscale: 0.693   noise: 0.693
Iter 2/50 - Loss: 0.933   lengthscale: 0.644   noise: 0.644
Iter 3/50 - Loss: 0.880   lengthscale: 0.598   noise: 0.598
Iter 4/50 - Loss: 0.816   lengthscale: 0.554   noise: 0.554
Iter 5/50 - Loss: 0.738   lengthscale: 0.513   noise: 0.513
Iter 6/50 - Loss: 0.655   lengthscale: 0.474   noise: 0.473
Iter 7/50 - Loss: 0.575   lengthscale: 0.438   noise: 0.436
Iter 8/50 - Loss: 0.507   lengthscale: 0.403   noise: 0.401
Iter 9/50 - Loss: 0.450   lengthscale: 0.372   noise: 0.368
Iter 10/50 - Loss: 0.398   lengthscale: 0.345   noise: 0.338
Iter 11/50 - Loss: 0.351   lengthscale: 0.321   noise: 0.309
Iter 12/50 - Loss: 0.306   lengthscale: 0.301   noise: 0.282
Iter 13/50 - Loss: 0.259   lengthscale: 0.283   noise: 0.258
Iter 14/50 - Loss: 0.214   lengthscale: 0.269   noise: 0.235
Iter 15/50 - Loss: 0.167   lengthscale: 0.256   noise: 0.213
Iter 16/50 - Loss: 0.120   lengthscale: 0.245   noise: 0.194
Iter 17/50 - Loss: 0.075   lengthscale: 0.236   noise: 0.176
Iter 18/50 - Loss: 0.027   lengthscale: 0.228   noise: 0.160
Iter 19/50 - Loss: -0.021   lengthscale: 0.222   noise: 0.145
Iter 20/50 - Loss: -0.067   lengthscale: 0.217   noise: 0.131
Iter 21/50 - Loss: -0.118   lengthscale: 0.213   noise: 0.118
Iter 22/50 - Loss: -0.167   lengthscale: 0.210   noise: 0.107
Iter 23/50 - Loss: -0.213   lengthscale: 0.208   noise: 0.097
Iter 24/50 - Loss: -0.267   lengthscale: 0.207   noise: 0.087
Iter 25/50 - Loss: -0.311   lengthscale: 0.207   noise: 0.079
Iter 26/50 - Loss: -0.363   lengthscale: 0.208   noise: 0.071
Iter 27/50 - Loss: -0.414   lengthscale: 0.209   noise: 0.064
Iter 28/50 - Loss: -0.460   lengthscale: 0.212   noise: 0.057
Iter 29/50 - Loss: -0.514   lengthscale: 0.216   noise: 0.052
Iter 30/50 - Loss: -0.563   lengthscale: 0.221   noise: 0.047
Iter 31/50 - Loss: -0.616   lengthscale: 0.227   noise: 0.042
Iter 32/50 - Loss: -0.668   lengthscale: 0.234   noise: 0.038
Iter 33/50 - Loss: -0.717   lengthscale: 0.243   noise: 0.034
Iter 34/50 - Loss: -0.773   lengthscale: 0.253   noise: 0.031
Iter 35/50 - Loss: -0.821   lengthscale: 0.264   noise: 0.027
Iter 36/50 - Loss: -0.868   lengthscale: 0.276   noise: 0.025
Iter 37/50 - Loss: -0.927   lengthscale: 0.290   noise: 0.022
Iter 38/50 - Loss: -0.981   lengthscale: 0.305   noise: 0.020
Iter 39/50 - Loss: -1.026   lengthscale: 0.320   noise: 0.018
Iter 40/50 - Loss: -1.084   lengthscale: 0.338   noise: 0.016
Iter 41/50 - Loss: -1.133   lengthscale: 0.355   noise: 0.014
Iter 42/50 - Loss: -1.183   lengthscale: 0.371   noise: 0.013
Iter 43/50 - Loss: -1.231   lengthscale: 0.385   noise: 0.012
Iter 44/50 - Loss: -1.277   lengthscale: 0.397   noise: 0.011
Iter 45/50 - Loss: -1.323   lengthscale: 0.402   noise: 0.009
Iter 46/50 - Loss: -1.374   lengthscale: 0.403   noise: 0.008
Iter 47/50 - Loss: -1.426   lengthscale: 0.401   noise: 0.008
Iter 48/50 - Loss: -1.481   lengthscale: 0.395   noise: 0.007
Iter 49/50 - Loss: -1.529   lengthscale: 0.388   noise: 0.006
Iter 50/50 - Loss: -1.576   lengthscale: 0.379   noise: 0.006


In the next cell, we create a set of 400 test examples and make predictions. Note that unlike other scalable GP methods, testing is more complicated. Because our test data can be different from the training data, in general we may not be able to avoid creating a num_train x num_test (e.g., 10000 x 400) kernel matrix between the training and test data.

For this reason, if you have large numbers of test points, memory may become a concern. The time complexity should still be reasonable, however, because we will still exploit structure in the train-train covariance matrix.
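To make the memory concern concrete, here is a quick estimate (assuming float32 storage; a sketch, not a profiler measurement) of the dense train-test cross-covariance matrix from the example above:

```python
n_train, n_test = 10000, 400
bytes_per_float32 = 4

# Dense cross-covariance between training and test points
cross_covar_bytes = n_train * n_test * bytes_per_float32
print('%.0f MB' % (cross_covar_bytes / 1e6))  # 10000 x 400 float32 matrix
```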

:

model.eval()
likelihood.eval()
n = 20
test_x = torch.zeros(int(pow(n, 2)), 2)
for i in range(n):
    for j in range(n):
        test_x[i * n + j][0] = float(i) / (n-1)
        test_x[i * n + j][1] = float(j) / (n-1)

observed_pred = likelihood(model(test_x))

:

import matplotlib.pyplot as plt
%matplotlib inline

pred_labels = observed_pred.mean.view(n, n)

# Calc absolute error
test_y_actual = torch.sin(((test_x[:, 0] + test_x[:, 1]) * (2 * math.pi))).view(n, n)
delta_y = torch.abs(pred_labels - test_y_actual).detach().numpy()

# Define a plotting function
def ax_plot(f, ax, y_labels, title):
    if smoke_test: return  # this is for running the notebook in our testing framework
    im = ax.imshow(y_labels)
    ax.set_title(title)
    f.colorbar(im)

# Plot our predictive means
f, observed_ax = plt.subplots(1, 1, figsize=(4, 3))
ax_plot(f, observed_ax, pred_labels, 'Predicted Values (Likelihood)')

# Plot the true values
f, observed_ax2 = plt.subplots(1, 1, figsize=(4, 3))
ax_plot(f, observed_ax2, test_y_actual, 'Actual Values (Likelihood)')

# Plot the absolute errors
f, observed_ax3 = plt.subplots(1, 1, figsize=(4, 3))
ax_plot(f, observed_ax3, delta_y, 'Absolute Error Surface')




## Multitask/Multioutput GPs with Exact Inference¶

Exact GPs can be used to model vector valued functions, or functions that represent multiple tasks. There are several different cases:

### Multi-output (vector valued functions)¶

##### Introduction¶

Multitask regression, introduced in this paper, learns similarities in the outputs simultaneously. It’s useful when you are performing regression on multiple functions that share the same inputs, especially if they have similarities (such as being sinusoidal).

Given inputs $x$ and $x'$, and tasks $i$ and $j$, the covariance between two datapoints and two tasks is given by

$k([x, i], [x', j]) = k_\text{inputs}(x, x') * k_\text{tasks}(i, j)$

where $k_\text{inputs}$ is a standard kernel (e.g. RBF) that operates on the inputs, and $k_\text{tasks}$ is a lookup table containing inter-task covariances.
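A pure-Python sketch of this covariance structure (the 2x2 inter-task table below contains made-up values purely for illustration):

```python
import math

def k_inputs(x, z, lengthscale=0.5):
    """Standard RBF kernel on the inputs."""
    return math.exp(-0.5 * ((x - z) / lengthscale) ** 2)

# Inter-task covariance "lookup table" (illustrative values)
k_tasks = [[1.0, 0.6],
           [0.6, 1.0]]

def multitask_k(x, i, z, j):
    """Covariance between (input x, task i) and (input z, task j)."""
    return k_inputs(x, z) * k_tasks[i][j]

# Same input, different tasks: covariance scaled by the task similarity
print(multitask_k(0.3, 0, 0.3, 1))  # k_inputs(0.3, 0.3) = 1, so this is 0.6
```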

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline


###### Set up training data¶

In the next cell, we set up the training data for this example. We’ll be using 100 regularly spaced points on [0,1] which we evaluate the function on and add Gaussian noise to get the training labels.

We’ll have two functions - a sine function (y1) and a cosine function (y2).

For MTGPs, our train_targets will actually have two dimensions: with the second dimension corresponding to the different tasks.

:

train_x = torch.linspace(0, 1, 100)

train_y = torch.stack([
torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2,
torch.cos(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2,
], -1)


The model should be somewhat similar to the ExactGP model in the simple regression example. The differences:

1. We’re going to wrap ConstantMean with a MultitaskMean. This makes sure we have a mean function for each task.
2. Rather than just using a RBFKernel, we’re using that in conjunction with a MultitaskKernel. This gives us the covariance function described in the introduction.
3. We’re using a MultitaskMultivariateNormal and MultitaskGaussianLikelihood. This allows us to deal with the predictions/outputs in a nice way. For example, when we call MultitaskMultivariateNormal.mean, we get a n x num_tasks matrix back.

You may also notice that we don’t use a ScaleKernel, since the IndexKernel will do some scaling for us. (This way we’re not overparameterizing the kernel.)

:

class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=2
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel(), num_tasks=2, rank=1
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
model = MultitaskGPModel(train_x, train_y, likelihood)

###### Train the model hyperparameters¶
:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iterations = 2 if smoke_test else 50

# Find optimal model hyperparameters
model.train()
likelihood.train()

optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
    optimizer.step()

Iter 1/50 - Loss: 47.568
Iter 2/50 - Loss: 42.590
Iter 3/50 - Loss: 37.327
Iter 4/50 - Loss: 32.383
Iter 5/50 - Loss: 27.693
Iter 6/50 - Loss: 22.967
Iter 7/50 - Loss: 18.709
Iter 8/50 - Loss: 13.625
Iter 9/50 - Loss: 9.454
Iter 10/50 - Loss: 3.937
Iter 11/50 - Loss: -0.266
Iter 12/50 - Loss: -5.492
Iter 13/50 - Loss: -9.174
Iter 14/50 - Loss: -14.201
Iter 15/50 - Loss: -17.646
Iter 16/50 - Loss: -23.065
Iter 17/50 - Loss: -27.227
Iter 18/50 - Loss: -31.771
Iter 19/50 - Loss: -35.461
Iter 20/50 - Loss: -40.396
Iter 21/50 - Loss: -43.209
Iter 22/50 - Loss: -48.011
Iter 23/50 - Loss: -52.596
Iter 24/50 - Loss: -55.427
Iter 25/50 - Loss: -58.277
Iter 26/50 - Loss: -62.170
Iter 27/50 - Loss: -66.251
Iter 28/50 - Loss: -68.859
Iter 29/50 - Loss: -71.799
Iter 30/50 - Loss: -74.687
Iter 31/50 - Loss: -77.924
Iter 32/50 - Loss: -80.209
Iter 33/50 - Loss: -82.885
Iter 34/50 - Loss: -85.627
Iter 35/50 - Loss: -87.761
Iter 36/50 - Loss: -88.781
Iter 37/50 - Loss: -88.784
Iter 38/50 - Loss: -90.362
Iter 39/50 - Loss: -92.546
Iter 40/50 - Loss: -92.249
Iter 41/50 - Loss: -93.311
Iter 42/50 - Loss: -92.987
Iter 43/50 - Loss: -93.307
Iter 44/50 - Loss: -93.322
Iter 45/50 - Loss: -92.269
Iter 46/50 - Loss: -91.461
Iter 47/50 - Loss: -90.908
Iter 48/50 - Loss: -92.142
Iter 49/50 - Loss: -93.466
Iter 50/50 - Loss: -90.492

###### Make predictions with the model¶
:

# Set into eval mode
model.eval()
likelihood.eval()

# Initialize plots
f, (y1_ax, y2_ax) = plt.subplots(1, 2, figsize=(8, 3))

# Make predictions
test_x = torch.linspace(0, 1, 51)
predictions = likelihood(model(test_x))
mean = predictions.mean
lower, upper = predictions.confidence_region()

# The mean is an n x num_tasks matrix
# Column 0 holds predictions for the first task, column 1 for the second

# Plot training data as black stars
y1_ax.plot(train_x.detach().numpy(), train_y[:, 0].detach().numpy(), 'k*')
# Predictive mean as blue line
y1_ax.plot(test_x.numpy(), mean[:, 0].numpy(), 'b')
y1_ax.fill_between(test_x.numpy(), lower[:, 0].numpy(), upper[:, 0].numpy(), alpha=0.5)
y1_ax.set_ylim([-3, 3])
y1_ax.legend(['Observed Data', 'Mean', 'Confidence'])
y1_ax.set_title('Observed Values (Likelihood)')

# Plot training data as black stars
y2_ax.plot(train_x.detach().numpy(), train_y[:, 1].detach().numpy(), 'k*')
# Predictive mean as blue line
y2_ax.plot(test_x.numpy(), mean[:, 1].numpy(), 'b')
y2_ax.fill_between(test_x.numpy(), lower[:, 1].numpy(), upper[:, 1].numpy(), alpha=0.5)
y2_ax.set_ylim([-3, 3])
y2_ax.legend(['Observed Data', 'Mean', 'Confidence'])
y2_ax.set_title('Observed Values (Likelihood)')

None




#### Batch Independent Multioutput GP¶

##### Introduction¶

This notebook demonstrates how to wrap independent GP models into a convenient Multi-Output GP model. It uses batch dimensions for efficient computation. Unlike in the Multitask GP Example, this does not model correlations between outcomes, but treats outcomes independently.

This type of model is useful if:

• the number of training / test points is equal for the different outcomes
• you are using the same covariance modules and / or likelihoods for each outcome

For non-block designs (i.e. when the above points do not apply), you should instead use a ModelList GP as described in the ModelList multioutput example.

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline

###### Set up training data¶

In the next cell, we set up the training data for this example. We’ll be using 100 regularly spaced points on [0,1] which we evaluate the function on and add Gaussian noise to get the training labels.

We’ll have two functions - a sine function (y1) and a cosine function (y2).

For MTGPs, our train_targets will actually have two dimensions: with the second dimension corresponding to the different tasks.

:

train_x = torch.linspace(0, 1, 100)

train_y = torch.stack([
torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2,
torch.cos(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2,
], -1)

##### Define a batch GP model¶

The model should be somewhat similar to the ExactGP model in the simple regression example. The differences:

1. The model will use the batch dimension to learn multiple independent GPs simultaneously.
2. We’re going to give the mean and covariance modules a batch_shape argument. This allows us to learn different hyperparameters for each model.
3. The model will return a MultitaskMultivariateNormal distribution rather than a MultivariateNormal. We will construct this distribution to convert the batch dimensions into distinct outputs.
:

class BatchIndependentMultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean(batch_shape=torch.Size([2]))
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(batch_shape=torch.Size([2])),
            batch_shape=torch.Size([2])
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal.from_batch_mvn(
            gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
        )


likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
model = BatchIndependentMultitaskGPModel(train_x, train_y, likelihood)


###### Train the model hyperparameters¶
:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iterations = 2 if smoke_test else 50

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
    optimizer.step()

Iter 1/50 - Loss: 122.929
Iter 2/50 - Loss: 119.192
Iter 3/50 - Loss: 115.233
Iter 4/50 - Loss: 111.096
Iter 5/50 - Loss: 106.824
Iter 6/50 - Loss: 102.458
Iter 7/50 - Loss: 98.038
Iter 8/50 - Loss: 93.608
Iter 9/50 - Loss: 89.226
Iter 10/50 - Loss: 84.959
Iter 11/50 - Loss: 80.849
Iter 12/50 - Loss: 76.895
Iter 13/50 - Loss: 73.049
Iter 14/50 - Loss: 69.255
Iter 15/50 - Loss: 65.470
Iter 16/50 - Loss: 61.671
Iter 17/50 - Loss: 57.848
Iter 18/50 - Loss: 54.001
Iter 19/50 - Loss: 50.137
Iter 20/50 - Loss: 46.268
Iter 21/50 - Loss: 42.412
Iter 22/50 - Loss: 38.591
Iter 23/50 - Loss: 34.830
Iter 24/50 - Loss: 31.155
Iter 25/50 - Loss: 27.596
Iter 26/50 - Loss: 24.183
Iter 27/50 - Loss: 20.947
Iter 28/50 - Loss: 17.915
Iter 29/50 - Loss: 15.109
Iter 30/50 - Loss: 12.543
Iter 31/50 - Loss: 10.213
Iter 32/50 - Loss: 8.110
Iter 33/50 - Loss: 6.225
Iter 34/50 - Loss: 4.556
Iter 35/50 - Loss: 3.111
Iter 36/50 - Loss: 1.903
Iter 37/50 - Loss: 0.948
Iter 38/50 - Loss: 0.252
Iter 39/50 - Loss: -0.192
Iter 40/50 - Loss: -0.410
Iter 41/50 - Loss: -0.438
Iter 42/50 - Loss: -0.323
Iter 43/50 - Loss: -0.113
Iter 44/50 - Loss: 0.145
Iter 45/50 - Loss: 0.410
Iter 46/50 - Loss: 0.648
Iter 47/50 - Loss: 0.836
Iter 48/50 - Loss: 0.961
Iter 49/50 - Loss: 1.016
Iter 50/50 - Loss: 1.004

###### Make predictions with the model¶
:

# Set into eval mode
model.eval()
likelihood.eval()

# Initialize plots
f, (y1_ax, y2_ax) = plt.subplots(1, 2, figsize=(8, 3))

# Make predictions
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    test_x = torch.linspace(0, 1, 51)
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()

# The mean and confidence bounds have shape (51, 2);
# the second dimension indexes the two tasks

# Plot training data as black stars
y1_ax.plot(train_x.detach().numpy(), train_y[:, 0].detach().numpy(), 'k*')
# Predictive mean as blue line
y1_ax.plot(test_x.numpy(), mean[:, 0].numpy(), 'b')
y1_ax.fill_between(test_x.numpy(), lower[:, 0].numpy(), upper[:, 0].numpy(), alpha=0.5)
y1_ax.set_ylim([-3, 3])
y1_ax.legend(['Observed Data', 'Mean', 'Confidence'])
y1_ax.set_title('Observed Values (Likelihood)')

# Plot training data as black stars
y2_ax.plot(train_x.detach().numpy(), train_y[:, 1].detach().numpy(), 'k*')
# Predictive mean as blue line
y2_ax.plot(test_x.numpy(), mean[:, 1].numpy(), 'b')
y2_ax.fill_between(test_x.numpy(), lower[:, 1].numpy(), upper[:, 1].numpy(), alpha=0.5)
y2_ax.set_ylim([-3, 3])
y2_ax.legend(['Observed Data', 'Mean', 'Confidence'])
y2_ax.set_title('Observed Values (Likelihood)')





#### ModelList (Multi-Output) GP Regression¶

##### Introduction¶

This notebook demonstrates how to wrap independent GP models into a convenient Multi-Output GP model using a ModelList.

Unlike in the Multitask case, this model does not learn correlations between outcomes, but treats the outcomes independently. This is equivalent to setting up a separate GP for each outcome, but can be much more convenient to handle; in particular, it does not require manually looping over models when fitting or predicting.

This type of model is useful if:

• the number of training / test points differs between the outcomes
• you use different covariance modules and / or likelihoods for each outcome

For block designs (i.e. when the above points do not apply), you should instead use a batch mode GP as described in the batch independent multioutput example. This will be much faster because it uses additional parallelism.

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline

###### Set up training data¶

In the next cell, we set up the training data for this example. We’ll be using a different number of training examples for the different GPs.

:

train_x1 = torch.linspace(0, 0.95, 50) + 0.05 * torch.rand(50)
train_x2 = torch.linspace(0, 0.95, 25) + 0.05 * torch.rand(25)

train_y1 = torch.sin(train_x1 * (2 * math.pi)) + 0.2 * torch.randn_like(train_x1)
train_y2 = torch.cos(train_x2 * (2 * math.pi)) + 0.2 * torch.randn_like(train_x2)

##### Set up the sub-models¶

Each individual model uses the ExactGP model from the simple regression example.

:

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood1 = gpytorch.likelihoods.GaussianLikelihood()
model1 = ExactGPModel(train_x1, train_y1, likelihood1)

likelihood2 = gpytorch.likelihoods.GaussianLikelihood()
model2 = ExactGPModel(train_x2, train_y2, likelihood2)


We now collect the submodels in an IndependentModelList, and the respective likelihoods in a LikelihoodList. These are container modules that make it easy to work with multiple outputs. In particular, they will take in and return lists of inputs / outputs and delegate the data to / from the appropriate sub-model (it is important that the order of the inputs / outputs corresponds to the order of models with which the containers were instantiated).

:

model = gpytorch.models.IndependentModelList(model1, model2)
likelihood = gpytorch.likelihoods.LikelihoodList(model1.likelihood, model2.likelihood)

###### Set up overall Marginal Log Likelihood¶

Assuming independence, the MLL for the container model is simply the sum of the MLLs for the individual models. SumMarginalLogLikelihood is a convenient container for this (by default it uses an ExactMarginalLogLikelihood for each submodel)

:

from gpytorch.mlls import SumMarginalLogLikelihood

mll = SumMarginalLogLikelihood(likelihood, model)

###### Train the model hyperparameters¶

With the containers in place, the models can be trained in a single loop on the container (note that this means that optimization is performed jointly, which can be an issue if the individual submodels require training via very different step sizes).

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iterations = 2 if smoke_test else 50

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes all submodel and all likelihood parameters
], lr=0.1)

for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(*model.train_inputs)
    loss = -mll(output, model.train_targets)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
    optimizer.step()

Iter 1/50 - Loss: 1.052
Iter 2/50 - Loss: 1.020
Iter 3/50 - Loss: 0.983
Iter 4/50 - Loss: 0.942
Iter 5/50 - Loss: 0.898
Iter 6/50 - Loss: 0.851
Iter 7/50 - Loss: 0.803
Iter 8/50 - Loss: 0.754
Iter 9/50 - Loss: 0.707
Iter 10/50 - Loss: 0.663
Iter 11/50 - Loss: 0.624
Iter 12/50 - Loss: 0.589
Iter 13/50 - Loss: 0.557
Iter 14/50 - Loss: 0.523
Iter 15/50 - Loss: 0.493
Iter 16/50 - Loss: 0.460
Iter 17/50 - Loss: 0.424
Iter 18/50 - Loss: 0.395
Iter 19/50 - Loss: 0.362
Iter 20/50 - Loss: 0.331
Iter 21/50 - Loss: 0.304
Iter 22/50 - Loss: 0.261
Iter 23/50 - Loss: 0.231
Iter 24/50 - Loss: 0.202
Iter 25/50 - Loss: 0.182
Iter 26/50 - Loss: 0.153
Iter 27/50 - Loss: 0.138
Iter 28/50 - Loss: 0.121
Iter 29/50 - Loss: 0.104
Iter 30/50 - Loss: 0.092
Iter 31/50 - Loss: 0.078
Iter 32/50 - Loss: 0.067
Iter 33/50 - Loss: 0.058
Iter 34/50 - Loss: 0.060
Iter 35/50 - Loss: 0.046
Iter 36/50 - Loss: 0.055
Iter 37/50 - Loss: 0.046
Iter 38/50 - Loss: 0.053
Iter 39/50 - Loss: 0.063
Iter 40/50 - Loss: 0.059
Iter 41/50 - Loss: 0.057
Iter 42/50 - Loss: 0.067
Iter 43/50 - Loss: 0.086
Iter 44/50 - Loss: 0.070
Iter 45/50 - Loss: 0.067
Iter 46/50 - Loss: 0.081
Iter 47/50 - Loss: 0.075
Iter 48/50 - Loss: 0.068
Iter 49/50 - Loss: 0.071
Iter 50/50 - Loss: 0.060

###### Make predictions with the model¶
:

# Set into eval mode
model.eval()
likelihood.eval()

# Initialize plots
f, axs = plt.subplots(1, 2, figsize=(8, 3))

# Make predictions (use the same test points)
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    test_x = torch.linspace(0, 1, 51)
    # This contains predictions for both outcomes as a list
    predictions = likelihood(*model(test_x, test_x))

for submodel, prediction, ax in zip(model.models, predictions, axs):
    mean = prediction.mean
    lower, upper = prediction.confidence_region()

    # train_inputs is a tuple of tensors; take the first (and only) input
    tr_x = submodel.train_inputs[0].detach().numpy()
    tr_y = submodel.train_targets.detach().numpy()

    # Plot training data as black stars
    ax.plot(tr_x, tr_y, 'k*')
    # Predictive mean as blue line
    ax.plot(test_x.numpy(), mean.numpy(), 'b')
    # Shade in confidence region
    ax.fill_between(test_x.numpy(), lower.detach().numpy(), upper.detach().numpy(), alpha=0.5)
    ax.set_ylim([-3, 3])
    ax.legend(['Observed Data', 'Mean', 'Confidence'])
    ax.set_title('Observed Values (Likelihood)')

### Scalar function with multiple tasks¶

See the Hadamard Multitask GP Regression example. This setting should be used only when each input corresponds to a single task.

#### Hadamard Multitask GP Regression¶

##### Introduction¶

This notebook demonstrates how to perform “Hadamard” multitask regression. This differs from the multitask gp regression example notebook in one key way:

• Here, we assume that we have observations for one task per input. For each input, we specify the task of the input that we observe. (The kernel that we learn is expressed as a Hadamard product of an input kernel and a task kernel)
• In the other notebook, we assume that we observe all tasks per input. (The kernel in that notebook is the Kronecker product of an input kernel and a task kernel).

Multitask regression, first introduced in this paper, learns similarities in the outputs simultaneously. It’s useful when you are performing regression on multiple functions that share the same inputs, especially if they have similarities (such as being sinusoidal).

Given inputs $$x$$ and $$x'$$, and tasks $$i$$ and $$j$$, the covariance between two datapoints and two tasks is given by

$k([x, i], [x', j]) = k_\text{inputs}(x, x') * k_\text{tasks}(i, j)$

where $$k_\text{inputs}$$ is a standard kernel (e.g. RBF) that operates on the inputs, and $$k_\text{tasks}$$ is a special kernel - the IndexKernel - which is a lookup table containing inter-task covariances.
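To make the formula concrete, here is a small hand-worked sketch of the Hadamard (elementwise) product for two datapoints. The kernel values below are made up purely for illustration; in the actual model they would come from the RBF kernel and the IndexKernel:

```python
# Two datapoints: (x_a, task 0) and (x_b, task 1).
# Hypothetical kernel values, for illustration only.
K_inputs = [[1.0, 0.5],   # k_inputs(x_a, x_b): input-input covariance
            [0.5, 1.0]]
K_tasks = [[1.0, 0.3],    # k_tasks(i_a, i_b): IndexKernel lookup table
           [0.3, 1.0]]

# Elementwise (Hadamard) product gives the joint covariance k([x, i], [x', j])
K = [[K_inputs[a][b] * K_tasks[a][b] for b in range(2)] for a in range(2)]
print(K)  # [[1.0, 0.15], [0.15, 1.0]]
```

Note how a small task covariance (0.3) shrinks the cross-task entries of the joint covariance while leaving the within-point entries intact.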

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline

###### Set up training data¶

In the next cell, we set up the training data for this example. For each task we’ll be using 50 random points on [0,1), which we evaluate the function on and add Gaussian noise to get the training labels. Note that different inputs are used for each task.

We’ll have two functions - a sine function (y1) and a cosine function (y2).

:

train_x1 = torch.rand(50)
train_x2 = torch.rand(50)

train_y1 = torch.sin(train_x1 * (2 * math.pi)) + torch.randn(train_x1.size()) * 0.2
train_y2 = torch.cos(train_x2 * (2 * math.pi)) + torch.randn(train_x2.size()) * 0.2


The model should be somewhat similar to the ExactGP model in the simple regression example.

The differences:

1. The model takes two inputs: the data (x) and the task indices (i). The indices indicate which task each observation belongs to.
2. Rather than just using an RBFKernel, we’re using it in conjunction with an IndexKernel.
3. We don’t use a ScaleKernel, since the IndexKernel will do some scaling for us. (This way we’re not overparameterizing the kernel.)
:

class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.RBFKernel()

        # We learn an IndexKernel for 2 tasks
        # (so we'll actually learn 2x2=4 tasks with correlations)
        self.task_covar_module = gpytorch.kernels.IndexKernel(num_tasks=2, rank=1)

    def forward(self, x, i):
        mean_x = self.mean_module(x)
        # Get input-input covariance
        covar_x = self.covar_module(x)
        # Get task-task covariance
        covar_i = self.task_covar_module(i)
        # Multiply the two together to get the covariance we want
        covar = covar_x.mul(covar_i)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar)

likelihood = gpytorch.likelihoods.GaussianLikelihood()

# Task indices: 0 for the sine data, 1 for the cosine data
train_i_task1 = torch.full((train_x1.shape[0], 1), dtype=torch.long, fill_value=0)
train_i_task2 = torch.full((train_x2.shape[0], 1), dtype=torch.long, fill_value=1)

full_train_x = torch.cat([train_x1, train_x2])
full_train_i = torch.cat([train_i_task1, train_i_task2])
full_train_y = torch.cat([train_y1, train_y2])

# Here we have two items that we're passing in as train_inputs
model = MultitaskGPModel((full_train_x, full_train_i), full_train_y, likelihood)

###### Training the model¶

In the next cell, we handle using Type-II MLE to train the hyperparameters of the Gaussian process.

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iterations = 2 if smoke_test else 50

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(full_train_x, full_train_i)
    loss = -mll(output, full_train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
    optimizer.step()

Iter 1/50 - Loss: 0.956
Iter 2/50 - Loss: 0.916
Iter 3/50 - Loss: 0.877
Iter 4/50 - Loss: 0.839
Iter 5/50 - Loss: 0.803
Iter 6/50 - Loss: 0.767
Iter 7/50 - Loss: 0.730
Iter 8/50 - Loss: 0.692
Iter 9/50 - Loss: 0.653
Iter 10/50 - Loss: 0.613
Iter 11/50 - Loss: 0.575
Iter 12/50 - Loss: 0.539
Iter 13/50 - Loss: 0.504
Iter 14/50 - Loss: 0.470
Iter 15/50 - Loss: 0.435
Iter 16/50 - Loss: 0.399
Iter 17/50 - Loss: 0.362
Iter 18/50 - Loss: 0.325
Iter 19/50 - Loss: 0.287
Iter 20/50 - Loss: 0.250
Iter 21/50 - Loss: 0.215
Iter 22/50 - Loss: 0.182
Iter 23/50 - Loss: 0.151
Iter 24/50 - Loss: 0.122
Iter 25/50 - Loss: 0.093
Iter 26/50 - Loss: 0.066
Iter 27/50 - Loss: 0.040
Iter 28/50 - Loss: 0.016
Iter 29/50 - Loss: -0.005
Iter 30/50 - Loss: -0.022
Iter 31/50 - Loss: -0.037
Iter 32/50 - Loss: -0.049
Iter 33/50 - Loss: -0.059
Iter 34/50 - Loss: -0.066
Iter 35/50 - Loss: -0.071
Iter 36/50 - Loss: -0.074
Iter 37/50 - Loss: -0.075
Iter 38/50 - Loss: -0.074
Iter 39/50 - Loss: -0.071
Iter 40/50 - Loss: -0.068
Iter 41/50 - Loss: -0.065
Iter 42/50 - Loss: -0.062
Iter 43/50 - Loss: -0.060
Iter 44/50 - Loss: -0.058
Iter 45/50 - Loss: -0.057
Iter 46/50 - Loss: -0.056
Iter 47/50 - Loss: -0.057
Iter 48/50 - Loss: -0.058
Iter 49/50 - Loss: -0.060
Iter 50/50 - Loss: -0.062

###### Make predictions with the model¶
:

# Set into eval mode
model.eval()
likelihood.eval()

# Initialize plots
f, (y1_ax, y2_ax) = plt.subplots(1, 2, figsize=(8, 3))

# Test points every 0.02 in [0,1]
test_x = torch.linspace(0, 1, 51)

# Make predictions - one task at a time
# We control the task we care about using the task indices

# The gpytorch.settings.fast_pred_var flag activates LOVE (for fast variances)
# See https://arxiv.org/abs/1803.06058
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    test_i_task1 = torch.full((test_x.shape[0], 1), dtype=torch.long, fill_value=0)
    test_i_task2 = torch.full((test_x.shape[0], 1), dtype=torch.long, fill_value=1)

    observed_pred_y1 = likelihood(model(test_x, test_i_task1))
    observed_pred_y2 = likelihood(model(test_x, test_i_task2))

# Define plotting function
def ax_plot(ax, train_y, train_x, rand_var, title):
    # Get lower and upper confidence bounds
    lower, upper = rand_var.confidence_region()
    # Plot training data as black stars
    ax.plot(train_x.detach().numpy(), train_y.detach().numpy(), 'k*')
    # Predictive mean as blue line
    ax.plot(test_x.detach().numpy(), rand_var.mean.detach().numpy(), 'b')
    # Shade in confidence region
    ax.fill_between(test_x.detach().numpy(), lower.detach().numpy(), upper.detach().numpy(), alpha=0.5)
    ax.set_ylim([-3, 3])
    ax.legend(['Observed Data', 'Mean', 'Confidence'])
    ax.set_title(title)

ax_plot(y1_ax, train_y1, train_x1, observed_pred_y1, 'Observed Values (Likelihood)')
ax_plot(y2_ax, train_y2, train_x2, observed_pred_y2, 'Observed Values (Likelihood)')

## Variational and Approximate GPs¶

Variational and approximate Gaussian processes are used in a variety of cases:

• When the GP likelihood is non-Gaussian (e.g. for classification).
• To scale up GP regression (by using stochastic optimization).
• To use GPs as part of larger probabilistic models.

With GPyTorch it is possible to implement various types of approximate GP models. All approximate models consist of the following 3 composable objects:

• VariationalDistribution, which defines the form of the approximate inducing value posterior $$q(\mathbf u)$$.
• VariationalStrategy, which defines how to compute $$q(\mathbf f(\mathbf X))$$ from $$q(\mathbf u)$$.
• _ApproximateMarginalLogLikelihood, which defines the objective function used to learn the approximate posterior (e.g. variational ELBO).

(See the strategy/distribution comparison for examples of the different classes.) The variational documentation has more information on how to use these objects. Here we provide some examples which highlight some of the common use cases:

### Stochastic Variational GP Regression¶

#### Overview¶

In this notebook, we’ll give an overview of how to use SVGP stochastic variational regression (https://arxiv.org/pdf/1411.2005.pdf) to rapidly train using minibatches on the 3droad UCI dataset with hundreds of thousands of training examples. This is one of the more common use-cases of variational inference for GPs.

If you are unfamiliar with variational inference, we recommend the following resources: - Variational Inference: A Review for Statisticians by David M. Blei, Alp Kucukelbir, Jon D. McAuliffe. - Scalable Variational Gaussian Process Classification by James Hensman, Alex Matthews, Zoubin Ghahramani.

:

import tqdm
import math
import torch
import gpytorch
from matplotlib import pyplot as plt

# Make plots inline
%matplotlib inline


For this example notebook, we’ll be using the song UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. For this notebook, we’ll simply be splitting the data using the first 80% of the data as training and the last 20% as testing.

Note: Running the next cell will attempt to download a ~136 MB file to the current directory.

:

import urllib.request
import os
from scipy.io import loadmat
from math import floor

# this is for running the notebook in our testing framework
smoke_test = ('CI' in os.environ)

if not smoke_test and not os.path.isfile('../elevators.mat'):
    print('Downloading \'elevators\' UCI dataset...')
    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1jhWL3YUHvXIaftia4qeAyDwVxo6j1alk', '../elevators.mat')

if smoke_test:  # this is for running the notebook in our testing framework
    X, y = torch.randn(1000, 3), torch.randn(1000)
else:
    data = torch.Tensor(loadmat('../elevators.mat')['data'])
    X = data[:, :-1]
    X = X - X.min(0)[0]
    X = 2 * (X / X.max(0)[0]) - 1
    y = data[:, -1]

train_n = int(floor(0.8 * len(X)))
train_x = X[:train_n, :].contiguous()
train_y = y[:train_n].contiguous()

test_x = X[train_n:, :].contiguous()
test_y = y[train_n:].contiguous()

if torch.cuda.is_available():
    train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()


The next step is to create a torch DataLoader that will handle getting us random minibatches of data. This involves using the standard TensorDataset and DataLoader modules provided by PyTorch.

In this notebook we’ll be using a fairly large batch size of 1024 just to make optimization run faster, but you could of course change this as you so choose.

:

from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(train_x, train_y)
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

test_dataset = TensorDataset(test_x, test_y)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)

#### Creating a SVGP Model¶

For most variational/approximate GP models, you will need to construct the following GPyTorch objects:

1. A GP Model (gpytorch.models.ApproximateGP) - This handles basic variational inference.
2. A Variational distribution (gpytorch.variational._VariationalDistribution) - This tells us what form the variational distribution q(u) should take.
3. A Variational strategy (gpytorch.variational._VariationalStrategy) - This tells us how to transform a distribution q(u) over the inducing point values to a distribution q(f) over the latent function values for some input x.

Here, we use a VariationalStrategy with learn_inducing_points=True, and a CholeskyVariationalDistribution. These are the most straightforward and common options.

##### The GP Model¶

The ApproximateGP model is GPyTorch’s simplest approximate inference model. It approximates the true posterior with a distribution specified by a VariationalDistribution, which is most commonly some form of MultivariateNormal distribution. The model defines all the variational parameters that are needed, and keeps all of this information under the hood.

The components of a user built ApproximateGP model in GPyTorch are:

1. An __init__ method that constructs a mean module, a kernel module, a variational distribution object and a variational strategy object. This method should also be responsible for constructing whatever other modules might be necessary.
2. A forward method that takes in some $$n \times d$$ data x and returns a MultivariateNormal with the prior mean and covariance evaluated at x. In other words, we return the vector $$\mu(x)$$ and the $$n \times n$$ matrix $$K_{xx}$$ representing the prior mean and covariance matrix of the GP.
:

from gpytorch.models import ApproximateGP
from gpytorch.variational import CholeskyVariationalDistribution
from gpytorch.variational import VariationalStrategy

class GPModel(ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = CholeskyVariationalDistribution(inducing_points.size(0))
        variational_strategy = VariationalStrategy(self, inducing_points, variational_distribution, learn_inducing_locations=True)
        super(GPModel, self).__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

inducing_points = train_x[:500, :]
model = GPModel(inducing_points=inducing_points)
likelihood = gpytorch.likelihoods.GaussianLikelihood()

if torch.cuda.is_available():
    model = model.cuda()
    likelihood = likelihood.cuda()

###### Training the Model¶

The cell below trains the model above, learning both the hyperparameters of the Gaussian process and the variational parameters in an end-to-end fashion using Type-II MLE.

Unlike when using the exact GP marginal log likelihood, performing variational inference allows us to make use of stochastic optimization techniques. For this example, we’ll run a few epochs of minibatch training, which should be sufficient to achieve reasonable accuracy on this dataset.

The optimization loop differs from the one seen in our more simple tutorials in that it involves looping over both a number of training iterations (epochs) and minibatches of the data. However, the basic process is the same: for each minibatch, we forward through the model, compute the loss (the VariationalELBO or ELBO), call backwards, and do a step of optimization.

:

num_epochs = 1 if smoke_test else 4

model.train()
likelihood.train()

# We use SGD here, rather than Adam. Empirically, we find that SGD is better for variational regression
optimizer = torch.optim.SGD([
    {'params': model.parameters()},
    {'params': likelihood.parameters()},
], lr=0.01)

# Our loss object. We're using the VariationalELBO
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))

epochs_iter = tqdm.notebook.tqdm(range(num_epochs), desc="Epoch")
for i in epochs_iter:
    # Within each epoch, we will go over each minibatch of data
    minibatch_iter = tqdm.notebook.tqdm(train_loader, desc="Minibatch", leave=False)
    for x_batch, y_batch in minibatch_iter:
        optimizer.zero_grad()
        output = model(x_batch)
        loss = -mll(output, y_batch)
        minibatch_iter.set_postfix(loss=loss.item())
        loss.backward()
        optimizer.step()



###### Making Predictions¶

The next cell gets the predictive covariance for the test set (and also gets the predictive mean, stored in preds.mean). Because the test set is substantially smaller than the training set, we don’t need to make predictions in minibatches here, although this can be done by passing in minibatches of test_x rather than the full tensor.

:

model.eval()
likelihood.eval()
means = torch.tensor([0.])
with torch.no_grad():
    for x_batch, y_batch in test_loader:
        preds = model(x_batch)
        means = torch.cat([means, preds.mean.cpu()])
means = means[1:]

:

print('Test MAE: {}'.format(torch.mean(torch.abs(means - test_y.cpu()))))

Test MAE: 0.11148034781217575




:

import math
import torch
import gpytorch
import tqdm
import matplotlib.pyplot as plt
%matplotlib inline


### Modifying the Variational Strategy/Variational Distribution¶

The predictive distribution for approximate GPs is given by

$p( \mathbf f(\mathbf x^*) ) = \int_{\mathbf u} p( f(\mathbf x^*) \mid \mathbf u) \: q(\mathbf u) \: d\mathbf u, \quad q(\mathbf u) = \mathcal N( \mathbf m, \mathbf S).$

$$\mathbf u$$ represents the function values at the $$m$$ inducing points. Here, $$\mathbf m \in \mathbb R^m$$ and $$\mathbf S \in \mathbb R^{m \times m}$$ are learnable parameters.

If $$m$$ (the number of inducing points) is quite large, the number of learnable parameters in $$\mathbf S$$ can be quite unwieldy. Furthermore, a large $$m$$ might make some of the computations rather slow. Here we show a few ways to use different variational distributions and variational strategies to reduce the number of parameters and speed up these computations.
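As a back-of-the-envelope sketch (plain arithmetic, not GPyTorch calls), here is how the effective number of variational parameters scales with $$m$$ for the three options compared in this notebook: a full Cholesky factor of $$\mathbf S$$, a diagonal (mean-field) $$\mathbf S$$, and no $$\mathbf S$$ at all (delta):

```python
m = 128  # number of inducing points, matching the examples below

# CholeskyVariationalDistribution: mean vector + lower-triangular factor of S
chol_params = m + m * (m + 1) // 2

# MeanFieldVariationalDistribution: mean vector + diagonal of S
mean_field_params = m + m

# DeltaVariationalDistribution: mean vector only (no S)
delta_params = m

print(chol_params, mean_field_params, delta_params)  # 8384 256 128
```

The quadratic-vs-linear gap is why restricting $$\mathbf S$$ can matter so much once $$m$$ grows into the thousands.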

#### Experimental setup¶

We’re going to train an approximate GP on a medium-sized regression dataset, taken from the UCI repository.

:

import urllib.request
import os
from scipy.io import loadmat
from math import floor

# this is for running the notebook in our testing framework
smoke_test = ('CI' in os.environ)

if not smoke_test and not os.path.isfile('../elevators.mat'):
    print('Downloading \'elevators\' UCI dataset...')
    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1jhWL3YUHvXIaftia4qeAyDwVxo6j1alk', '../elevators.mat')

if smoke_test:  # this is for running the notebook in our testing framework
    X, y = torch.randn(1000, 3), torch.randn(1000)
else:
    data = torch.Tensor(loadmat('../elevators.mat')['data'])
    X = data[:, :-1]
    X = X - X.min(0)[0]
    X = 2 * (X / X.max(0)[0]) - 1
    y = data[:, -1]

train_n = int(floor(0.8 * len(X)))
train_x = X[:train_n, :].contiguous()
train_y = y[:train_n].contiguous()

test_x = X[train_n:, :].contiguous()
test_y = y[train_n:].contiguous()

if torch.cuda.is_available():
    train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()

:

from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(train_x, train_y)
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

test_dataset = TensorDataset(test_x, test_y)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)


#### Some quick training/testing code¶

This will allow us to train/test different model classes.

:

# this is for running the notebook in our testing framework
num_epochs = 1 if smoke_test else 10

# Our testing script takes in a GPyTorch MLL (objective function) class
# and then trains/tests an approximate GP with it on the supplied dataset

def train_and_test_approximate_gp(model_cls):
    inducing_points = torch.randn(128, train_x.size(-1), dtype=train_x.dtype, device=train_x.device)
    model = model_cls(inducing_points)
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.numel())
    optimizer = torch.optim.Adam(list(model.parameters()) + list(likelihood.parameters()), lr=0.1)

    if torch.cuda.is_available():
        model = model.cuda()
        likelihood = likelihood.cuda()

    # Training
    model.train()
    likelihood.train()
    epochs_iter = tqdm.notebook.tqdm(range(num_epochs), desc=f"Training {model_cls.__name__}")
    for i in epochs_iter:
        # Within each epoch, we will go over each minibatch of data
        for x_batch, y_batch in train_loader:
            optimizer.zero_grad()
            output = model(x_batch)
            loss = -mll(output, y_batch)
            epochs_iter.set_postfix(loss=loss.item())
            loss.backward()
            optimizer.step()

    # Testing
    model.eval()
    likelihood.eval()
    means = torch.tensor([0.])
    with torch.no_grad():
        for x_batch, y_batch in test_loader:
            preds = model(x_batch)
            means = torch.cat([means, preds.mean.cpu()])
    means = means[1:]
    error = torch.mean(torch.abs(means - test_y.cpu()))
    print(f"Test {model_cls.__name__} MAE: {error.item()}")

##### The Standard Approach¶

As a default, we’ll use the default VariationalStrategy class with a CholeskyVariationalDistribution. The CholeskyVariationalDistribution class allows $$\mathbf S$$ to be any positive semidefinite matrix. This is the most general/expressive option for approximate GPs.

:

class StandardApproximateGP(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(inducing_points.size(-2))
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

:

train_and_test_approximate_gp(StandardApproximateGP)


Test StandardApproximateGP MAE: 0.10098349303007126


#### MeanFieldVariationalDistribution: a diagonal $$\mathbf S$$ matrix¶

One way to reduce the number of parameters is to restrict $$\mathbf S$$ to be diagonal. This is less expressive, but the number of parameters is now linear in $$m$$ instead of quadratic.

All we have to do is take the previous example, and change CholeskyVariationalDistribution (full $$\mathbf S$$ matrix) to MeanFieldVariationalDistribution (diagonal $$\mathbf S$$ matrix).

:

class MeanFieldApproximateGP(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.MeanFieldVariationalDistribution(inducing_points.size(-2))
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

:

train_and_test_approximate_gp(MeanFieldApproximateGP)


Test MeanFieldApproximateGP MAE: 0.07848489284515381


#### DeltaVariationalDistribution: no $$\mathbf S$$ matrix¶

A more extreme method of reducing parameters is to get rid of $$\mathbf S$$ entirely. This corresponds to learning a delta distribution ($$\mathbf u = \mathbf m$$) rather than a multivariate Normal distribution for $$\mathbf u$$. In other words, this corresponds to performing MAP estimation rather than variational inference.

In GPyTorch, getting rid of $$\mathbf S$$ can be accomplished by using a DeltaVariationalDistribution.

:

class MAPApproximateGP(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.DeltaVariationalDistribution(inducing_points.size(-2))
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

:

train_and_test_approximate_gp(MAPApproximateGP)


Test MAPApproximateGP MAE: 0.08846496045589447

##### Reducing computation (through decoupled inducing points)¶

One way to reduce the computational complexity is to use separate inducing points for the mean and covariance computations. The Orthogonally Decoupled Variational Gaussian Processes method of Salimbeni et al. (2018) uses more inducing points for the (computationally easy) mean computations and fewer inducing points for the (computationally intensive) covariance computations.

In GPyTorch we implement this method in a modular way. The OrthogonallyDecoupledVariationalStrategy defines the variational strategy for the mean inducing points. It wraps an existing variational strategy/distribution that defines the covariance inducing points:

:

def make_orthogonal_vs(model, train_x):
    mean_inducing_points = torch.randn(1000, train_x.size(-1), dtype=train_x.dtype, device=train_x.device)
    covar_inducing_points = torch.randn(100, train_x.size(-1), dtype=train_x.dtype, device=train_x.device)

    covar_variational_strategy = gpytorch.variational.VariationalStrategy(
        model, covar_inducing_points,
        gpytorch.variational.CholeskyVariationalDistribution(covar_inducing_points.size(-2)),
        learn_inducing_locations=True
    )

    variational_strategy = gpytorch.variational.OrthogonallyDecoupledVariationalStrategy(
        covar_variational_strategy, mean_inducing_points,
        gpytorch.variational.DeltaVariationalDistribution(mean_inducing_points.size(-2)),
    )
    return variational_strategy


Putting it all together we have:

:

class OrthDecoupledApproximateGP(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_strategy = make_orthogonal_vs(self, train_x)
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

:

train_and_test_approximate_gp(OrthDecoupledApproximateGP)


Test OrthDecoupledApproximateGP MAE: 0.08162340521812439

:

import math
import torch
import gpytorch
import matplotlib.pyplot as plt
%matplotlib inline


### Different objective functions for Approximate GPs¶

(N.B. this tutorial assumes that you are familiar with inducing point methods. For an introduction to these methods, please see Quinonero-Candela and Rasmussen, 2005.)

#### Overview¶

Approximate Gaussian processes learn an approximate posterior distribution

$p(\mathbf f(X) \mid \mathbf y) \approx q(\mathbf f) = \int p(\mathbf f \mid \mathbf u) \: q(\mathbf u) \: d \mathbf u$

where $$\mathbf u$$ are the function values at a set of $$m$$ inducing points. Typically, $$q(\mathbf u)$$ is chosen to be a multivariate Gaussian, i.e. $$q(\mathbf u) = \mathcal N (\mathbf m, \mathbf S)$$.

We choose the approximate posterior $$q(\mathbf f) = \int p(\mathbf f \mid \mathbf u) \: q(\mathbf u) \: d \mathbf u$$ by optimizing the parameters of $$q(\mathbf u)$$ (i.e. $$\mathbf m$$ and $$\mathbf S$$). There are several objectives that we can use for optimization. We’ll test out the following two:

1. The Variational ELBO (see Hensman et al., 2015)
2. The Predictive Log Likelihood (see Jankowiak et al., 2020).
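The integral defining $$q(\mathbf f)$$ has a closed form when everything is Gaussian: pushing $$q(\mathbf u) = \mathcal N(\mathbf m, \mathbf S)$$ through a linear-Gaussian conditional yields another Gaussian. A tiny 1-D Monte Carlo check (all the numbers below are made up for illustration):

```python
import torch

torch.manual_seed(0)
# A hypothetical 1-D stand-in for q(f) = ∫ p(f | u) q(u) du:
# q(u) = N(mu, s2), p(f | u) = N(a*u, c2)  =>  q(f) = N(a*mu, a^2*s2 + c2)
mu, s2 = 1.5, 0.4
a, c2 = 2.0, 0.1

u = torch.randn(500_000) * s2 ** 0.5 + mu      # samples from q(u)
f = a * u + torch.randn(500_000) * c2 ** 0.5   # samples from p(f | u)

print(float(f.mean()), a * mu)                 # both close to 3.0
print(float(f.var()), a ** 2 * s2 + c2)        # both close to 1.7
```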
##### Experimental setup¶

We’re going to train an approximate GP on a 1D regression dataset with heteroskedastic noise.

:

# Define some training data
train_x = torch.linspace(0, 1, 100)
train_y = torch.cos(train_x * 2 * math.pi) + torch.randn(100).mul(train_x.pow(3) * 1.)

fig, ax = plt.subplots(1, 1, figsize=(5, 3))
ax.scatter(train_x, train_y, c='k', marker='.', label="Data")
ax.set(xlabel="x", ylabel="y")

:

[Text(0, 0.5, 'y'), Text(0.5, 0, 'x')]


Here’s a simple approximate GP model, and a script to optimize/test the model with different objective functions:

:

class ApproximateGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(inducing_points.size(-1))
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iterations = 2 if smoke_test else 50

# Our testing script takes in a GPyTorch MLL (objective function) class
# and then trains/tests an approximate GP with it on the supplied dataset

def train_and_test_approximate_gp(objective_function_cls):
    model = ApproximateGPModel(torch.linspace(0, 1, 100))
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    objective_function = objective_function_cls(likelihood, model, num_data=train_y.numel())
    optimizer = torch.optim.Adam(list(model.parameters()) + list(likelihood.parameters()), lr=0.1)

    # Train
    model.train()
    likelihood.train()
    for _ in range(training_iterations):
        optimizer.zero_grad()
        output = model(train_x)
        loss = -objective_function(output, train_y)
        loss.backward()
        optimizer.step()

    # Test
    model.eval()
    likelihood.eval()
    with torch.no_grad():
        f_dist = model(train_x)
        mean = f_dist.mean
        f_lower, f_upper = f_dist.confidence_region()
        y_dist = likelihood(f_dist)
        y_lower, y_upper = y_dist.confidence_region()

    # Plot model
    fig, ax = plt.subplots(1, 1, figsize=(5, 3))
    line, = ax.plot(train_x, mean, "blue")
    ax.fill_between(train_x, f_lower, f_upper, color=line.get_color(), alpha=0.3, label="q(f)")
    ax.fill_between(train_x, y_lower, y_upper, color=line.get_color(), alpha=0.1, label="p(y)")
    ax.scatter(train_x, train_y, c='k', marker='.', label="Data")
    ax.legend(loc="best")
    ax.set(xlabel="x", ylabel="y")


#### Objective Function 1) The Variational ELBO¶

The variational evidence lower bound - or ELBO - is the most common objective function. It can be derived as a lower bound on the log marginal likelihood $$\log p(\mathbf y \! \mid \! \mathbf X)$$:

$\mathcal{L}_\text{ELBO} = \sum_{i=1}^N \mathbb{E}_{q( \mathbf u)} \left[ \mathbb{E}_{p( f_i \mid \mathbf u)} \left[ \log p( y_i \! \mid \! f_i) \right] \right] - \beta \: \text{KL} \left[ q( \mathbf u) \Vert p( \mathbf u) \right]$

where $$N$$ is the number of datapoints and $$p(\mathbf u)$$ is the prior distribution for the inducing function values. For more information on this derivation, see Hensman et al., 2015.
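The KL term above has a closed form for Gaussians. As an illustration (not GPyTorch's internal code path), here is the KL between $$\mathcal N(\mathbf m, \mathbf S)$$ and a standard normal prior, checked against torch's built-in:

```python
import torch
from torch.distributions import MultivariateNormal, kl_divergence

torch.manual_seed(0)
# KL[ N(m, S) || N(0, I) ] = 0.5 * ( tr(S) + m^T m - d - log det S )
d = 4
m = torch.randn(d)
L = torch.randn(d, d).tril()
L.diagonal().abs_().add_(0.5)   # ensure a valid Cholesky factor
S = L @ L.T

q = MultivariateNormal(m, S)
p = MultivariateNormal(torch.zeros(d), torch.eye(d))

kl_closed = 0.5 * (S.trace() + m @ m - d - torch.logdet(S))
print(float(kl_closed), float(kl_divergence(q, p)))  # identical up to fp error
```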

##### How to use the variational ELBO in GPyTorch¶

In GPyTorch, this objective function is available as gpytorch.mlls.VariationalELBO.

:

train_and_test_approximate_gp(gpytorch.mlls.VariationalELBO)


#### Objective Function 2) The Predictive Log Likelihood¶

The predictive log likelihood is an alternative to the variational ELBO that was proposed in Jankowiak et al., 2020. It typically produces better predictive variances than the gpytorch.mlls.VariationalELBO objective.

\begin{align} \mathcal{L}_\text{PLL} &= \mathbb{E}_{p_\text{data}( y, \mathbf x )} \left[ \log p( y \! \mid \! \mathbf x) \right] - \beta \: \text{KL} \left[ q( \mathbf u) \Vert p( \mathbf u) \right] \\ &\approx \sum_{i=1}^N \log \mathbb{E}_{q(\mathbf u)} \left[ \int p( y_i \! \mid \! f_i) p(f_i \! \mid \! \mathbf u) \: d f_i \right] - \beta \: \text{KL} \left[ q( \mathbf u) \Vert p( \mathbf u) \right] \end{align}

Note that this objective is very similar to the variational ELBO. The only difference is that the $$\log$$ occurs outside the expectation $$\mathbb E_{q(\mathbf u)}$$. This difference results in very different predictive performance.
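The swap of $$\log$$ and $$\mathbb E_{q(\mathbf u)}$$ is exactly the gap in Jensen's inequality, which we can see numerically (a toy stand-in for $$p(y_i \mid f_i)$$, not the actual GP likelihood):

```python
import torch

torch.manual_seed(0)
# ELBO-style term:  E_q[log p]    vs    PLL-style term:  log E_q[p].
# By Jensen's inequality, E[log X] <= log E[X] for a positive random variable X.
x = torch.rand(100_000) + 0.5   # positive samples standing in for p(y_i | f_i)
e_log = x.log().mean()          # "log inside" (ELBO-like)
log_e = x.mean().log()          # "log outside" (PLL-like)
print(float(e_log), float(log_e))
```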

##### How to use the predictive log likelihood in GPyTorch¶

In GPyTorch, this objective function is available as gpytorch.mlls.PredictiveLogLikelihood.

:

train_and_test_approximate_gp(gpytorch.mlls.PredictiveLogLikelihood)


### Non-Gaussian Likelihoods¶

#### Introduction¶

This example is the simplest form of using an RBF kernel in an ApproximateGP module for classification. This basic model is usable when there is not much training data and no advanced techniques are required.

In this example, we’re modeling a square wave with period 1/2 that takes positive values at x=0. We are going to classify the points as either 1 or 0.

Variational inference uses the assumption that the posterior distribution factors multiplicatively over the input variables. This makes approximating the distribution via the KL divergence possible to obtain a fast approximation to the posterior. For a good explanation of variational techniques, sections 4-6 of the following may be useful: https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf

:

import math
import torch
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline

##### Set up training data¶

In the next cell, we set up the training data for this example. We’ll be using 10 regularly spaced points on [0,1] at which we evaluate the function. The labels are a square wave with period 1/2 that is positive at x=0, mapped to {0, 1}.

:

train_x = torch.linspace(0, 1, 10)
train_y = torch.sign(torch.cos(train_x * (4 * math.pi))).add(1).div(2)


#### Setting up the classification model¶

The next cell demonstrates the simplest way to define a classification Gaussian process model in GPyTorch. If you have already done the GP regression tutorial, you have already seen how GPyTorch model construction differs from other GP packages. In particular, the GP model expects a user to write out a forward method in a way analogous to PyTorch models. This gives the user the most possible flexibility.

Since exact inference is intractable for GP classification, GPyTorch approximates the classification posterior using variational inference. We believe that variational inference is ideal for a number of reasons. Firstly, variational inference commonly relies on gradient descent techniques, which take full advantage of PyTorch’s autograd. This reduces the amount of code needed to develop complex variational models. Additionally, variational inference can be performed with stochastic gradient descent, which can be extremely scalable for large datasets.

If you are unfamiliar with variational inference, we recommend the following resources: - Variational Inference: A Review for Statisticians by David M. Blei, Alp Kucukelbir, Jon D. McAuliffe. - Scalable Variational Gaussian Process Classification by James Hensman, Alex Matthews, Zoubin Ghahramani.

In this example, we’re using an UnwhitenedVariationalStrategy because we are using the training data as inducing points. In general, you’ll probably want to use the standard VariationalStrategy class for improved optimization.

:

from gpytorch.models import ApproximateGP
from gpytorch.variational import CholeskyVariationalDistribution
from gpytorch.variational import UnwhitenedVariationalStrategy

class GPClassificationModel(ApproximateGP):
    def __init__(self, train_x):
        variational_distribution = CholeskyVariationalDistribution(train_x.size(0))
        variational_strategy = UnwhitenedVariationalStrategy(
            self, train_x, variational_distribution, learn_inducing_locations=False
        )
        super(GPClassificationModel, self).__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        latent_pred = gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
        return latent_pred

# Initialize model and likelihood
model = GPClassificationModel(train_x)
likelihood = gpytorch.likelihoods.BernoulliLikelihood()

##### Model modes¶

Like most PyTorch modules, the ApproximateGP has a .train() and .eval() mode.

- .train() mode is for optimizing variational parameters and model hyperparameters.
- .eval() mode is for computing predictions through the model posterior.

#### Learn the variational parameters (and other hyperparameters)¶

In the next cell, we optimize the variational parameters of our Gaussian process. In addition, this optimization loop also performs Type-II MLE to train the hyperparameters of the Gaussian process.

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iterations = 2 if smoke_test else 50

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the Adam optimizer over the variational parameters and hyperparameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

# "Loss" for GPs - the marginal log likelihood
# num_data refers to the number of training datapoints
mll = gpytorch.mlls.VariationalELBO(likelihood, model, train_y.numel())

for i in range(training_iterations):
    # Zero backpropped gradients from previous iteration
    optimizer.zero_grad()
    # Get predictive output
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
    optimizer.step()

Iter 1/50 - Loss: 0.908
Iter 2/50 - Loss: 4.296
Iter 3/50 - Loss: 8.896
Iter 4/50 - Loss: 3.563
Iter 5/50 - Loss: 5.978
Iter 6/50 - Loss: 6.632
Iter 7/50 - Loss: 6.222
Iter 8/50 - Loss: 4.975
Iter 9/50 - Loss: 3.972
Iter 10/50 - Loss: 3.593
Iter 11/50 - Loss: 3.329
Iter 12/50 - Loss: 2.798
Iter 13/50 - Loss: 2.329
Iter 14/50 - Loss: 2.140
Iter 15/50 - Loss: 1.879
Iter 16/50 - Loss: 1.661
Iter 17/50 - Loss: 1.533
Iter 18/50 - Loss: 1.510
Iter 19/50 - Loss: 1.514
Iter 20/50 - Loss: 1.504
Iter 21/50 - Loss: 1.499
Iter 22/50 - Loss: 1.500
Iter 23/50 - Loss: 1.499
Iter 24/50 - Loss: 1.492
Iter 25/50 - Loss: 1.477
Iter 26/50 - Loss: 1.456
Iter 27/50 - Loss: 1.429
Iter 28/50 - Loss: 1.397
Iter 29/50 - Loss: 1.363
Iter 30/50 - Loss: 1.327
Iter 31/50 - Loss: 1.290
Iter 32/50 - Loss: 1.255
Iter 33/50 - Loss: 1.222
Iter 34/50 - Loss: 1.194
Iter 35/50 - Loss: 1.170
Iter 36/50 - Loss: 1.150
Iter 37/50 - Loss: 1.133
Iter 38/50 - Loss: 1.117
Iter 39/50 - Loss: 1.099
Iter 40/50 - Loss: 1.079
Iter 41/50 - Loss: 1.056
Iter 42/50 - Loss: 1.033
Iter 43/50 - Loss: 1.011
Iter 44/50 - Loss: 0.991
Iter 45/50 - Loss: 0.974
Iter 46/50 - Loss: 0.959
Iter 47/50 - Loss: 0.945
Iter 48/50 - Loss: 0.932
Iter 49/50 - Loss: 0.919
Iter 50/50 - Loss: 0.906


#### Make predictions with the model¶

In the next cell, we make predictions with the model. To do this, we simply put the model and likelihood in eval mode, and call both modules on the test data.

In .eval() mode, when we call model() we get the GP’s latent posterior predictions. These will be MultivariateNormal distributions. But since we are performing binary classification, we want to transform these outputs to classification probabilities using our likelihood.

When we call likelihood(model()), we get a torch.distributions.Bernoulli distribution, which represents our posterior probability that the data points belong to the positive class.

f_preds = model(test_x)
y_preds = likelihood(model(test_x))

f_mean = f_preds.mean
f_samples = f_preds.sample(sample_shape=torch.Size((1000,)))
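For intuition on what the likelihood does here: with a probit link (as in GPyTorch's BernoulliLikelihood), the predictive class probability for a latent $$f \sim \mathcal N(\mu, \sigma^2)$$ has a closed form, $$\mathbb E[\Phi(f)] = \Phi(\mu / \sqrt{1 + \sigma^2})$$, which we can sanity-check by Monte Carlo (the numbers below are arbitrary):

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
mu, s2 = 0.7, 0.5
std_normal = Normal(0.0, 1.0)

# Closed-form predictive probability vs. a Monte Carlo estimate of E[Phi(f)]
closed_form = std_normal.cdf(torch.tensor(mu / (1 + s2) ** 0.5))
mc = std_normal.cdf(torch.randn(500_000) * s2 ** 0.5 + mu).mean()
print(float(closed_form), float(mc))   # should agree closely
```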

:

# Go into eval mode
model.eval()
likelihood.eval()

with torch.no_grad():
    # Test points are 101 regularly spaced points on [0,1]
    test_x = torch.linspace(0, 1, 101)
    # Get classification predictions
    observed_pred = likelihood(model(test_x))

    # Initialize fig and axes for plot
    f, ax = plt.subplots(1, 1, figsize=(4, 3))
    ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
    # Get the predicted labels (probabilities of belonging to the positive class)
    # Transform these probabilities to be 0/1 labels
    pred_labels = observed_pred.mean.ge(0.5).float()
    ax.plot(test_x.numpy(), pred_labels.numpy(), 'b')
    ax.set_ylim([-1, 2])
    ax.legend(['Observed Data', 'Mean'])




### Variational GPs w/ Multiple Outputs¶

#### Introduction¶

In this example, we will demonstrate how to construct approximate/variational GPs that can model vector-valued functions (e.g. multitask/multi-output GPs).

:

import math
import torch
import gpytorch
import tqdm
from matplotlib import pyplot as plt

%matplotlib inline

##### Set up training data¶

In the next cell, we set up the training data for this example. We’ll be using 100 regularly spaced points on [0,1] which we evaluate the function on and add Gaussian noise to get the training labels.

We’ll have four functions - all of which are some sort of sinusoid. Our train_targets will actually have two dimensions: with the second dimension corresponding to the different tasks.

:

train_x = torch.linspace(0, 1, 100)

train_y = torch.stack([
torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2,
torch.cos(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2,
torch.sin(train_x * (2 * math.pi)) + 2 * torch.cos(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2,
-torch.cos(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * 0.2,
], -1)

print(train_x.shape, train_y.shape)

torch.Size([100]) torch.Size([100, 4])


We are going to construct a batch variational GP - using a CholeskyVariationalDistribution and a VariationalStrategy. Each of the batch dimensions is going to correspond to one of the outputs. In addition, we will wrap the variational strategy to make the output appear as a MultitaskMultivariateNormal distribution. Here are the changes that we’ll need to make:

1. Our inducing points will need to have shape num_latents x m x 1 (where m is the number of inducing points). This ensures that we learn a different set of inducing points for each latent function.
2. The CholeskyVariationalDistribution, mean module, and covariance modules will all need to include a batch_shape=torch.Size([num_latents]) argument. This ensures that we learn a different set of variational parameters and hyperparameters for each latent function.
3. The VariationalStrategy object should be wrapped by a variational strategy that handles multitask models. We describe them below:
##### Types of Variational Multitask Models¶

The most general purpose multitask model is the Linear Model of Coregionalization (LMC), which assumes that each output dimension (task) is the linear combination of some latent functions $$\mathbf g(\cdot) = [g^{(1)}(\cdot), \ldots, g^{(Q)}(\cdot)]$$:

$f_\text{task}(\mathbf x) = \sum_{i=1}^Q a^{(i)} g^{(i)}(\mathbf x),$

where $$a^{(i)}$$ are learnable parameters.
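The mixing equation above, in miniature (shapes only; GPyTorch's LMCVariationalStrategy learns and applies these weights for you, and the latent functions below are arbitrary sinusoids):

```python
import torch

torch.manual_seed(0)
# T task functions as linear mixes of Q latent functions, evaluated at N points
Q, T, N = 3, 4, 100
x = torch.linspace(0, 1, N)
g = torch.stack([torch.sin((q + 1) * 2 * torch.pi * x) for q in range(Q)])  # Q x N latents
A = torch.randn(T, Q)   # mixing weights a^(i), one row per task
f = A @ g               # T x N task functions
print(f.shape)
```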

:

num_latents = 3
num_tasks = 4

class MultitaskGPModel(gpytorch.models.ApproximateGP):
    def __init__(self):
        # Let's use a different set of inducing points for each latent function
        inducing_points = torch.rand(num_latents, 16, 1)

        # We have to mark the CholeskyVariationalDistribution as batch
        # so that we learn a variational distribution for each latent function
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(-2), batch_shape=torch.Size([num_latents])
        )

        # We have to wrap the VariationalStrategy in a LMCVariationalStrategy
        # so that the output will be a MultitaskMultivariateNormal rather than a batch output
        variational_strategy = gpytorch.variational.LMCVariationalStrategy(
            gpytorch.variational.VariationalStrategy(
                self, inducing_points, variational_distribution, learn_inducing_locations=True
            ),
            num_tasks=num_tasks,
            num_latents=num_latents,
            latent_dim=-1
        )

        super().__init__(variational_strategy)

        # The mean and covariance modules should be marked as batch
        # so we learn a different set of hyperparameters
        self.mean_module = gpytorch.means.ConstantMean(batch_shape=torch.Size([num_latents]))
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(batch_shape=torch.Size([num_latents])),
            batch_shape=torch.Size([num_latents])
        )

    def forward(self, x):
        # The forward function should be written as if we were dealing with each output
        # dimension in batch
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

model = MultitaskGPModel()
likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=num_tasks)



With all of the batch_shape arguments, it may look like we’re learning a batch of GPs. However, LMCVariationalStrategy objects convert this batch dimension into a (non-batch) MultitaskMultivariateNormal.

:

likelihood(model(train_x)).rsample().shape

:

torch.Size([100, 4])


The LMC model allows there to be linear dependencies between outputs/tasks. Alternatively, if we want independent output dimensions, we can replace LMCVariationalStrategy with IndependentMultitaskVariationalStrategy:

:

num_tasks = 4

class IndependentMultitaskGPModel(gpytorch.models.ApproximateGP):
    def __init__(self):
        # Let's use a different set of inducing points for each task
        inducing_points = torch.rand(num_tasks, 16, 1)

        # We have to mark the CholeskyVariationalDistribution as batch
        # so that we learn a variational distribution for each task
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(-2), batch_shape=torch.Size([num_tasks])
        )

        variational_strategy = gpytorch.variational.IndependentMultitaskVariationalStrategy(
            gpytorch.variational.VariationalStrategy(
                self, inducing_points, variational_distribution, learn_inducing_locations=True
            ),
            num_tasks=num_tasks,
        )

        super().__init__(variational_strategy)

        # The mean and covariance modules should be marked as batch
        # so we learn a different set of hyperparameters
        self.mean_module = gpytorch.means.ConstantMean(batch_shape=torch.Size([num_tasks]))
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(batch_shape=torch.Size([num_tasks])),
            batch_shape=torch.Size([num_tasks])
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


Note that all the batch sizes for IndependentMultitaskVariationalStrategy are now num_tasks rather than num_latents.

##### Train the model¶

This code should look similar to the SVGP training code

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
num_epochs = 1 if smoke_test else 500

model.train()
likelihood.train()

# We use SGD here, rather than Adam. Empirically, we find that SGD is better for variational regression
optimizer = torch.optim.SGD([
    {'params': model.parameters()},
    {'params': likelihood.parameters()},
], lr=0.1)

# Our loss object. We're using the VariationalELBO, which essentially just computes the ELBO
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))

epochs_iter = tqdm.tqdm_notebook(range(num_epochs), desc="Epoch")
for i in epochs_iter:
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    epochs_iter.set_postfix(loss=loss.item())
    loss.backward()
    optimizer.step()



##### Make predictions with the model¶
:

# Set into eval mode
model.eval()
likelihood.eval()

# Initialize plots - one panel per task
fig, axs = plt.subplots(1, 4, figsize=(16, 3))

# Make predictions
with torch.no_grad():
    test_x = torch.linspace(0, 1, 51)
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()

for task, ax in enumerate(axs):
    # Plot training data as black stars
    ax.plot(train_x.numpy(), train_y[:, task].numpy(), 'k*')
    # Predictive mean as blue line
    ax.plot(test_x.numpy(), mean[:, task].numpy(), 'b')
    # Shade in confidence region
    ax.fill_between(test_x.numpy(), lower[:, task].numpy(), upper[:, task].numpy(), alpha=0.5)
    ax.set_ylim([-3, 3])
    ax.legend(['Observed Data', 'Mean', 'Confidence'])

fig.tight_layout()
### GP Regression with Uncertain Inputs¶

#### Introduction¶

In this notebook, we’re going to demonstrate one way of dealing with uncertainty in our training data. Let’s say that we’re collecting training data that models the following function.

\begin{align} y &= \sin(2\pi x) + \epsilon \\ \epsilon &\sim \mathcal{N}(0, 0.2) \end{align}

However, now assume that we’re a bit uncertain about our features. In particular, we’re going to assume that every x_i value is not a point but a distribution instead. E.g.

$x_i \sim \mathcal{N}(\mu_i, \sigma_i).$

#### Using stochastic variational inference to deal with uncertain inputs¶

To deal with this uncertainty, we’ll use variational inference (VI) in conjunction with stochastic optimization. At every optimization iteration, we’ll draw a sample x_i from the input distribution. The objective function (ELBO) that we compute will be an unbiased estimate of the true ELBO, and so a stochastic optimizer like SGD or Adam should converge to the true ELBO (or at least a local minimum of it).
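Why sampling gives a valid objective: averaging the sampled objective over many draws recovers the expectation the stochastic optimizer targets. A toy check against a closed-form expectation (the function below is an arbitrary stand-in, not the ELBO):

```python
import torch

torch.manual_seed(0)
# Each training step evaluates the objective at one sampled x; over many
# draws the average matches the exact expectation.
h = lambda x: torch.sin(2 * torch.pi * x)

mu, sigma = torch.tensor(0.3), torch.tensor(0.05)
samples = torch.distributions.Normal(mu, sigma).rsample((200_000,))

mc_estimate = h(samples).mean()
# Closed form: E[sin(a X)] = sin(a mu) * exp(-a^2 sigma^2 / 2) for X ~ N(mu, sigma^2)
closed_form = torch.sin(2 * torch.pi * mu) * torch.exp(-(2 * torch.pi * sigma) ** 2 / 2)
print(float(mc_estimate), float(closed_form))  # should agree closely
```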

:

import math
import torch
import tqdm
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline


##### Set up training data¶

In the next cell, we set up the training data for this example. We’ll be using 20 regularly spaced points on [0,1]. We’ll represent each of the training points $$x_i$$ by its mean $$\mu_i$$ and standard deviation $$\sigma_i$$.

:

# Training data is 20 points in [0,1] inclusive regularly spaced
train_x_mean = torch.linspace(0, 1, 20)
# We'll assume the variance shrinks the closer we get to 1
train_x_stdv = torch.linspace(0.03, 0.01, 20)

# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x_mean * (2 * math.pi)) + torch.randn(train_x_mean.size()) * 0.2

:

f, ax = plt.subplots(1, 1, figsize=(8, 3))
ax.errorbar(train_x_mean, train_y, xerr=(train_x_stdv * 2), fmt="k*", label="Train Data")
ax.legend()

:

<matplotlib.legend.Legend at 0x12099f470>


#### Setting up the model¶

Since we’re performing VI to deal with the feature uncertainty, we’ll be using a gpytorch.models.ApproximateGP. Similar to the SVGP example, we’ll use a VariationalStrategy and a CholeskyVariationalDistribution to define our posterior approximation.

:

from gpytorch.models import ApproximateGP
from gpytorch.variational import CholeskyVariationalDistribution
from gpytorch.variational import VariationalStrategy

class GPModel(ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = CholeskyVariationalDistribution(inducing_points.size(0))
        variational_strategy = VariationalStrategy(self, inducing_points, variational_distribution, learn_inducing_locations=True)
        super(GPModel, self).__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

inducing_points = torch.randn(10, 1)
model = GPModel(inducing_points=inducing_points)
likelihood = gpytorch.likelihoods.GaussianLikelihood()


#### Training the model with uncertain features¶

The training iteration should look pretty similar to the SVGP example – where we optimize the variational parameters and model hyperparameters. The key difference is that, at every iteration, we will draw samples from our features distribution (since we don’t have point measurements of our features).

# Inside the training iteration...
train_x_sample = torch.distributions.Normal(train_x_mean, train_x_stdv).rsample()
# Rest of training iteration...

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
training_iter = 2 if smoke_test else 400

model.train()
likelihood.train()

# We use SGD here, rather than Adam. Empirically, we find that SGD is better for variational regression
optimizer = torch.optim.SGD([
    {'params': model.parameters()},
    {'params': likelihood.parameters()},
], lr=0.01)

# Our loss object. We're using the VariationalELBO
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))

iterator = tqdm.notebook.tqdm(range(training_iter))
for i in iterator:
    # First thing: draw a sample set of features from our distribution
    train_x_sample = torch.distributions.Normal(train_x_mean, train_x_stdv).rsample()

    # Now do the rest of the training loop
    optimizer.zero_grad()
    output = model(train_x_sample)
    loss = -mll(output, train_y)
    iterator.set_postfix(loss=loss.item())
    loss.backward()
    optimizer.step()



:

# Get into evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

# Test points are regularly spaced along [0,1]
# Make predictions by feeding model through likelihood
test_x = torch.linspace(0, 1, 51)
observed_pred = likelihood(model(test_x))

# Initialize plot
f, ax = plt.subplots(1, 1, figsize=(8, 3))

# Get upper and lower confidence bounds
lower, upper = observed_pred.confidence_region()
# Plot training data as black stars
ax.errorbar(train_x_mean.numpy(), train_y.numpy(), xerr=train_x_stdv, fmt='k*')
# Plot predictive means as blue line
ax.plot(test_x.numpy(), observed_pred.mean.numpy(), 'b')
# Shade between the lower and upper confidence bounds
ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
ax.set_ylim([-3, 3])
ax.legend(['Observed Data', 'Mean', 'Confidence'])


This is a toy example, but it can be useful in practice for more complex datasets where features are more likely to be missing.

## Deep Gaussian Processes¶

### Introduction¶

In this notebook, we provide a GPyTorch implementation of deep Gaussian processes, where training and inference is performed using the method of Salimbeni et al., 2017 (https://arxiv.org/abs/1705.08933) adapted to CG-based inference.

We’ll be training a simple two layer deep GP on the elevators UCI dataset.

:

%set_env CUDA_VISIBLE_DEVICES=0

import torch
import tqdm
import gpytorch
from torch.nn import Linear
from gpytorch.means import ConstantMean, LinearMean
from gpytorch.kernels import RBFKernel, ScaleKernel
from gpytorch.variational import VariationalStrategy, CholeskyVariationalDistribution
from gpytorch.distributions import MultivariateNormal
from gpytorch.models import ApproximateGP, GP
from gpytorch.likelihoods import GaussianLikelihood


env: CUDA_VISIBLE_DEVICES=0

:

from gpytorch.models.deep_gps import DeepGPLayer, DeepGP
from gpytorch.mlls import DeepApproximateMLL


For this example notebook, we’ll be using the elevators UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. For this notebook, we’ll simply be splitting the data using the first 80% of the data as training and the last 20% as testing.

Note: Running the next cell will attempt to download a ~400 KB dataset file to the current directory.

:

import urllib.request
import os
from math import floor
from scipy.io import loadmat

# this is for running the notebook in our testing framework
smoke_test = ('CI' in os.environ)

if not smoke_test and not os.path.isfile('../elevators.mat'):
    # Download a preprocessed copy of the 'elevators' dataset to '../elevators.mat'
    # here (the urllib.request.urlretrieve call and its URL are elided in this copy)
    pass

if smoke_test:  # this is for running the notebook in our testing framework
    X, y = torch.randn(1000, 3), torch.randn(1000)
else:
    data = torch.Tensor(loadmat('../elevators.mat')['data'])
    X = data[:, :-1]
    X = X - X.min(0)[0]
    X = 2 * (X / X.max(0)[0]) - 1
    y = data[:, -1]

train_n = int(floor(0.8 * len(X)))
train_x = X[:train_n, :].contiguous()
train_y = y[:train_n].contiguous()

test_x = X[train_n:, :].contiguous()
test_y = y[train_n:].contiguous()

if torch.cuda.is_available():
    train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()

:

from torch.utils.data import TensorDataset, DataLoader
train_dataset = TensorDataset(train_x, train_y)
# Train with minibatches of 1024 points
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

### Defining GP layers¶

In GPyTorch, defining a GP involves extending one of our abstract GP models and defining a forward method that returns the prior. For deep GPs, things are similar, but there are two abstract GP models that must be overwritten: one for hidden layers and one for the deep GP model itself.

In the next cell, we define an example deep GP hidden layer. This looks very similar to every other variational GP you might define. However, there are a few key differences:

1. Instead of extending ApproximateGP, we extend DeepGPLayer.
2. DeepGPLayers need a number of input dimensions, a number of output dimensions, and a number of samples. This is kind of like a linear layer in a standard neural network – input_dims defines how many inputs this hidden layer will expect, and output_dims defines how many hidden GPs to create outputs for.

In this particular example, we make a particularly fancy DeepGPLayer that has “skip connections” with previous layers, similar to a ResNet.

:

import tqdm
from torch.nn import Linear
from gpytorch.means import ConstantMean, LinearMean
from gpytorch.kernels import RBFKernel, ScaleKernel
from gpytorch.variational import VariationalStrategy, CholeskyVariationalDistribution
from gpytorch.distributions import MultivariateNormal
from gpytorch.models.deep_gps import DeepGPLayer, DeepGP
from gpytorch.mlls import VariationalELBO, DeepApproximateMLL
from gpytorch.likelihoods import GaussianLikelihood


class ToyDeepGPHiddenLayer(DeepGPLayer):
    def __init__(self, input_dims, output_dims, num_inducing=128, mean_type='constant'):
        if output_dims is None:
            inducing_points = torch.randn(num_inducing, input_dims)
            batch_shape = torch.Size([])
        else:
            inducing_points = torch.randn(output_dims, num_inducing, input_dims)
            batch_shape = torch.Size([output_dims])

        variational_distribution = CholeskyVariationalDistribution(
            num_inducing_points=num_inducing,
            batch_shape=batch_shape
        )

        variational_strategy = VariationalStrategy(
            self,
            inducing_points,
            variational_distribution,
            learn_inducing_locations=True
        )

        super(ToyDeepGPHiddenLayer, self).__init__(variational_strategy, input_dims, output_dims)

        if mean_type == 'constant':
            self.mean_module = ConstantMean(batch_shape=batch_shape)
        else:
            self.mean_module = LinearMean(input_dims)
        self.covar_module = ScaleKernel(
            RBFKernel(batch_shape=batch_shape, ard_num_dims=input_dims),
            batch_shape=batch_shape, ard_num_dims=None
        )

        self.linear_layer = Linear(input_dims, 1)

    def forward(self, x):
        mean_x = self.mean_module(x)  # self.linear_layer(x).squeeze(-1)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)

    def __call__(self, x, *other_inputs, **kwargs):
        """
        Overriding __call__ isn't strictly necessary, but it lets us add concatenation based skip connections
        easily. For example, hidden_layer2(hidden_layer1_outputs, inputs) will pass the concatenation of the first
        hidden layer's outputs and the input data to hidden_layer2.
        """
        if len(other_inputs):
            x = x.rsample()

            processed_inputs = [
                inp.unsqueeze(0).expand(self.num_samples, *inp.shape)
                for inp in other_inputs
            ]

            x = torch.cat([x] + processed_inputs, dim=-1)

        return super().__call__(x, are_samples=bool(len(other_inputs)))


### Building the deep GP¶

Now that we’ve defined a class for our hidden layers and a class for our output layer, we can build our deep GP. To do this, we create a Module whose forward is simply responsible for forwarding through the various layers.

This structure also makes it easy to express various network connectivities. For example, calling

hidden_rep2 = self.second_hidden_layer(hidden_rep1, inputs)


in forward would cause the second hidden layer to use both the output of the first hidden layer and the input data as inputs, concatenating the two together.
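The shape bookkeeping behind this concatenation can be sketched in plain Python (this is an illustration, not GPyTorch code; `skip_connection_shape` is a name we made up). Hidden-layer samples have shape `(num_samples, batch, output_dims)`, while the raw inputs have shape `(batch, input_dims)` and are expanded with a leading sample dimension before concatenation:

```python
# Illustrative shape bookkeeping for the concatenation-based skip connection:
# hidden-layer samples have shape (num_samples, batch, output_dims); the raw
# inputs (batch, input_dims) gain a leading sample dimension, then the two are
# concatenated along the last (feature) dimension.

def skip_connection_shape(hidden_shape, input_shape):
    """Return the shape of the tensor passed on to the next layer."""
    num_samples, batch, output_dims = hidden_shape
    batch_in, input_dims = input_shape
    assert batch == batch_in, "hidden outputs and inputs must share a batch size"
    return (num_samples, batch, output_dims + input_dims)

print(skip_connection_shape((10, 256, 10), (256, 3)))  # (10, 256, 13)
```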

:

num_output_dims = 2 if smoke_test else 10


class DeepGP(DeepGP):
    def __init__(self, train_x_shape):
        hidden_layer = ToyDeepGPHiddenLayer(
            input_dims=train_x_shape[-1],
            output_dims=num_output_dims,
            mean_type='linear',
        )

        last_layer = ToyDeepGPHiddenLayer(
            input_dims=hidden_layer.output_dims,
            output_dims=None,
            mean_type='constant',
        )

        super().__init__()

        self.hidden_layer = hidden_layer
        self.last_layer = last_layer
        self.likelihood = GaussianLikelihood()

    def forward(self, inputs):
        hidden_rep1 = self.hidden_layer(inputs)
        output = self.last_layer(hidden_rep1)
        return output

    def predict(self, test_loader):
        # Convenience method: accumulate predictions over a DataLoader.
        with torch.no_grad():
            mus = []
            variances = []
            lls = []
            for x_batch, y_batch in test_loader:
                preds = self.likelihood(self(x_batch))
                mus.append(preds.mean)
                variances.append(preds.variance)
                lls.append(self.likelihood.log_marginal(y_batch, self(x_batch)))

        return torch.cat(mus, dim=-1), torch.cat(variances, dim=-1), torch.cat(lls, dim=-1)


:

model = DeepGP(train_x.shape)
if torch.cuda.is_available():
    model = model.cuda()


### Objective function (approximate marginal log likelihood/ELBO)¶

Because deep GPs involve internal sampling (even in the stochastic variational setting), we need to handle the objective function (e.g. the ELBO) slightly differently. To do this, wrap the standard objective function (e.g. gpytorch.mlls.VariationalELBO) with a gpytorch.mlls.DeepApproximateMLL.

### Training/Testing¶

The training loop for a deep GP looks similar to a standard GP model with stochastic variational inference.

:

import tqdm
from gpytorch.mlls import DeepApproximateMLL, VariationalELBO

# this is for running the notebook in our testing framework
num_epochs = 1 if smoke_test else 10
num_samples = 3 if smoke_test else 10

optimizer = torch.optim.Adam([
    {'params': model.parameters()},
], lr=0.01)
mll = DeepApproximateMLL(VariationalELBO(model.likelihood, model, train_x.shape[-2]))

epochs_iter = tqdm.notebook.tqdm(range(num_epochs), desc="Epoch")
for i in epochs_iter:
    # Within each epoch, we go over each minibatch of data
    minibatch_iter = tqdm.notebook.tqdm(train_loader, desc="Minibatch", leave=False)
    for x_batch, y_batch in minibatch_iter:
        with gpytorch.settings.num_likelihood_samples(num_samples):
            optimizer.zero_grad()
            output = model(x_batch)
            loss = -mll(output, y_batch)
            loss.backward()
            optimizer.step()

            minibatch_iter.set_postfix(loss=loss.item())




The output distribution of a deep GP in this framework is actually a mixture of num_samples Gaussians for each output. We get predictions the same way with all GPyTorch models, but we do currently need to do some reshaping to get the means and variances in a reasonable form.

Note that you may have to do more epochs of training than this example to get optimal performance; however, the performance on this particular dataset is pretty good after 10.
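Before looking at the code, it helps to see why the reshaping matters: the predictive distribution at each test point is an equally weighted mixture of num_samples Gaussians, and while the mixture mean is just the average of the component means, the mixture variance picks up an extra term from the spread of the component means (law of total variance). A plain-Python sketch (not GPyTorch code; `mixture_moments` is a name we made up):

```python
# Moments of an equally weighted mixture of Gaussians. Averaging the means is
# correct, but the mixture variance is E[Var] + Var[E]: the average component
# variance plus the spread of the component means.

def mixture_moments(means, variances):
    n = len(means)
    mix_mean = sum(means) / n
    mix_var = sum(variances) / n + sum((m - mix_mean) ** 2 for m in means) / n
    return mix_mean, mix_var

mean, var = mixture_moments([0.0, 2.0], [1.0, 1.0])
print(mean, var)  # 1.0 2.0 (component variance 1 plus mean spread 1)
```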

:

test_dataset = TensorDataset(test_x, test_y)
test_loader = DataLoader(test_dataset, batch_size=1024)

model.eval()
predictive_means, predictive_variances, test_lls = model.predict(test_loader)

rmse = torch.mean(torch.pow(predictive_means.mean(0) - test_y, 2)).sqrt()
print(f"RMSE: {rmse.item()}, NLL: {-test_lls.mean().item()}")

RMSE: 0.11812344938516617, NLL: 0.09955193847417831

[ ]:




## Deep Sigma Point Processes¶

In this notebook, we provide a GPyTorch implementation of Deep Sigma Point Processes (DSPPs), as described in Jankowiak et al., 2020 (http://www.auai.org/uai2020/proceedings/339_main_paper.pdf).

It will be useful to compare and contrast this notebook with our standard Deep GP notebook, as the computational structure of a Deep GP and a DSPP are quite similar.

:

import gpytorch
import torch
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.means import ConstantMean, LinearMean
from gpytorch.kernels import ScaleKernel, MaternKernel
from gpytorch.variational import VariationalStrategy, BatchDecoupledVariationalStrategy
from gpytorch.variational import MeanFieldVariationalDistribution
from gpytorch.models.deep_gps import DeepGP
from gpytorch.models.deep_gps.dspp import DSPPLayer, DSPP
import gpytorch.settings as settings


### Basic settings¶

In the next cell, we define some basic settings that can be tuned. The only hyperparameter that is DSPP specific is num_quadrature_sites, which effectively determines the number of mixtures that the output distribution will have. hidden_dim controls the width of the hidden GP layer. The other parameters are standard optimization hyperparameters.

:

import os

# this is for running the notebook in our testing framework
smoke_test = ('CI' in os.environ)

batch_size = 1000                 # Size of minibatch
milestones = [20, 150, 300]       # Epochs at which we will lower the learning rate by a factor of 0.1
num_inducing_pts = 300            # Number of inducing points in each hidden layer
num_epochs = 400                  # Number of epochs to train for
initial_lr = 0.01                 # Initial learning rate
hidden_dim = 3                    # Number of GPs (i.e., the width) in the hidden layer.
num_quadrature_sites = 8          # Number of quadrature sites (see the paper for details; 5-10 generally works well)

## Modified settings for smoke test purposes
num_epochs = num_epochs if not smoke_test else 1
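For intuition about num_quadrature_sites: a DSPP propagates a small, fixed set of quadrature sites through each GP layer instead of Monte Carlo samples. The idea is the same as Gauss-Hermite quadrature; here is a self-contained plain-Python illustration (not GPyTorch code) using the classical 3-point rule for a standard normal, whose nodes are 0 and ±√3 with weights 2/3, 1/6, 1/6, and which is exact for polynomials up to degree 5:

```python
import math

# 3-site Gauss-Hermite-style quadrature for N(0, 1): expectations become a
# small weighted sum over fixed sites, rather than an average over random
# samples. DSPPs apply the same idea layer by layer with learned site weights.
sites = [0.0, math.sqrt(3.0), -math.sqrt(3.0)]
weights = [2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0]

def gaussian_expectation(f):
    return sum(w * f(x) for w, x in zip(weights, sites))

print(gaussian_expectation(lambda x: x ** 2))  # ~1, the true E[x^2] under N(0, 1)
print(gaussian_expectation(lambda x: x ** 4))  # ~3, the true E[x^4] under N(0, 1)
```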


For this example notebook, we’ll be using the bike UCI dataset used in the paper. Running the next cell downloads a copy of the dataset. We will be using the same normalization, randomization, and train/test splitting scheme as used in the paper, although for this demo notebook we do not use a validation set as we won’t be tuning any hyperparameters.

:

import urllib.request
from math import floor
from scipy.io import loadmat

if not smoke_test and not os.path.isfile('../bike.mat'):
    # The preprocessed bike.mat file must be present in the parent directory;
    # the download URL is not reproduced here, so fetch a copy manually.
    raise FileNotFoundError("'../bike.mat' was not found")

if smoke_test:  # this is for running the notebook in our testing framework
    X, y = torch.randn(1000, 3), torch.randn(1000)
else:
    data = torch.Tensor(loadmat('../bike.mat')['data'])

    # Map features to [-1, 1]
    X = data[:, :-1]
    X = X - X.min(0)[0]
    X = 2.0 * (X / X.max(0)[0]) - 1.0

    # Z-score labels
    y = data[:, -1]
    y -= y.mean()
    y /= y.std()

shuffled_indices = torch.randperm(X.size(0))
X = X[shuffled_indices, :]
y = y[shuffled_indices]

train_n = int(floor(0.75 * X.size(0)))

train_x = X[:train_n, :].contiguous()
train_y = y[:train_n].contiguous()
test_x = X[train_n:, :].contiguous()
test_y = y[train_n:].contiguous()

if torch.cuda.is_available():
    train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()

print(train_x.shape, train_y.shape, test_x.shape, test_y.shape)

torch.Size([13034, 17]) torch.Size([13034]) torch.Size([4345, 17]) torch.Size([4345])


### Create PyTorch DataLoader objects¶

As we will be training and predicting on minibatches, we use the standard PyTorch TensorDataset and DataLoader framework to handle getting batches of data.

:

from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(train_x, train_y)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = TensorDataset(test_x, test_y)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


### Initialize Hidden Layer Inducing Points¶

Just to match the setup of the paper, we initialize the inducing points for each GP in the hidden layer by using kmeans clustering. The DSPPHiddenLayer class as defined below can also take num_inducing=300, inducing_points=None as arguments to randomly initialize the inducing points. However, we find that using kmeans to initialize can improve optimization in most cases.

:

from scipy.cluster.vq import kmeans2

# Use k-means to initialize inducing points (only helpful for the first layer)
inducing_points = (train_x[torch.randperm(min(1000 * 100, train_n))[0:num_inducing_pts], :])
inducing_points = inducing_points.clone().data.cpu().numpy()
# kmeans2 returns a (centroids, labels) tuple; we only need the centroids
inducing_points = torch.tensor(kmeans2(train_x.data.cpu().numpy(),
                                       inducing_points, minit='matrix')[0])

if torch.cuda.is_available():
    inducing_points = inducing_points.cuda()
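For intuition about what kmeans2 is doing here, a single Lloyd iteration on 1-D data looks like this in plain Python (an illustration only; `kmeans_step` is a name we made up, and kmeans2 runs such iterations to convergence):

```python
# One Lloyd iteration of k-means on 1-D data: assign each point to its nearest
# centroid, then move each centroid to the mean of its assigned points. The
# converged centroids are what we use as inducing point initializations.

def kmeans_step(points, centroids):
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    return [sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)]

points = [0.0, 0.2, 0.4, 9.8, 10.0, 10.2]
print(kmeans_step(points, [0.0, 5.0]))  # centroids move to ~[0.2, 10.0]
```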


### Create The DSPPHiddenLayer Class¶

The next cell is the most important in the notebook. It will likely be instructive to compare the code below to the analogous code cell in our Deep GP notebook, as they are essentially the same. The only difference is some code at the start to handle the fact that we may pass in prespecified inducing point locations, rather than always initializing them randomly.

Regardless, the best way to think of a DSPP (or DGP) hidden layer class is as a standard GPyTorch variational GP class that has two key aspects:

1. It has a batch shape equal to output_dims. In other words, the way we handle a layer of multiple GPs is with a batch dimension. This means that inducing_points, kernel hyperparameters, etc should all have batch_shape=torch.Size([output_dims]).
2. It extends DSPPLayer rather than ApproximateGP.

These are really the only two differences. A DSPPLayer / DGPLayer will still define a variational distribution and strategy, a prior mean and covariance function, and forward is still responsible for returning the prior.

:

class DSPPHiddenLayer(DSPPLayer):
    def __init__(self, input_dims, output_dims, num_inducing=300, inducing_points=None, mean_type='constant', Q=8):
        if inducing_points is not None and output_dims is not None and inducing_points.dim() == 2:
            # The inducing points were passed in, but the shape doesn't match the number of GPs in this layer.
            # Let's assume we wanted to use the same inducing point initialization for each GP in the layer,
            # and expand the inducing points to match this.
            inducing_points = inducing_points.unsqueeze(0).expand((output_dims,) + inducing_points.shape)
            inducing_points = inducing_points.clone() + 0.01 * torch.randn_like(inducing_points)
        if inducing_points is None:
            # No inducing points were specified, let's just initialize them randomly.
            if output_dims is None:
                # An output_dims of None implies there is only one GP in this layer
                # (e.g., the last layer for univariate regression).
                inducing_points = torch.randn(num_inducing, input_dims)
            else:
                inducing_points = torch.randn(output_dims, num_inducing, input_dims)
        else:
            # Get the number of inducing points from the ones passed in.
            num_inducing = inducing_points.size(-2)

        # Let's use mean field / diagonal covariance structure.
        variational_distribution = MeanFieldVariationalDistribution(
            num_inducing_points=num_inducing,
            batch_shape=torch.Size([output_dims]) if output_dims is not None else torch.Size([])
        )

        # Standard variational inference.
        variational_strategy = VariationalStrategy(
            self,
            inducing_points,
            variational_distribution,
            learn_inducing_locations=True
        )

        batch_shape = torch.Size([]) if output_dims is None else torch.Size([output_dims])

        super(DSPPHiddenLayer, self).__init__(variational_strategy, input_dims, output_dims, Q)

        if mean_type == 'constant':
            # We'll use a constant mean for the final output layer.
            self.mean_module = ConstantMean(batch_shape=batch_shape)
        elif mean_type == 'linear':
            # As in Salimbeni et al. 2017, we find that using a linear mean for the hidden layer improves performance.
            self.mean_module = LinearMean(input_dims, batch_shape=batch_shape)

        self.covar_module = ScaleKernel(MaternKernel(batch_shape=batch_shape, ard_num_dims=input_dims),
                                        batch_shape=batch_shape, ard_num_dims=None)

    def forward(self, x, mean_input=None, **kwargs):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


### Create the DSPP Class¶

The cell below creates a DSPP container that is virtually identical to the one used in the DGP setting. All it is responsible for is instantiating the layers of the DSPP (in this case, one hidden layer and one output layer), and then defining a forward method that passes data through both layers.

As in the DGP example notebook, we also define a predict method purely for convenience that takes a DataLoader and returns predictions for every example in that DataLoader.

:

class TwoLayerDSPP(DSPP):
    def __init__(self, train_x_shape, inducing_points, num_inducing, hidden_dim=3, Q=3):
        hidden_layer = DSPPHiddenLayer(
            input_dims=train_x_shape[-1],
            output_dims=hidden_dim,
            mean_type='linear',
            inducing_points=inducing_points,
            Q=Q,
        )
        last_layer = DSPPHiddenLayer(
            input_dims=hidden_layer.output_dims,
            output_dims=None,
            mean_type='constant',
            inducing_points=None,
            num_inducing=num_inducing,
            Q=Q,
        )

        likelihood = GaussianLikelihood()

        super().__init__(Q)
        self.likelihood = likelihood
        self.last_layer = last_layer
        self.hidden_layer = hidden_layer

    def forward(self, inputs, **kwargs):
        hidden_rep1 = self.hidden_layer(inputs, **kwargs)
        output = self.last_layer(hidden_rep1, **kwargs)
        return output

    def predict(self, loader):
        # Convenience method: accumulate predictions over a DataLoader.
        with torch.no_grad():
            mus, variances, lls = [], [], []
            for x_batch, y_batch in loader:
                preds = self.likelihood(self(x_batch, mean_input=x_batch))
                mus.append(preds.mean.cpu())
                variances.append(preds.variance.cpu())

                # Compute test log probability. The output of a DSPP is a weighted mixture of Q Gaussians,
                # with the Q (log-space) weights given by self.quad_weights. The code below computes the
                # log probability of each test point under this mixture.

                # Step 1: Get log marginal for each Gaussian in the output mixture.
                base_batch_ll = self.likelihood.log_marginal(y_batch, self(x_batch))

                # Step 2: Weight each log marginal by its quadrature weight in log space.
                deep_batch_ll = self.quad_weights.unsqueeze(-1) + base_batch_ll

                # Step 3: Take logsumexp over the mixture dimension, getting test log prob for each datapoint in the batch.
                batch_log_prob = deep_batch_ll.logsumexp(dim=0)
                lls.append(batch_log_prob.cpu())

        return torch.cat(mus, dim=-1), torch.cat(variances, dim=-1), torch.cat(lls, dim=-1)
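Step 3 above relies on the log-sum-exp operation, which turns per-component log probabilities into the log probability under the mixture without ever exponentiating large negative numbers. A plain-Python sketch of the standard max-subtraction trick (an illustration, not the torch implementation):

```python
import math

# Numerically stable log-sum-exp: subtract the max before exponentiating, so
# that no exp() call underflows to 0 or overflows. A naive
# math.log(sum(math.exp(v) for v in values)) would fail for values like -1000,
# since math.exp(-1000) underflows to 0.0 and log(0.0) raises an error.

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

print(logsumexp([-1000.0, -1001.0]))  # ~ -999.687, i.e. -1000 + log(1 + e^-1)
```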


:

model = TwoLayerDSPP(
train_x.shape,
inducing_points,
num_inducing=num_inducing_pts,
hidden_dim=hidden_dim,
)

if torch.cuda.is_available():
model.cuda()

model.train()

:

TwoLayerDSPP(
  (likelihood): GaussianLikelihood(
    (noise_covar): HomoskedasticNoise(
      (raw_noise_constraint): GreaterThan(1.000E-04)
    )
  )
  (last_layer): DSPPHiddenLayer(
    (variational_strategy): VariationalStrategy(
      (_variational_distribution): MeanFieldVariationalDistribution()
    )
    (mean_module): ConstantMean()
    (covar_module): ScaleKernel(
      (base_kernel): MaternKernel(
        (raw_lengthscale_constraint): Positive()
      )
      (raw_outputscale_constraint): Positive()
    )
  )
  (hidden_layer): DSPPHiddenLayer(
    (variational_strategy): VariationalStrategy(
      (_variational_distribution): MeanFieldVariationalDistribution()
    )
    (mean_module): LinearMean()
    (covar_module): ScaleKernel(
      (base_kernel): MaternKernel(
        (raw_lengthscale_constraint): Positive()
      )
      (raw_outputscale_constraint): Positive()
    )
  )
)

:

from gpytorch.mlls import DeepPredictiveLogLikelihood

# The "beta" parameter here corresponds to \beta_{reg} from the paper, and represents a scaling factor on the KL divergence
# portion of the loss.
objective = DeepPredictiveLogLikelihood(model.likelihood, model, num_data=train_n, beta=0.05)


### Train the Model¶

Below is a standard minibatch training loop.

[ ]:

import tqdm

optimizer = torch.optim.Adam([{'params': model.parameters()}], lr=initial_lr)
sched = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)

epochs_iter = tqdm.notebook.tqdm(range(num_epochs), desc="Epoch")

for i in epochs_iter:
    minibatch_iter = tqdm.notebook.tqdm(train_loader, desc="Minibatch", leave=False)
    for x_batch, y_batch in minibatch_iter:
        optimizer.zero_grad()
        output = model(x_batch)
        loss = -objective(output, y_batch)
        loss.backward()
        optimizer.step()
        minibatch_iter.set_postfix(loss=loss.item())
    # Step the learning rate scheduler once per epoch (milestones are epoch counts)
    sched.step()


### Make Predictions, compute RMSE and Test NLL¶

:

model.eval()
means, variances, ll = model.predict(test_loader)

# means currently contains the predictive output from each Gaussian in the mixture.
# To get the total mean output, we take a weighted sum of these means over the
# quadrature weights (which the model stores in log space, hence the .exp()).
weights = model.quad_weights.unsqueeze(-1).exp().cpu()
rmse = ((weights * means).sum(0) - test_y.cpu()).pow(2.0).mean().sqrt().item()
ll = ll.mean().item()

print('RMSE: ', rmse, 'Test NLL: ', -ll)

RMSE:  0.04274941563606262 Test NLL:  -1.690010404586792

[ ]:




## PyTorch NN Integration (Deep Kernel Learning)¶

Because GPyTorch is built on top of PyTorch, you can seamlessly integrate existing PyTorch modules into GPyTorch models. This makes it possible to combine neural networks with GPs, either with exact or approximate inference.

Here we provide some examples of Deep Kernel Learning, which are GP models that use kernels parameterized by neural networks.

### Exact DKL (Deep Kernel Learning) Regression w/ KISS-GP¶

#### Overview¶

In this notebook, we’ll give a brief tutorial on how to use deep kernel learning for regression on a medium scale dataset using SKI. This also demonstrates how to incorporate standard PyTorch modules into a Gaussian process model.

:

import math
import tqdm
import torch
import gpytorch
from matplotlib import pyplot as plt

# Make plots inline
%matplotlib inline


For this example notebook, we’ll be using the elevators UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. For this notebook, we’ll simply be splitting the data using the first 80% of the data as training and the last 20% as testing.

Note: Running the next cell will attempt to download a ~400 KB dataset file to the current directory.

:

import urllib.request
import os
from math import floor
from scipy.io import loadmat

# this is for running the notebook in our testing framework
smoke_test = ('CI' in os.environ)

if not smoke_test and not os.path.isfile('../elevators.mat'):
    print('Downloading \'elevators\' UCI dataset...')
    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1jhWL3YUHvXIaftia4qeAyDwVxo6j1alk',
                               '../elevators.mat')

if smoke_test:  # this is for running the notebook in our testing framework
    X, y = torch.randn(20, 3), torch.randn(20)
else:
    data = torch.Tensor(loadmat('../elevators.mat')['data'])
    # Map features to [-1, 1]
    X = data[:, :-1]
    X = X - X.min(0)[0]
    X = 2 * (X / X.max(0)[0]) - 1
    y = data[:, -1]

train_n = int(floor(0.8 * len(X)))
train_x = X[:train_n, :].contiguous()
train_y = y[:train_n].contiguous()

test_x = X[train_n:, :].contiguous()
test_y = y[train_n:].contiguous()

if torch.cuda.is_available():
    train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()


#### Defining the DKL Feature Extractor¶

Next, we define the neural network feature extractor used to define the deep kernel. In this case, we use a fully connected network with the architecture d -> 1000 -> 500 -> 50 -> 2, as described in the original DKL paper. All of the code below uses standard PyTorch implementations of neural network layers.
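The d -> 1000 -> 500 -> 50 -> 2 architecture is large relative to the GP on top of it; a quick plain-Python parameter count makes that concrete (an illustration only; `mlp_param_count` is a name we made up, and d = 18 here is an arbitrary stand-in for the dataset's input dimensionality):

```python
# Parameter count of a fully connected d -> 1000 -> 500 -> 50 -> 2 network:
# each Linear(a, b) layer contributes a*b weights plus b biases.

def mlp_param_count(dims):
    return sum(a * b + b for a, b in zip(dims, dims[1:]))

d = 18  # hypothetical input dimensionality; substitute your dataset's
print(mlp_param_count([d, 1000, 500, 50, 2]))  # 544652 parameters for d = 18
```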

:

data_dim = train_x.size(-1)

class LargeFeatureExtractor(torch.nn.Sequential):
    def __init__(self):
        super(LargeFeatureExtractor, self).__init__()
        # Fully connected architecture d -> 1000 -> 500 -> 50 -> 2
        self.add_module('linear1', torch.nn.Linear(data_dim, 1000))
        self.add_module('relu1', torch.nn.ReLU())
        self.add_module('linear2', torch.nn.Linear(1000, 500))
        self.add_module('relu2', torch.nn.ReLU())
        self.add_module('linear3', torch.nn.Linear(500, 50))
        self.add_module('relu3', torch.nn.ReLU())
        self.add_module('linear4', torch.nn.Linear(50, 2))

feature_extractor = LargeFeatureExtractor()


#### Defining the DKL-GP Model¶

We now define the GP model. For more details on the use of GP models, see our simpler examples. This model uses a GridInterpolationKernel (SKI) with an RBF base kernel.

##### The forward method¶

In deep kernel learning, the forward method is where most of the interesting new stuff happens. Before calling the mean and covariance modules on the data as in the simple GP regression setting, we first pass the input data x through the neural network feature extractor. Then, to ensure that the output features of the neural network remain in the grid bounds expected by SKI, we scale the resulting features to lie between -1 and 1.

Only after this processing do we call the mean and covariance module of the Gaussian process. This example also demonstrates the flexibility of defining GP models that allow for learned transformations of the data (in this case, via a neural network) before calling the mean and covariance function. Because the neural network in this case maps to two final output features, we will have no problem using SKI.

:

class GPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.GridInterpolationKernel(
            gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=2)),
            num_dims=2, grid_size=100
        )
        self.feature_extractor = feature_extractor

    def forward(self, x):
        # We're first putting our data through a deep net (feature extractor)
        projected_x = self.feature_extractor(x)
        # We're also scaling the features so that they lie in [-1, 1]
        projected_x = projected_x - projected_x.min(0)[0]
        projected_x = 2 * (projected_x / projected_x.max(0)[0]) - 1

        mean_x = self.mean_module(projected_x)
        covar_x = self.covar_module(projected_x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
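The min-max scaling step in forward is easy to see in isolation. A plain-Python sketch (not the torch code above; `scale_to_grid_bounds` is a name we made up):

```python
# Min-max scale a list of values to the interval [-1, 1], mirroring the two
# scaling lines in forward: shift so the minimum is 0, divide by the range,
# then map [0, 1] to [-1, 1]. This keeps features inside the SKI grid bounds.

def scale_to_grid_bounds(values):
    lo, hi = min(values), max(values)
    shifted = [v - lo for v in values]                # now in [0, hi - lo]
    return [2 * s / (hi - lo) - 1 for s in shifted]   # now in [-1, 1]

print(scale_to_grid_bounds([3.0, 5.0, 7.0]))  # [-1.0, 0.0, 1.0]
```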

:

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPRegressionModel(train_x, train_y, likelihood)

if torch.cuda.is_available():
    model = model.cuda()
    likelihood = likelihood.cuda()

##### Training the model¶

The cell below trains the DKL model above, learning both the hyperparameters of the Gaussian process and the parameters of the neural network in an end-to-end fashion using Type-II MLE. We run 60 iterations of training using the Adam optimizer built in to PyTorch. With a decent GPU, this should only take a few seconds.

:

training_iterations = 2 if smoke_test else 60

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the Adam optimizer over all parameter groups
optimizer = torch.optim.Adam([
    {'params': model.feature_extractor.parameters()},
    {'params': model.covar_module.parameters()},
    {'params': model.mean_module.parameters()},
    {'params': model.likelihood.parameters()},
], lr=0.01)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

def train():
    iterator = tqdm.notebook.tqdm(range(training_iterations))
    for i in iterator:
        # Zero gradients from the previous iteration
        optimizer.zero_grad()
        # Get output from model
        output = model(train_x)
        # Calc loss and backprop derivatives
        loss = -mll(output, train_y)
        loss.backward()
        iterator.set_postfix(loss=loss.item())
        optimizer.step()

%time train()


CPU times: user 45.6 s, sys: 4.29 s, total: 49.8 s
Wall time: 13.7 s

##### Making Predictions¶

The next cell gets the predictive covariance for the test set (and also technically gets the predictive mean, stored in preds.mean) using the standard SKI testing code, with no acceleration or precomputation.

:

model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.use_toeplitz(False), gpytorch.settings.fast_pred_var():
    preds = model(test_x)

:

print('Test MAE: {}'.format(torch.mean(torch.abs(preds.mean - test_y))))

Test MAE: 0.07007455825805664

[ ]:




### SVDKL (Stochastic Variational Deep Kernel Learning) on CIFAR10/100¶

In this notebook, we’ll demonstrate the steps necessary to train a medium sized DenseNet (https://arxiv.org/abs/1608.06993) on either of two popularly used benchmark datasets in computer vision (CIFAR10 and CIFAR100). We’ll be training the DKL model entirely end to end using the standard 300 Epoch training schedule and SGD.

This notebook is largely for tutorial purposes. If your goal is just to get (for example) a trained DKL + CIFAR100 model, we recommend that you move this code to a simple python script and run that, rather than training directly out of a python notebook. We find that training is just a bit faster outside of a notebook. We also of course recommend that you increase the size of the DenseNet used to a full sized model if you would like to achieve state of the art performance.

Furthermore, because this notebook involves training an actually reasonably large neural network, it is strongly recommended that you have a decent GPU available for this, as with all large deep learning models.

:

from torch.optim import SGD, Adam
from torch.optim.lr_scheduler import MultiStepLR
import torch.nn.functional as F
from torch import nn
import torch
import os
import torchvision.datasets as dset
import torchvision.transforms as transforms
import gpytorch
import math
import tqdm


#### Set up data augmentation¶

The first thing we’ll do is set up some data augmentation transformations to use during training, as well as some basic normalization to use during both training and testing. We’ll use random crops and flips to train the model, and do basic normalization at both training time and test time. To accomplish these transformations, we use standard torchvision transforms.

:

aug_trans = [transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip()]
normalize = transforms.Normalize(mean=[0.5071, 0.4867, 0.4408], std=[0.2675, 0.2565, 0.2761])
common_trans = [transforms.ToTensor(), normalize]
train_compose = transforms.Compose(aug_trans + common_trans)
test_compose = transforms.Compose(common_trans)


Next, we create dataloaders for the selected dataset using the built in torchvision datasets. The cell below will download either the cifar10 or cifar100 dataset, depending on which choice is made. The default here is cifar10, however training is just as fast on either dataset.

After downloading the datasets, we create standard torch.utils.data.DataLoaders for each dataset that we will be using to get minibatches of augmented data.
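What a DataLoader with shuffle=True does each epoch can be sketched in a few lines of plain Python (an illustration only; `minibatches` is a name we made up): permute the indices, then yield fixed-size chunks.

```python
import random

# Minimal sketch of shuffled minibatching: permute the example indices once
# per epoch, then yield successive fixed-size slices (the last batch may be
# smaller when the dataset size isn't a multiple of batch_size).

def minibatches(data, batch_size, shuffle=True, seed=None):
    indices = list(range(len(data)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

print(list(minibatches(list(range(10)), batch_size=4, shuffle=False)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```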

:

dataset = "cifar10"

if ('CI' in os.environ):  # this is for running the notebook in our testing framework
    train_set = torch.utils.data.TensorDataset(torch.randn(8, 3, 32, 32), torch.rand(8).round().long())
    test_set = torch.utils.data.TensorDataset(torch.randn(4, 3, 32, 32), torch.rand(4).round().long())
    num_classes = 2
elif dataset == 'cifar10':
    train_set = dset.CIFAR10('data', train=True, transform=train_compose, download=True)
    test_set = dset.CIFAR10('data', train=False, transform=test_compose)
    num_classes = 10
elif dataset == 'cifar100':
    train_set = dset.CIFAR100('data', train=True, transform=train_compose, download=True)
    test_set = dset.CIFAR100('data', train=False, transform=test_compose)
    num_classes = 100
else:
    raise RuntimeError('dataset must be one of "cifar100" or "cifar10"')

train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False)

Files already downloaded and verified


#### Creating the DenseNet Model¶

With the data loaded, we can move on to defining our DKL model. A DKL model consists of three components: the neural network, the Gaussian process layer used after the neural network, and the Softmax likelihood.

The first step is defining the neural network architecture. To do this, we use a slightly modified version of the DenseNet available in the standard PyTorch package. Specifically, we modify it to remove the softmax layer, since we’ll only be needing the final features extracted from the neural network.

:

from densenet import DenseNet

class DenseNetFeatureExtractor(DenseNet):
def forward(self, x):
features = self.features(x)
out = F.relu(features, inplace=True)
out = F.avg_pool2d(out, kernel_size=self.avgpool_size).view(features.size(0), -1)
return out

feature_extractor = DenseNetFeatureExtractor(block_config=(6, 6, 6), num_classes=num_classes)
num_features = feature_extractor.classifier.in_features


#### Creating the GP Layer¶

In the next cell, we create the layer of Gaussian process models that are called after the neural network. In this case, we’ll be using one GP per feature, as in the SV-DKL paper. The outputs of these Gaussian processes will then be mixed in the softmax likelihood.

:

class GaussianProcessLayer(gpytorch.models.ApproximateGP):
    def __init__(self, num_dim, grid_bounds=(-10., 10.), grid_size=64):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            num_inducing_points=grid_size, batch_shape=torch.Size([num_dim])
        )

        # Our base variational strategy is a GridInterpolationVariationalStrategy,
        # which places variational inducing points on a Grid
        # We wrap it with a IndependentMultitaskVariationalStrategy so that our output is a vector-valued GP
        variational_strategy = gpytorch.variational.IndependentMultitaskVariationalStrategy(
            gpytorch.variational.GridInterpolationVariationalStrategy(
                self, grid_size=grid_size, grid_bounds=[grid_bounds],
                variational_distribution=variational_distribution,
            ), num_tasks=num_dim,
        )
        super().__init__(variational_strategy)

        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(
                lengthscale_prior=gpytorch.priors.SmoothedBoxPrior(
                    math.exp(-1), math.exp(1), sigma=0.1, transform=torch.exp
                )
            )
        )
        self.mean_module = gpytorch.means.ConstantMean()
        self.grid_bounds = grid_bounds

    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)


#### Creating the full SVDKL Model¶

With both the DenseNet feature extractor and GP layer defined, we can put them together in a single module that simply calls one and then the other, much like building any Sequential neural network in PyTorch. This completes defining our DKL model.

:

class DKLModel(gpytorch.Module):
    def __init__(self, feature_extractor, num_dim, grid_bounds=(-10., 10.)):
        super(DKLModel, self).__init__()
        self.feature_extractor = feature_extractor
        self.gp_layer = GaussianProcessLayer(num_dim=num_dim, grid_bounds=grid_bounds)
        self.grid_bounds = grid_bounds
        self.num_dim = num_dim

    def forward(self, x):
        features = self.feature_extractor(x)
        features = gpytorch.utils.grid.scale_to_bounds(features, self.grid_bounds[0], self.grid_bounds[1])
        # This next line makes it so that we learn a GP for each feature
        features = features.transpose(-1, -2).unsqueeze(-1)
        res = self.gp_layer(features)
        return res

model = DKLModel(feature_extractor, num_dim=num_features)
likelihood = gpytorch.likelihoods.SoftmaxLikelihood(num_features=model.num_dim, num_classes=num_classes)

# If you run this example without CUDA, I hope you like waiting!
if torch.cuda.is_available():
    model = model.cuda()
    likelihood = likelihood.cuda()


#### Defining Training and Testing Code¶

Next, we define the basic optimization loop and testing code. This code is entirely analogous to the standard PyTorch training loop. We create a torch.optim.SGD optimizer with the parameters of the neural network, on which we apply the standard amount of weight decay suggested by the paper; the parameters of the Gaussian process, from which we omit weight decay, as L2 regularization on top of variational inference is not necessary; and the mixing parameters of the Softmax likelihood.

We use the standard learning rate schedule from the paper, where we decrease the learning rate by a factor of ten 50% of the way through training, and again at 75% of the way through training.
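That schedule is simple enough to state as a plain-Python function (an illustration, not the MultiStepLR implementation; `lr_at_epoch` is a name we made up): multiply the base rate by gamma once for every milestone that has passed.

```python
# Effective learning rate under a MultiStepLR-style schedule: the base rate
# is multiplied by gamma (0.1 here) for each milestone the epoch has passed.

def lr_at_epoch(base_lr, epoch, milestones, gamma=0.1):
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

n_epochs, base_lr = 300, 0.1
milestones = [int(0.5 * n_epochs), int(0.75 * n_epochs)]  # [150, 225]
print(lr_at_epoch(base_lr, 100, milestones))  # 0.1 (no milestone passed yet)
print(lr_at_epoch(base_lr, 200, milestones))  # ~0.01 (one milestone passed)
```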

:

n_epochs = 1
lr = 0.1
optimizer = SGD([
    {'params': model.feature_extractor.parameters(), 'weight_decay': 1e-4},
    {'params': model.gp_layer.hyperparameters(), 'lr': lr * 0.01},
    {'params': model.gp_layer.variational_parameters()},
    {'params': likelihood.parameters()},
], lr=lr, momentum=0.9, nesterov=True, weight_decay=0)
scheduler = MultiStepLR(optimizer, milestones=[int(0.5 * n_epochs), int(0.75 * n_epochs)], gamma=0.1)
mll = gpytorch.mlls.VariationalELBO(likelihood, model.gp_layer, num_data=len(train_loader.dataset))

def train(epoch):
    model.train()
    likelihood.train()

    minibatch_iter = tqdm.notebook.tqdm(train_loader, desc=f"(Epoch {epoch}) Minibatch")
    with gpytorch.settings.num_likelihood_samples(8):
        for data, target in minibatch_iter:
            if torch.cuda.is_available():
                data, target = data.cuda(), target.cuda()
            optimizer.zero_grad()
            output = model(data)
            loss = -mll(output, target)
            loss.backward()
            optimizer.step()
            minibatch_iter.set_postfix(loss=loss.item())

def test():
    model.eval()
    likelihood.eval()

    correct = 0
    with torch.no_grad(), gpytorch.settings.num_likelihood_samples(16):
        for data, target in test_loader:
            if torch.cuda.is_available():
                data, target = data.cuda(), target.cuda()
            output = likelihood(model(data))  # This gives us 16 samples from the predictive distribution
            pred = output.probs.mean(0).argmax(-1)  # Taking the mean over all of the samples we've drawn
            correct += pred.eq(target.view_as(pred)).cpu().sum()
    print('Test set: Accuracy: {}/{} ({}%)'.format(
        correct, len(test_loader.dataset), 100. * correct / float(len(test_loader.dataset))
    ))


We are now ready to train the model. At the end of each epoch we report the current test loss and accuracy, and we save a model checkpoint to a file.

:

for epoch in range(1, n_epochs + 1):
    with gpytorch.settings.use_toeplitz(False):
        train(epoch)
        test()
    scheduler.step()
    state_dict = model.state_dict()
    likelihood_state_dict = likelihood.state_dict()
    torch.save({'model': state_dict, 'likelihood': likelihood_state_dict}, 'dkl_cifar_checkpoint.dat')


Test set: Accuracy: 4583/10000 (45.83000183105469%)


We cut training short here. With more training, the network will achieve accuracy comparable to a standard neural network.

## Pyro Integration¶

GPyTorch can optionally work with the Pyro probabilistic programming language. This makes it possible to use Pyro’s advanced inference algorithms, or to incorporate GPs as part of larger probabilistic models. GPyTorch offers two ways of integrating with Pyro:

### High-level Pyro Interface (for predictive models)¶

The high-level interface provides a simple wrapper around ApproximateGP that makes it possible to use Pyro’s inference tools with GPyTorch models. It is best designed for:

• Developing models that will be used for predictive tasks
• GPs with likelihoods that have additional latent variables

The Pyro + GPyTorch High-Level Introduction gives an overview of the high-level interface. For a more in-depth example that shows off the power of the integration, see the Clustered Multitask GP Example.

#### Predictions with Pyro + GPyTorch (High-Level Interface)¶

##### Overview¶

In this example, we will give an overview of the high-level Pyro-GPyTorch integration - designed for predictive models. This will introduce you to the key GPyTorch objects that play with Pyro. Here are the key benefits of the integration:

Pyro provides:

• The engines for performing approximate inference or sampling
• The ability to define additional latent variables

GPyTorch provides:

• A library of kernels/means/likelihoods
• Mechanisms for efficient GP computations
:

import math
import torch
import pyro
import tqdm
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline


In this example, we will be doing simple variational regression to learn a monotonic function. This example is doing the exact same thing as GPyTorch’s native approximate inference, except we’re now using Pyro’s variational inference engine.

In general, if this were your dataset, you’d be better off using GPyTorch’s native exact or approximate GPs. (We’re just using a simple example to introduce the GPyTorch/Pyro integration.)

:

train_x = torch.linspace(0., 1., 21)
train_y = torch.pow(train_x, 2).mul_(3.7)
train_y = train_y.div_(train_y.max())
train_y += torch.randn_like(train_y).mul_(0.02)

fig, ax = plt.subplots(1, 1, figsize=(3, 2))
ax.plot(train_x.numpy(), train_y.numpy(), 'bo')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend(['Training data'])

:

<matplotlib.legend.Legend at 0x11ddf7320>

##### The PyroGP model¶

In order to use Pyro with GPyTorch, your model must inherit from gpytorch.models.PyroGP (rather than gpytorch.models.ApproximateGP). PyroGP extends the ApproximateGP class and differs in a few key ways:

• It adds the model and guide functions which are used by Pyro’s inference engine.
• Its constructor requires three additional arguments beyond the variational strategy:
• likelihood - the model’s likelihood
• num_data - the total amount of training data (required for minibatch SVI training)
• name_prefix - a unique identifier for the model
:

class PVGPRegressionModel(gpytorch.models.PyroGP):
    def __init__(self, train_x, train_y, likelihood):
        # Define all the variational stuff
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            num_inducing_points=train_y.numel(),
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, train_x, variational_distribution
        )

        # Standard initialization
        super(PVGPRegressionModel, self).__init__(
            variational_strategy,
            likelihood,
            num_data=train_y.numel(),
            name_prefix="simple_regression_model"
        )
        self.likelihood = likelihood

        # Mean, covar
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.MaternKernel(nu=1.5)
        )

    def forward(self, x):
        mean = self.mean_module(x)  # Returns an n_data vec
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)

:

model = PVGPRegressionModel(train_x, train_y, gpytorch.likelihoods.GaussianLikelihood())

##### Performing inference with Pyro¶

Unlike all the other examples in this library, PyroGP models use Pyro’s inference and optimization classes (rather than the classes provided by PyTorch).

If you are unfamiliar with Pyro’s inference tools, we recommend checking out the Pyro SVI tutorial.

:

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)
num_iter = 2 if smoke_test else 200
num_particles = 1 if smoke_test else 256

def train(lr=0.01):
    optimizer = pyro.optim.Adam({"lr": lr})
    elbo = pyro.infer.Trace_ELBO(num_particles=num_particles, vectorize_particles=True, retain_graph=True)
    svi = pyro.infer.SVI(model.model, model.guide, optimizer, elbo)
    model.train()

    iterator = tqdm.notebook.tqdm(range(num_iter))
    for i in iterator:
        loss = svi.step(train_x, train_y)
        iterator.set_postfix(loss=loss)

%time train()


CPU times: user 17.7 s, sys: 460 ms, total: 18.2 s
Wall time: 2.75 s


In this example, we are only performing inference over the GP latent function (and its associated hyperparameters). In later examples, we will see that this basic loop also performs inference over any additional latent variables that we define.

##### Making predictions¶

For some problems, we simply want to use Pyro to perform inference over latent variables. However, we can also use the model’s (approximate) posterior predictive distribution. Making predictions with a PyroGP model is exactly the same as for standard GPyTorch models.

:

fig, ax = plt.subplots(1, 1, figsize=(4, 3))
train_data, = ax.plot(train_x.cpu().numpy(), train_y.cpu().numpy(), 'bo')

model.eval()
output = model.likelihood(model(train_x))

mean = output.mean
lower, upper = output.confidence_region()
line, = ax.plot(train_x.cpu().numpy(), mean.detach().cpu().numpy())
ax.fill_between(train_x.cpu().numpy(), lower.detach().cpu().numpy(),
upper.detach().cpu().numpy(), color=line.get_color(), alpha=0.5)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend([train_data, line], ['Train data', 'Prediction'])

:

<matplotlib.legend.Legend at 0x11e3ffeb8>

##### Next steps¶

This was a pretty boring example, and it wasn’t really all that different from GPyTorch’s native SVGP implementation! The real power of the Pyro integration comes when we have additional latent variables to infer over. The next example demonstrates this: it learns a clustering over multiple time series using multitask GPs and Pyro.

#### Clustered Multitask GP (w/ Pyro/GPyTorch High-Level Interface)¶

##### Introduction¶

In this example, we use the Pyro integration for a GP model with additional latent variables.

We are modelling a multitask GP in this example. Rather than assuming a linear correlation among the different tasks, we assume that there is cluster structure for the different tasks. Let’s assume there are $$k$$ different clusters of tasks. The generative model for task $$i$$ is:

$p(\mathbf y_i \mid \mathbf x_i) = \int \sum_{z_i=1}^k p(\mathbf y_i \mid \mathbf f (\mathbf x_i), z_i) \: p(z_i) \: p(\mathbf f (\mathbf x_i) ) \: d \mathbf f$

where $$z_i$$ is the cluster assignment for task $$i$$. There are therefore $$k$$ latent functions $$\mathbf f = [f_1 \ldots f_k]$$, each modelled by a GP, representing each cluster.
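To make this generative process concrete, here is a small sketch that samples data the way the model assumes it is generated. All sizes are hypothetical, and fixed sinusoids stand in for the GP draws $$f_1 \ldots f_k$$ (a real model would sample these from GP priors):

```python
import math
import torch

torch.manual_seed(0)
k, t, n = 2, 4, 50  # clusters, tasks, data points (hypothetical sizes)
x = torch.linspace(0, 1, n)

# One latent function per cluster; sinusoids stand in for the GP draws f_1 ... f_k
f = torch.stack([torch.sin(2 * math.pi * x), torch.cos(2 * math.pi * x)])  # k x n

# Each task i draws a cluster assignment z_i uniformly, then observes f_{z_i} plus noise
z = torch.randint(0, k, (t,))        # one cluster assignment per task
y = f[z] + 0.05 * torch.randn(t, n)  # t x n noisy observations
```

Inference runs this process in reverse: given the noisy observations, we recover both the latent functions and the per-task assignments.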

Our goal is therefore to infer:

• The latent functions $$f_1 \ldots f_k$$
• The cluster assignments $$z_i$$ for each task
:

import math
import torch
import pyro
import gpytorch
from matplotlib import pyplot as plt

%matplotlib inline

# this is for running the notebook in our testing framework
import os
smoke_test = ('CI' in os.environ)


The standard GPyTorch variational objects will take care of inferring the latent functions $$f_1 \ldots f_k$$. However, we do need to add the additional latent variables $$z_i$$ to the models. We will do so by creating a custom likelihood that models:

$\sum_{z_i=1}^k p(\mathbf y_i \mid \mathbf f (\mathbf x_i), z_i) \: p(z_i)$

GPyTorch’s likelihoods are capable of modeling additional latent variables. Our custom likelihood needs to define the following three functions:

• pyro_model (needs to call through to super().pyro_model at the end), which defines the prior distribution for additional latent variables
• pyro_guide (needs to call through to super().pyro_guide at the end), which defines the variational (guide) distribution for additional latent variables
• forward, which defines the observation distributions conditioned on $$\mathbf f (\mathbf x_i)$$ and any additional latent variables.
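Before defining these functions, it may help to see what the marginalized likelihood above computes. Here is a tiny numeric sketch for a single observation, with hypothetical values and k = 2 clusters:

```python
import math

# Hypothetical values for one task: observation y, latent function values f_1(x), f_2(x)
y = 0.9
f = [1.0, -1.0]   # f_z(x) for each of the k = 2 clusters
sigma = 0.1       # Gaussian observation noise
p_z = [0.5, 0.5]  # uniform prior over cluster assignments

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# sum_{z=1}^{k} p(y | f(x), z) p(z): a mixture of Gaussians centered at each f_z(x)
likelihood = sum(p * normal_pdf(y, mu, sigma) for mu, p in zip(f, p_z))
```

The result is a Gaussian mixture density evaluated at y; the custom likelihood expresses this same quantity with Pyro distributions so that it can be used inside SVI.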
###### The pyro_model function¶

For each task, we will model the cluster assignment with a OneHotCategorical variable, where each cluster has equal probability. The pyro_model function will make a pyro.sample call to this prior distribution and then call the super method:

# self.prior_cluster_logits = torch.zeros(num_tasks, num_clusters)

def pyro_model(self, function_dist, target):
    cluster_assignment_samples = pyro.sample(
        self.name_prefix + ".cluster_logits",  # self.name_prefix is added by PyroGP
        pyro.distributions.OneHotCategorical(logits=self.prior_cluster_logits).to_event(1)
    )
    return super().pyro_model(
        function_dist,
        target,
        cluster_assignment_samples=cluster_assignment_samples
    )


Note that we are adding an additional argument cluster_assignment_samples to the super().pyro_model call. This will pass the cluster assignment samples to the forward call, which is necessary for inference.

###### The pyro_guide function¶

For each task, the variational (guide) distribution will also be a OneHotCategorical variable, defined by the parameter self.variational_cluster_logits. The pyro_guide function will make a pyro.sample call to this variational distribution and then call the super method:

def pyro_guide(self, function_dist, target):
    pyro.sample(
        self.name_prefix + ".cluster_logits",  # self.name_prefix is added by PyroGP
        pyro.distributions.OneHotCategorical(logits=self.variational_cluster_logits).to_event(1)
    )
    return super().pyro_guide(function_dist, target)


Note that, unlike pyro_model, pyro_guide does not pass any additional arguments to the super call; the sample statement alone is enough for Pyro to match the guide’s latent variable to the one in the model.

###### The forward function¶

The pyro_model function passes the additional keyword argument cluster_assignment_samples to the forward call. Therefore, our forward method will define the conditional probability $$p(\mathbf y_i \mid \mathbf f(\mathbf x), z_i)$$, where $$\mathbf f(\mathbf x)$$ corresponds to the variable function_samples and $$z_i$$ corresponds to the variable cluster_assignment_samples.

In our example $$p(\mathbf y_i \mid \mathbf f(\mathbf x), z_i)$$ corresponds to a Gaussian noise model.

# self.raw_noise is the Gaussian noise parameter
# function_samples is n x k
# cluster_assignment_samples is t x k, where t is the number of tasks

def forward(self, function_samples, cluster_assignment_samples):
    # The to_event call is necessary because we are returning a multitask distribution,
    # where each task dimension corresponds to each of the t tasks
    return pyro.distributions.Normal(
        loc=(function_samples.unsqueeze(-2) * cluster_assignment_samples).sum(-1),
        scale=torch.nn.functional.softplus(self.raw_noise).sqrt()
    ).to_event(1)


This is all we need for inference! However, if we want to use this model to make predictions, the cluster_assignment_samples keyword argument will not be passed into the function. Therefore, we need to make sure that forward can handle both inference and predictions:

def forward(self, function_samples, cluster_assignment_samples=None):
    if cluster_assignment_samples is None:
        # We'll get here at prediction time
        # We'll use the variational distribution when making predictions
        cluster_assignment_samples = pyro.sample(
            self.name_prefix + ".cluster_logits", self._cluster_dist(self.variational_cluster_logits)
        )

    return pyro.distributions.Normal(
        loc=(function_samples.unsqueeze(-2) * cluster_assignment_samples).sum(-1),
        scale=torch.nn.functional.softplus(self.raw_noise).sqrt()
    ).to_event(1)

:

class ClusterGaussianLikelihood(gpytorch.likelihoods.Likelihood):
    def __init__(self, num_tasks, num_clusters):
        super().__init__()

        # These are parameters/buffers for the cluster assignment latent variables
        self.register_buffer("prior_cluster_logits", torch.zeros(num_tasks, num_clusters))
        self.register_parameter(
            "variational_cluster_logits", torch.nn.Parameter(torch.randn(num_tasks, num_clusters))
        )

        # The Gaussian observational noise
        self.register_parameter("raw_noise", torch.nn.Parameter(torch.tensor(0.0)))

        # Other info
        self.num_tasks = num_tasks
        self.num_clusters = num_clusters
        self.max_plate_nesting = 1

    def _cluster_dist(self, logits):
        # Helper that wraps cluster logits in a OneHotCategorical distribution
        return pyro.distributions.OneHotCategorical(logits=logits).to_event(1)

    def pyro_guide(self, function_dist, target):
        # Here we add the extra variational distribution for the cluster latent variable
        pyro.sample(
            self.name_prefix + ".cluster_logits",  # self.name_prefix is added by PyroGP
            self._cluster_dist(self.variational_cluster_logits)
        )
        return super().pyro_guide(function_dist, target)

    def pyro_model(self, function_dist, target):
        # Here we add the extra prior distribution for the cluster latent variable
        cluster_assignment_samples = pyro.sample(
            self.name_prefix + ".cluster_logits",  # self.name_prefix is added by PyroGP
            self._cluster_dist(self.prior_cluster_logits)
        )
        return super().pyro_model(function_dist, target, cluster_assignment_samples=cluster_assignment_samples)

    def forward(self, function_samples, cluster_assignment_samples=None):
        # For inference, cluster_assignment_samples will be passed in
        # This bit of code is for when we use the likelihood in the predictive mode
        if cluster_assignment_samples is None:
            cluster_assignment_samples = pyro.sample(
                self.name_prefix + ".cluster_logits", self._cluster_dist(self.variational_cluster_logits)
            )

        # Now we return the observational distribution, based on the function_samples and cluster_assignment_samples
        res = pyro.distributions.Normal(
            loc=(function_samples.unsqueeze(-2) * cluster_assignment_samples).sum(-1),
            scale=torch.nn.functional.softplus(self.raw_noise).sqrt()
        ).to_event(1)
        return res

##### Constructing the PyroGP model¶

The PyroGP model is essentially the same as the model we used in the simple example, except for two changes:

• We will now use our more complicated ClusterGaussianLikelihood.
• The latent function should be vector-valued to correspond to the k latent functions. As a result, we will learn a batched variational distribution, and use an IndependentMultitaskVariationalStrategy to convert the batched variational distribution into a MultitaskMultivariateNormal distribution.
:

import time

class ClusterMultitaskGPModel(gpytorch.models.pyro.PyroGP):
    def __init__(self, train_x, train_y, num_functions=2, reparam=False):
        num_data = train_y.size(-2)

        # Define all the variational stuff
        inducing_points = torch.linspace(0, 1, 64).unsqueeze(-1)
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            num_inducing_points=inducing_points.size(-2),
            batch_shape=torch.Size([num_functions])
        )

        # Here we're using an IndependentMultitaskVariationalStrategy - so that the output of the
        # GP latent function is a MultitaskMultivariateNormal
        variational_strategy = gpytorch.variational.IndependentMultitaskVariationalStrategy(
            gpytorch.variational.VariationalStrategy(self, inducing_points, variational_distribution),
            num_tasks=num_functions,
        )

        # Standard initialization
        likelihood = ClusterGaussianLikelihood(train_y.size(-1), num_functions)
        super().__init__(variational_strategy, likelihood, num_data=num_data, name_prefix=str(time.time()))
        self.likelihood = likelihood
        self.num_functions = num_functions

        # Mean, covar
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        res = gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
        return res


This model can now be used to perform inference on cluster assignments, as well as make predictions using the inferred cluster assignments!
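For example, after training (with the same SVI loop as in the simple regression example), a point estimate of each task’s cluster can be read off the likelihood’s variational cluster logits. A minimal sketch with hypothetical logits for t = 4 tasks and k = 2 clusters (in the real model these are stored as model.likelihood.variational_cluster_logits):

```python
import torch

# Hypothetical trained logits; in the real model: model.likelihood.variational_cluster_logits
variational_cluster_logits = torch.tensor([
    [ 2.1, -1.3],
    [-0.7,  1.9],
    [ 1.5, -0.2],
    [-2.0,  2.4],
])  # t x k: one row of cluster logits per task

# Posterior probability of each cluster for each task, and the MAP cluster assignment
cluster_probs = variational_cluster_logits.softmax(dim=-1)  # rows sum to 1
map_assignment = cluster_probs.argmax(dim=-1)
```

Here the MAP assignment would place tasks 1 and 3 in one cluster and tasks 2 and 4 in the other.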

### Low-level Pyro Interface (for latent function inference)¶

The low-level interface simply provides tools to compute GP latent functions, and requires users to write their own model() and guide() functions. It is best designed for:

• Performing inference on probabilistic models that involve GPs
• Models with complicated likelihoods

The Pyro + GPyTorch Low-Level Introduction gives an overview of the low-level interface. The Cox Process Example is a more in-depth example of a model that can be built using this interface.

#### Latent Function Inference with Pyro + GPyTorch (Low-Level Interface)¶

##### Overview¶

In this example, we will give an overview of the low-level Pyro-GPyTorch integration. The low-level interface makes it possible to write GP models in a Pyro-style – i.e. defining your own model and guide functions.

These are the key differences between the high-level and low-level interface:

High level interface

• Base class is gpytorch.models.PyroGP.
• GPyTorch automatically defines the model and guide functions for Pyro.
• Best used when prediction is the primary goal

Low level interface

• Base class is gpytorch.models.ApproximateGP.
• User defines the model and guide functions for Pyro.
• Best used when inference is the primary goal
:

import math
import torch
import pyro
import tqdm
import gpytorch
import numpy as np
from matplotlib import pyplot as plt

%matplotlib inline


This example uses a GP to infer a latent function $$\lambda(x)$$, which parameterises the exponential distribution:

\[y \sim \text{Exponential} (\lambda),