Using Natural Gradient Descent with Variational Models

Overview

In this notebook, we’ll demonstrate how to use natural gradient descent when optimizing variational GPyTorch models. This will be the same as the SVGP regression notebook, except we will be using a different optimizer.

What is natural gradient descent (NGD)?

Without going into too much detail, using SGD or Adam isn’t the best way to optimize the parameters of variational Gaussian distributions. Essentially, SGD takes steps assuming that the loss geometry of the parameters is Euclidean. This is a bad assumption for the parameters of many distributions, especially Gaussians: for example, \(\mathcal N(0, 0.1)\) and \(\mathcal N(0.1, 0.1)\) are very different distributions, while \(\mathcal N(0, 100)\) and \(\mathcal N(0.1, 100)\) are nearly identical, even though each pair is the same Euclidean distance apart in parameter space. See Agustinus Kristiadi’s blog post for more detail.

Instead, it makes more sense to take gradient steps that have been scaled better for the geometry of the variational distribution’s parameters. Specifically, if \(\mathbf m\) and \(\mathbf S\) are the mean and covariance of our variational Gaussian posterior approximation, then we will achieve faster convergence if we take the following steps:

\begin{align} \begin{bmatrix} \mathbf m \\ \mathbf S \end{bmatrix} \leftarrow \begin{bmatrix} \mathbf m \\ \mathbf S \end{bmatrix} - \alpha \mathbf F^{-1} \nabla_{\mathbf m, \mathbf S} \mathcal L \end{align}

where \(\alpha\) is a step size, \(\mathcal L\) is the loss (here, the negative variational ELBO), and \(\mathbf F\) is the Fisher information matrix of the variational distribution. This is known as natural gradient descent, or NGD.
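
To make the rescaling concrete, here is a toy illustration of a single natural-gradient step for a one-dimensional Gaussian \(q = \mathcal N(m, v)\) (this is only a sketch of the idea, not how GPyTorch implements NGD). For this parameterization the Fisher information matrix is \(\mathbf F = \mathrm{diag}(1/v, \, 1/(2v^2))\), so multiplying by \(\mathbf F^{-1}\) simply rescales each gradient component:

import torch

# Variational parameters of a 1D Gaussian q = N(m, v)
m = torch.tensor(0.0, requires_grad=True)   # mean
v = torch.tensor(2.0, requires_grad=True)   # variance

# Toy loss: KL( N(m, v) || N(1, 1) ), minimized at m = 1, v = 1
loss = 0.5 * (v + (m - 1.0) ** 2 - 1.0 - torch.log(v))
loss.backward()

# Natural gradient step: rescale the ordinary gradients by F^{-1} = diag(v, 2 v^2)
alpha = 0.1
with torch.no_grad():
    m -= alpha * v * m.grad
    v -= alpha * 2 * v ** 2 * v.grad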

It turns out that for Gaussian distributions (and, more broadly, for all distributions in the exponential family), there are efficient update equations for NGD. See the following papers for more information:

  • Salimbeni, Hugh, Stefanos Eleftheriadis, and James Hensman. “Natural gradients in practice: Non-conjugate variational inference in Gaussian process models.” AISTATS (2018).
  • Hensman, James, Magnus Rattray, and Neil D. Lawrence. “Fast variational inference in the conjugate exponential family.” NeurIPS (2012).

Jointly optimizing variational parameters/hyperparameters

[1]:
import tqdm.notebook  # provides the tqdm.notebook.tqdm progress bars used in the training loop
import math
import torch
import gpytorch
from matplotlib import pyplot as plt

# Make plots inline
%matplotlib inline

For this example notebook, we’ll be using the elevators UCI dataset. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. We’ll simply split the data, using the first 80% for training and the last 20% for testing.

[2]:
import urllib.request
import os
from scipy.io import loadmat
from math import floor


# this is for running the notebook in our testing framework
smoke_test = ('CI' in os.environ)


if not smoke_test and not os.path.isfile('../elevators.mat'):
    print('Downloading \'elevators\' UCI dataset...')
    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1jhWL3YUHvXIaftia4qeAyDwVxo6j1alk', '../elevators.mat')


if smoke_test:  # this is for running the notebook in our testing framework
    X, y = torch.randn(1000, 3), torch.randn(1000)
else:
    data = torch.Tensor(loadmat('../elevators.mat')['data'])
    X = data[:, :-1]
    X = X - X.min(0)[0]
    X = 2 * (X / X.max(0)[0]) - 1
    y = data[:, -1]


train_n = int(floor(0.8 * len(X)))
train_x = X[:train_n, :].contiguous()
train_y = y[:train_n].contiguous()

test_x = X[train_n:, :].contiguous()
test_y = y[train_n:].contiguous()

if torch.cuda.is_available():
    train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()

The following steps create the dataloader objects. See the SVGP regression notebook for details.

[3]:
from torch.utils.data import TensorDataset, DataLoader
train_dataset = TensorDataset(train_x, train_y)
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

test_dataset = TensorDataset(test_x, test_y)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)

SVGP models for NGD

There are three key differences between NGD-SVGP models and the standard SVGP models from the SVGP regression notebook.

Difference #1: NaturalVariationalDistribution

Rather than using gpytorch.variational.CholeskyVariationalDistribution (or any of the other variational distribution objects), you must use one of the following two objects:

  • gpytorch.variational.NaturalVariationalDistribution (typically faster optimization convergence, but less stable for non-conjugate likelihoods)
  • gpytorch.variational.TrilNaturalVariationalDistribution (typically slower optimization convergence, but more stable for non-conjugate likelihoods)
[4]:
class GPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.NaturalVariationalDistribution(inducing_points.size(0))
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super(GPModel, self).__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


inducing_points = train_x[:500, :]
model = GPModel(inducing_points=inducing_points)
likelihood = gpytorch.likelihoods.GaussianLikelihood()

if torch.cuda.is_available():
    model = model.cuda()
    likelihood = likelihood.cuda()
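
As noted above, TrilNaturalVariationalDistribution is typically the more stable choice for non-conjugate likelihoods. Swapping it in is a one-line change inside the model’s __init__ (everything else in the cell above stays the same):

variational_distribution = gpytorch.variational.TrilNaturalVariationalDistribution(inducing_points.size(0))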

Difference #2: Two optimizers - one for the variational parameters; one for the hyperparameters

NGD steps only update the variational parameters. Therefore, we need two separate optimizers: one for the variational parameters (using NGD) and one for the other hyperparameters (using Adam or whatever you want).

Some things to note about the NGD variational optimizer:

  • You must use ``gpytorch.optim.NGD`` as the variational NGD optimizer! Adaptive gradient algorithms will mess up the natural gradient steps. (Any stochastic optimizer works for the hyperparameters.)
  • Use a large learning rate for the variational optimizer. Typically, 0.1 is a good learning rate.
[5]:
variational_ngd_optimizer = gpytorch.optim.NGD(model.variational_parameters(), num_data=train_y.size(0), lr=0.1)

hyperparameter_optimizer = torch.optim.Adam([
    {'params': model.hyperparameters()},
    {'params': likelihood.parameters()},
], lr=0.01)

Difference #3: The updated training loop

In the training loop, we have to update both optimizers (variational_ngd_optimizer and hyperparameter_optimizer).

[6]:
model.train()
likelihood.train()
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.size(0))

num_epochs = 1 if smoke_test else 4
epochs_iter = tqdm.notebook.tqdm(range(num_epochs), desc="Epoch")
for i in epochs_iter:
    minibatch_iter = tqdm.notebook.tqdm(train_loader, desc="Minibatch", leave=False)

    for x_batch, y_batch in minibatch_iter:
        ### Perform NGD step to optimize variational parameters
        variational_ngd_optimizer.zero_grad()
        hyperparameter_optimizer.zero_grad()
        output = model(x_batch)
        loss = -mll(output, y_batch)
        minibatch_iter.set_postfix(loss=loss.item())
        loss.backward()
        variational_ngd_optimizer.step()
        hyperparameter_optimizer.step()

You could also modify the optimization loop to alternate between the NGD step and the hyperparameter updates:

for x_batch, y_batch in minibatch_iter:
    ### Perform NGD step to optimize variational parameters
    variational_ngd_optimizer.zero_grad()
    output = model(x_batch)
    loss = -mll(output, y_batch)
    minibatch_iter.set_postfix(loss=loss.item())
    loss.backward()
    variational_ngd_optimizer.step()

    ### Perform Adam step to optimize hyperparameters
    hyperparameter_optimizer.zero_grad()
    output = model(x_batch)
    loss = -mll(output, y_batch)
    loss.backward()
    hyperparameter_optimizer.step()

Evaluation

That’s it! This model should converge much faster than the standard SVGP model, and it often achieves better final performance (especially when using more inducing points).

[7]:
model.eval()
likelihood.eval()
means = torch.tensor([0.])
with torch.no_grad():
    for x_batch, y_batch in test_loader:
        preds = model(x_batch)
        means = torch.cat([means, preds.mean.cpu()])
means = means[1:]
print('Test MAE: {}'.format(torch.mean(torch.abs(means - test_y.cpu()))))
Test MAE: 0.07514254748821259