GPU Power! NVIDIA Tesla in Azure VMs

I am building an analytics system that deploys containers on Azure NCasT4_v3-series virtual machines, which are powered by NVIDIA Tesla T4 GPUs and AMD EPYC 7V12 (Rome) CPUs. I deploy the VM from an Azure DevOps pipeline using HashiCorp Packer, and after trying a few approaches I found a very easy way to deploy the VM, the driver, and the CUDA Toolkit, which I will share in this article.

I. Pick the OS from the Azure Marketplace

For this build I selected Ubuntu 20.04 from the Azure Marketplace as the Packer source image.
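
A Marketplace image is usually identified by a URN of the form Publisher:Offer:Sku:Version, while Packer's azure-arm builder takes the same four values as separate image_* keys. The sketch below splits a URN into those keys. The helper name urn_to_packer is hypothetical, and the Ubuntu 20.04 URN in the example is an assumption that should be confirmed against the Marketplace (for example with az vm image list) before use.

```shell
#!/bin/sh
# Split an Azure Marketplace image URN (Publisher:Offer:Sku:Version)
# into the image_* keys used by Packer's azure-arm builder.
# urn_to_packer is a hypothetical helper written for this article.
urn_to_packer() {
    # Run in a subshell so the IFS and globbing changes do not leak to the caller.
    (
        IFS=:
        set -f
        set -- $1
        [ "$#" -eq 4 ] || { echo "expected Publisher:Offer:Sku:Version" >&2; exit 1; }
        printf '"image_publisher": "%s",\n' "$1"
        printf '"image_offer": "%s",\n' "$2"
        printf '"image_sku": "%s",\n' "$3"
        printf '"image_version": "%s"\n' "$4"
    )
}

# Assumed URN for Ubuntu 20.04 LTS; verify it before relying on it.
urn_to_packer "Canonical:0001-com-ubuntu-server-focal:20_04-lts:latest"
```

The printed lines can be pasted into the builder section of the Packer template next to vm_size.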

II. VM Size for Packer

Usually you can build gold images on a smaller machine than the one you will deploy in production. In this case, use the same size so that the driver installs without issue, or at least stay within the same VM series.

"vm_size": "Standard_NC4as_T4_v3"

III. NVIDIA Drivers and CUDA Toolkit

Microsoft’s documentation is outdated: it provides an example for Ubuntu 16.04 and recommends downloading the package with wget. NVIDIA’s docs are also a little dated and recommend wget and running a shell script.

The easiest way for me was to add the Graphics Drivers PPA repo and use apt-get to pull the version I need.

"provisioners": [
    {
        "type": "shell",
        "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
        "inline": [
            "sudo add-apt-repository -y ppa:graphics-drivers/ppa",
            "sudo apt-get update",
            "sudo DEBIAN_FRONTEND=noninteractive apt-get -qq install -y gcc nvidia-driver-470"
        ]
    }
]

If you are not sure which driver version to install, you can run ubuntu-drivers devices to scan the hardware and list the available drivers.
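
As a rough sketch of how that scan could feed a script, the snippet below picks out the package that ubuntu-drivers devices flags as recommended. It assumes the output contains lines of the form "driver : <package> - ... recommended", which is what I have seen on Ubuntu 20.04 but is not a guaranteed format. The function name recommended_driver is hypothetical.

```shell
#!/bin/sh
# Print the package that `ubuntu-drivers devices` marks as recommended.
# Reads the tool's output on stdin, so the parsing can be tried without a GPU.
# Assumes lines shaped like: "driver   : <package> - ... recommended".
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

if command -v ubuntu-drivers >/dev/null 2>&1; then
    pkg=$(ubuntu-drivers devices 2>/dev/null | recommended_driver)
    echo "Recommended driver: ${pkg:-none found}"
else
    # The tool ships in the ubuntu-drivers-common package.
    echo "ubuntu-drivers not found; install ubuntu-drivers-common first" >&2
fi
```

I still pin the version explicitly in the Packer template, so that image builds stay reproducible; the scan is only a way to decide which version to pin.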

After Packer finished, I spun up VMs from the image with Terraform and logged in to check that the driver was installed.
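
One way to run that check from a shell is to query nvidia-smi, which is installed together with the driver. This is a minimal sketch rather than the exact commands I ran. The check_gpu function and its optional argument (an alternative binary name, handy for dry runs) are hypothetical.

```shell
#!/bin/sh
# Report GPU name and driver version, or fail if the driver tooling is absent.
# The optional first argument overrides the binary name (hypothetical test hook).
check_gpu() {
    smi=${1:-nvidia-smi}
    if ! command -v "$smi" >/dev/null 2>&1; then
        echo "$smi not found: the driver is probably not installed" >&2
        return 1
    fi
    # Prints one CSV line per GPU: name, driver version.
    "$smi" --query-gpu=name,driver_version --format=csv,noheader
}

check_gpu || echo "GPU check failed" >&2
```

On a correctly built image this should print one line per GPU with the Tesla T4 name and the installed driver version.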

Sources:

NVIDIA Install Tesla: https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html

Azure GPU VM Driver installation: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup