i. Prerequisites

  1. Make sure that the server / node has an InfiniBand NIC (Mellanox/NVIDIA)
    1. You can check this with lspci -v | grep "Mellanox" or lspci -v | grep -i "infiniband"
  2. Make sure the InfiniBand NIC on the node is physically connected to an InfiniBand switch
    1. There is an IB switch Mellanox SB 7800, on Row 10 in CCB 247
    2. If you are connecting it to a different switch, ensure that the switch is configured. (I’ve found it better to use a managed switch like the SB 7800 because it has its own Subnet Manager, so you can skip the OpenSM setup.)
  3. Ensure that the speeds of Node NIC ←→ Cable and Cable ←→ Switch are compatible (all EDR, all HDR etc.). If they differ, the slowest of them WILL be the bottleneck for speed
    1. In Row 10, the switch supports EDR, the firefly and SysML nodes support EDR and we got cables from PACE that support EDR
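
The NIC check from step 1 can be wrapped in a small script. This is a minimal sketch: ib_devices is a hypothetical helper name, and the script assumes lspci (from the pciutils package) is installed on the node.

```shell
#!/bin/sh
# Sketch: filter lspci-style output for Mellanox / InfiniBand devices.
# ib_devices is a hypothetical helper name, not part of any existing tool.
ib_devices() {
    # Reads a device listing on stdin and prints only the matching lines.
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -Ei 'mellanox|infiniband' || true
}

if command -v lspci >/dev/null 2>&1; then
    found="$(lspci | ib_devices)"
    if [ -n "$found" ]; then
        echo "InfiniBand-capable device(s) found:"
        echo "$found"
    else
        echo "No Mellanox/InfiniBand device reported by lspci"
    fi
else
    echo "lspci not available (assumed to come from the pciutils package)"
fi
```

An empty result only means the PCI listing shows no matching device. It says nothing about cabling or the switch.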

ii. Switch OS and MLNX-OFED Driver

It is important for the Switch OS to be upgraded so that it supports EDR. The MLNX OS on the switch and the MLNX-OFED driver have to be compatible.

Note: It’s unclear what the best way to install the MLNX-OFED driver on the node is. Following the steps from https://docs.nvidia.com/networking/display/MLNXOFEDv461000/Installing+Mellanox+OFED using the file downloaded from https://developer.nvidia.com/networking/infiniband-software causes problems with the ETHERNET interface. But installing MLNX-OFED driver SEEMS TO BE crucial to the complete configuration of Infiniband!

Upgrading the MLNX OS on the Switch

  1. Ensure that the Switch OS / Firmware and the MLNX-OFED driver are compatible (Unverified!)
    1. If the Switch OS and the driver are not compatible, EDR speeds will not be realised even if all components are EDR
    2. [DON’T DO THIS STEP] Upgrade the Switch MLNX-OS by following steps here:
      1. https://enterprise-support.nvidia.com/s/article/howto-upgrade-switch-os-software-on-mellanox-switch-systems
      2. Get OS file from: https://network.nvidia.com/support/firmware/lenovo-intelligent-cluster/

Installing MLNX-OFED driver on the node

  1. [DON’T DO THIS STEP] Install the MLNX OFED driver on the node (Unverified!)
    1. NOTE: Installing the OFED driver breaks the Ethernet interface. Don’t do this step until it’s fixed. If the driver is already installed, the problem can be fixed by uninstalling it: just run ofed_uninstall.sh
    2. Get the driver for your OS from https://developer.nvidia.com/networking/infiniband-software
    3. Download the file onto your server via wget. Accepting the terms starts an automatic download to your laptop; go to the browser’s downloads window, copy the download link and then wget it on the server
    4. Mount the driver ISO file. Example: mount -o ro,loop MLNX_OFED_LINUX-23.04-1.1.3.0-ubuntu20.04-x86_64.iso /mnt
    5. Install the driver: /mnt/mlnxofedinstall --without-dkms --add-kernel-support --kernel <kernel_version> --without-fw-update --force
    6. Get the <kernel_version> used in step 5 with uname -r
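
To show how the kernel version from uname -r slots into the install command, here is a dry-run sketch. It only prints the mount and install commands and executes neither, because this step is marked as unverified above. ofed_install_cmd, ISO and MNT are hypothetical names introduced for illustration; the ISO file name is the example from step 4.

```shell
#!/bin/sh
# Dry-run sketch: print the mount and install commands with the kernel
# version filled in. Nothing is mounted or installed by this script.
ISO="${ISO:-MLNX_OFED_LINUX-23.04-1.1.3.0-ubuntu20.04-x86_64.iso}"  # example name from step 4
MNT="${MNT:-/mnt}"                                                   # mount point from step 4

# ofed_install_cmd is a hypothetical helper: $1 = kernel version, $2 = mount point.
ofed_install_cmd() {
    printf '%s/mlnxofedinstall --without-dkms --add-kernel-support --kernel %s --without-fw-update --force\n' "$2" "$1"
}

KVER="$(uname -r)"   # kernel version of the running system
echo "mount -o ro,loop $ISO $MNT"
ofed_install_cmd "$KVER" "$MNT"
```

Reviewing the printed commands first makes it easy to confirm that the kernel version and ISO name match the node.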

iii. Configuring Infiniband on the node

Preliminary checks

  1. Double check that cables from the node are connected to the IB switch
  2. Make sure that the IB switch is configured and has a Subnet Manager running
    1. If the switch doesn’t have a Subnet Manager running on it, run OpenSM on one of the nodes (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/configuring_infiniband_and_rdma_networks/index#configuring-an-infiniband-subnet-manager_configuring-infiniband-and-rdma-networks)
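
One way to check both the link and the Subnet Manager from the node is to look at the port fields that ibstat prints. The sketch below assumes ibstat (from the infiniband-diags package) is installed; port_summary is a hypothetical helper name. A port in state Active has normally been brought up by a Subnet Manager, while a port stuck in Initializing usually means the link is up but no Subnet Manager has configured it. A rate of 100 corresponds to a 4x EDR link.

```shell
#!/bin/sh
# Sketch: summarise port state, rate and SM LID from ibstat-style output.
# port_summary is a hypothetical helper; State, Rate and SM lid are the
# field names ibstat prints for each port.
port_summary() {
    awk -F': *' '
        /^[[:space:]]*State:/  { state = $2 }   # logical state; does not match "Physical state:"
        /^[[:space:]]*Rate:/   { rate = $2 }    # link rate in Gb/s
        /^[[:space:]]*SM lid:/ { printf "state=%s rate=%s sm_lid=%s\n", state, rate, $2 }
    '
}

if command -v ibstat >/dev/null 2>&1; then
    ibstat | port_summary
else
    echo "ibstat not found (it ships in the infiniband-diags package)"
fi
```

One line is printed per port, in the order ibstat lists them.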

Install relevant OS packages

apt install infiniband-diags ibverbs-utils mstflint
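
After the install, a quick check confirms the tools are on the PATH. This is a sketch: missing_tools is a hypothetical helper, and the three commands checked are one representative binary from each package (ibstat from infiniband-diags, ibv_devinfo from ibverbs-utils, mstflint from mstflint).

```shell
#!/bin/sh
# Sketch: report which of the expected commands are not on PATH.
# missing_tools is a hypothetical helper; it prints one missing name per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools ibstat ibv_devinfo mstflint)"
if [ -z "$missing" ]; then
    echo "All expected InfiniBand tools are installed"
else
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
fi
```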

Detecting IB interfaces