i. Prerequisites
- Make sure that the server / node has an Infiniband NIC (Mellanox/NVIDIA)
- You should be able to check this with
lspci -v | grep "Mellanox"
or
lspci -v | grep -i "infiniband"
- Make sure the Infiniband NIC from the node is physically connected to an Infiniband Switch
- There is a Mellanox SB 7800 IB switch on Row 10 in CCB 247
- If you are connecting it to a different switch, ensure that the switch is configured. (I’ve found it better to use a managed switch like the SB 7800 because it has its own Subnet Manager, so you can skip the OpenSM setup.)
- Ensure that the speeds of Node NIC ←→ Cable and Cable ←→ Switch are compatible (all EDR, all HDR etc.). If they’re not, the slowest component WILL be the bottleneck for the whole link
- In Row 10, the switch supports EDR, the firefly and SysML nodes support EDR and we got cables from PACE that support EDR
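The speed-matching rule above can be sketched as a small shell helper. This is only an illustration: the function names ib_rate_gbps and ib_bottleneck are made up for this note, and the rates are the nominal 4x link rates per generation, not measured throughput.

```shell
# Sketch: find the slowest component on a NIC <-> cable <-> switch path.
# ib_rate_gbps and ib_bottleneck are hypothetical helper names.

# Map an InfiniBand generation name to its nominal 4x link rate in Gb/s.
ib_rate_gbps() {
    case "$1" in
        SDR) echo 10 ;;
        DDR) echo 20 ;;
        QDR) echo 40 ;;
        FDR) echo 56 ;;
        EDR) echo 100 ;;
        HDR) echo 200 ;;
        NDR) echo 400 ;;
        *)   echo "unknown generation: $1" >&2; return 1 ;;
    esac
}

# Print the generation and rate of the slowest component among the arguments.
ib_bottleneck() {
    local gen rate slowest slowest_rate
    slowest=""
    slowest_rate=""
    for gen in "$@"; do
        rate=$(ib_rate_gbps "$gen") || return 1
        if [ -z "$slowest" ] || [ "$rate" -lt "$slowest_rate" ]; then
            slowest=$gen
            slowest_rate=$rate
        fi
    done
    echo "$slowest $slowest_rate"
}

# Example: EDR NIC, FDR cable, EDR switch. The cable caps the link.
ib_bottleneck EDR FDR EDR   # prints "FDR 56"
```

With the Row 10 setup (all EDR), the same call with EDR EDR EDR reports "EDR 100", i.e. no component holds the link back.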
ii. Switch OS and MLNX-OFED Driver
It is important for the Switch OS to be upgraded so that it supports EDR. The MLNX OS on the switch and the MLNX-OFED driver have to be compatible.
Note: It’s unclear what the best way to install the MLNX-OFED driver on the node is. Following the steps from https://docs.nvidia.com/networking/display/MLNXOFEDv461000/Installing+Mellanox+OFED using the file downloaded from https://developer.nvidia.com/networking/infiniband-software causes problems with the ETHERNET interface. But installing MLNX-OFED driver SEEMS TO BE crucial to the complete configuration of Infiniband!
Upgrading the MLNX OS on the Switch
- Ensure that the Switch OS / Firmware and the MLNX-OFED driver are compatible (Unverified!)
- If the Switch OS and driver are not compatible, EDR speeds will not be realised even if all components are EDR
- [DON’T DO THIS STEP] Upgrade the Switch MLNX-OS by following steps here:
- https://enterprise-support.nvidia.com/s/article/howto-upgrade-switch-os-software-on-mellanox-switch-systems
- Get OS file from: https://network.nvidia.com/support/firmware/lenovo-intelligent-cluster/
Installing MLNX-OFED driver on the node
- [DON’T DO THIS STEP] Install the MLNX OFED driver on the node (Unverified!)
- NOTE: Installing the OFED driver breaks Ethernet. Don’t do this step until that’s fixed. If the driver is already installed, the Ethernet problem can be fixed by uninstalling it. Just run
ofed_uninstall.sh
- Get the driver for your OS from https://developer.nvidia.com/networking/infiniband-software
- Download the file onto your server via wget
- An automatic download to your laptop will start once you accept the terms. Go to the downloads window, copy the download link, and then wget it
- Mount the driver iso file. Example
mount -o ro,loop MLNX_OFED_LINUX-23.04-1.1.3.0-ubuntu20.04-x86_64.iso /mnt
- Install the driver
/mnt/mlnxofedinstall --without-dkms --add-kernel-support --kernel <kernel_version> --without-fw-update --force
- Get <kernel_version> for the install command above using
uname -r
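To show how the kernel version slots into the install command, here is a minimal sketch that only prints the command and does not run it, in keeping with the warning above. The helper name build_ofed_install_cmd is hypothetical, and the flags are copied unchanged from the install step.

```shell
# Sketch: print (do NOT run) the MLNX-OFED install command for a given kernel.
# build_ofed_install_cmd is a hypothetical helper name.
build_ofed_install_cmd() {
    # $1: mount point of the driver ISO (defaults to /mnt)
    # $2: kernel version (defaults to the running kernel from uname -r)
    local mnt kver
    mnt=${1:-/mnt}
    kver=${2:-$(uname -r)}
    printf '%s/mlnxofedinstall --without-dkms --add-kernel-support --kernel %s --without-fw-update --force\n' "$mnt" "$kver"
}

# Print the command for the running kernel so it can be reviewed first.
build_ofed_install_cmd /mnt
```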
iii. Configuring Infiniband on the node
Preliminary checks
- Double check that cables from the node are connected to the IB switch
- Make sure that the IB switch is configured and has a Subnet Manager running
- If the switch doesn’t have a Subnet Manager running on it, run OpenSM on one of the nodes (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/configuring_infiniband_and_rdma_networks/index#configuring-an-infiniband-subnet-manager_configuring-infiniband-and-rdma-networks)
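One way to sanity-check the Subnet Manager requirement from the node is to look at the port state reported by ibstat. The sketch below assumes infiniband-diags is installed (see the package step that follows) and that the port's link layer is Infiniband; the helper names port_states and sm_has_configured_port are made up for this note. A port normally sits in Initializing until a Subnet Manager configures it, and only then becomes Active.

```shell
# Sketch: decide from ibstat output whether a Subnet Manager has configured a port.
# port_states and sm_has_configured_port are hypothetical helper names.

# Read ibstat-style text on stdin and print each port's logical state, one per line.
# Lines such as "Physical state: LinkUp" are ignored because their first field is "Physical".
port_states() {
    awk '$1 == "State:" { print $2 }'
}

# Succeed if at least one port on stdin is in the Active state.
sm_has_configured_port() {
    port_states | grep -qx 'Active'
}

if command -v ibstat >/dev/null 2>&1; then
    if ibstat | sm_has_configured_port; then
        echo "At least one port is Active: a Subnet Manager has configured it"
    else
        echo "No Active port: check cabling, or start a Subnet Manager (switch or OpenSM)"
    fi
else
    echo "ibstat not found: install infiniband-diags first"
fi
```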
Install relevant OS packages
apt install infiniband-diags ibverbs-utils mstflint
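A quick way to confirm the packages landed is to check that one tool from each is on the PATH. This is a rough sketch; missing_tools is a hypothetical helper name, and the tool-to-package mapping in the comment reflects the usual Ubuntu packaging.

```shell
# Sketch: list which of the given commands are not on the PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# ibstat ships with infiniband-diags, ibv_devinfo with ibverbs-utils,
# and mstflint with the mstflint package.
missing_tools ibstat ibv_devinfo mstflint
```

If nothing is printed, all three tools were found.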
Detecting IB interfaces