For the past couple of days we have been busy setting up and configuring two 2U SuperMicro GPU servers, each with four Nvidia Fermi M2090 cards. Alongside that, we installed a Mellanox InfiniBand SX6036 externally managed switch, with a Mellanox ConnectX-3 VPI HCA in each server. The InfiniBand drivers came from the OpenFabrics distribution OFED 126.96.36.199, which supports Linux kernel v3.
I would like to document my findings about setting up OFED on both Scientific Linux 5.8 and SLES 11.
Installation of OFED under SL 5.8 (RHEL 5.8)
This was by far the most time-consuming troubleshooting exercise I have carried out to date. Begin the installation by executing the ./install script and selecting the preferred option. The first packages to compile are kernel-ib and kernel-devel, and these will break. The build should terminate with the error displayed below.
pci.h:164: error: conflicting types for 'pci_pcie_cap'
OpenFabrics does not have a fix for it on their website, from what we could see, so we decided to tinker around a bit. We found a developer, buried deep in the Google search results, who had made modifications to the source RPM of the OFED-188.8.131.52 OFA kernel. The fix is as follows:
1. tar xfvz OFED-184.108.40.206.tgz
2. Change directory into the extracted OFED-220.127.116.11/SRPMS
3. Download the modified version from http://troels.arvin.dk/code/ofa_kernel/rhel5.8-enablement/modified-ofa_kernel-18.104.22.168-OFED.22.214.171.124.src.rpm and rename it from modified-ofa_kernel-126.96.36.199-OFED.188.8.131.52.src.rpm to ofa_kernel-184.108.40.206-OFED.220.127.116.11.src.rpm. Do not forget to back up the existing file first, although you can always retrieve it again from the tar.gz.
4. Re-run the install script and you should be good.
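Steps 2 and 3 can be wrapped in a small shell helper so the stock source RPM is never overwritten without a backup. This is only a rough sketch: swap_srpm is a hypothetical function name of my own, and the directory and file names are arguments you supply yourself, not something OFED provides.

```shell
#!/bin/sh
# Hypothetical helper: swap a patched ofa_kernel source RPM into OFED's SRPMS directory.
# Arguments (all supplied by the caller):
#   $1 = path to the extracted OFED SRPMS directory
#   $2 = path to the downloaded modified-ofa_kernel source RPM
#   $3 = file name the installer expects (the stock ofa_kernel source RPM name)
swap_srpm() {
    srpms_dir=$1
    modified=$2
    stock_name=$3

    # Refuse to continue if the inputs are not where we expect them.
    [ -d "$srpms_dir" ] || { echo "no such directory: $srpms_dir" >&2; return 1; }
    [ -f "$modified" ] || { echo "no such file: $modified" >&2; return 1; }

    # Keep a backup of the stock source RPM if one is present.
    if [ -f "$srpms_dir/$stock_name" ]; then
        cp -p "$srpms_dir/$stock_name" "$srpms_dir/$stock_name.orig" || return 1
    fi

    # Put the modified package in place under the stock name.
    cp -p "$modified" "$srpms_dir/$stock_name"
}
```

Call it with the SRPMS directory, the downloaded file and the stock file name, then re-run the install script as in step 4. The backup is written next to the original with an .orig suffix.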
You might also stumble across a compile error in the iWARP networking modules “cxgb3” and “cxgb4”. I do not have these devices, so I can stop them from being compiled and causing the error. Disable them in ofed.conf. This file is created during the first run of the installation script; you can also find a sample copy of it in the docs/ folder. Edit the file, find each driver in the list and set its value to “n”. Then re-run the installation script.
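If you would rather not edit the file by hand, the change can be scripted. The sketch below assumes ofed.conf holds one name=value entry per line (for example cxgb3=y), so check your own file first. disable_components is a hypothetical helper name of my own, not part of OFED.

```shell
#!/bin/sh
# Hypothetical helper: set "<name>=n" for each named component in an
# ofed.conf-style file. Assumes one "name=value" entry per line.
#   $1     = path to the config file
#   $2...  = component names to disable
disable_components() {
    conf=$1
    shift

    # Keep a copy of the file as it was before editing.
    cp -p "$conf" "$conf.bak" || return 1

    for name in "$@"; do
        # Rewrite only the line that starts with exactly "<name>=".
        sed "s/^${name}=.*/${name}=n/" "$conf" > "$conf.tmp" || return 1
        mv "$conf.tmp" "$conf" || return 1
    done
}
```

For the case above that would be: disable_components ofed.conf cxgb3 cxgb4. Entries that are not named are left untouched, and the previous version of the file is kept as ofed.conf.bak.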
The above issue with the “cxgb3” and “cxgb4” modules occurred on SLES 11 as well. Do not disable kernel-ib and kernel-devel: without them you will not get all of the init scripts and the correct kernel modules.
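A quick sanity check before re-running the installer can catch an accidental “n” on those two entries. Again this is only a sketch: it assumes ofed.conf uses one name=value entry per line, and check_not_disabled is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: return non-zero if any named component is set to "n".
# Assumes one "name=value" entry per line in the config file.
#   $1     = path to the config file
#   $2...  = component names that must not be disabled
check_not_disabled() {
    conf=$1
    shift
    status=0

    for name in "$@"; do
        # Match the whole line so that "kernel-ib=n" is caught exactly.
        if grep -q "^${name}=n\$" "$conf"; then
            echo "warning: ${name} is disabled in ${conf}" >&2
            status=1
        fi
    done

    return "$status"
}
```

Usage: check_not_disabled ofed.conf kernel-ib kernel-devel. It prints a warning for each entry that is set to “n” and returns a non-zero status if it finds any.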