The data center where hex will reside is still being worked on. The spider's nest of network cables under the raised floor has been cleared out to allow better air flow. All networking will now run in overhead…
It has been a long time coming, several years in fact, but the first of our new worker nodes went live today on our POC cluster. There is still much to be done: Infiniband is not yet ready, the…
The new GPU nodes have been installed into our POC cluster and are currently running jobs. We installed the servers into our SLURM cluster to give us hands-on experience in provisioning SLURM partitions with GPU resources, something …
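As a rough sketch of what such a GPU partition definition can look like (the node name, CPU/memory figures, and partition name below are hypothetical placeholders, not our actual cluster configuration):

```
# gres.conf on the GPU node: expose the two GPU devices (device paths assumed)
NodeName=gpu003 Name=gpu File=/dev/nvidia[0-1]

# slurm.conf: declare the GRES on the node and group it into a GPU partition
NodeName=gpu003 Gres=gpu:2 CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=gpu Nodes=gpu003 Default=NO MaxTime=72:00:00 State=UP
```

Jobs would then request a card explicitly, e.g. `sbatch --partition=gpu --gres=gpu:1 job.sh`, and SLURM handles allocating a free device to each job.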
We noticed a while back that several of our GPU cards retained high utilization even though no processes were running on them.

nvidia-smi
Fri Mar 23 13:52:55 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name…
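A small script like the following can flag such stuck cards. It parses the CSV form of nvidia-smi output; the sample lines below are illustrative values, not real readings from our nodes, and on a live node the text would come from running nvidia-smi rather than a hard-coded string.

```python
import csv
import io

# Shape of: nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits
# Hypothetical sample data for illustration only.
SAMPLE = """\
0, 99, 11430
1, 0, 2
2, 100, 11430
3, 0, 2
"""

def stuck_gpus(csv_text, util_threshold=90):
    """Return indices of GPUs whose reported utilization meets the threshold."""
    stuck = []
    for row in csv.reader(io.StringIO(csv_text)):
        index, util, _mem_used = (field.strip() for field in row)
        if int(util) >= util_threshold:
            stuck.append(int(index))
    return stuck

print(stuck_gpus(SAMPLE))  # -> [0, 2]
```

Cross-checking the flagged indices against the process list (empty in our case) is what revealed the utilization readings were phantom.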
The BeeGFS cluster was updated to 2015.03.r23. The fhgfs volume is back up and mounted on all nodes. During the Infiniband switch firmware upgrade an error was encountered; we have logged a support call with Mellanox regarding this. The switch …
The BeeGFS (fhgfs) cluster will be offline on Monday 13th March from 09:00 to 17:00 for a major update. Please ensure that all jobs referencing the BeeGFS volume, /researchdata/fhgfs, are completed before 09:00. The firmware on the Mellanox Infiniband switch …
This is an updated entry for the issue we encountered last year upgrading our HPC servers and Infiniband drivers. An updated installation ISO needs to be created so that its driver support matches the newly updated kernel. To create the ISO…
We have completed the upgrade to the GPU portion of the hex cluster:
– Installed a new GPU004 server with two NVIDIA K40 cards.
– Added two additional NVIDIA K40 cards to GPU003.
This brings the number of GPU cards in…