We have completed the upgrade to the GPU portion of the hex cluster:
– Installed a new GPU004 server with two NVIDIA K40 cards.
– Added two additional NVIDIA K40 cards to GPU003.
This brings the number of GPU cards in…
The new GPU server, srvslsgpu004, is up and running. Still to be configured are the InfiniBand card and the BGFS volume. The server is being tested and will remain offline until next week. In the server are 2 x 10…
ICTS will be conducting power maintenance in their data centers on Sunday the 26th of June between 09:00 and 17:00. The Bremner data center will be shut down completely, and hence the Hal SLURM cluster will be offline. We will…
The ICTS hex cluster will be down for scheduled maintenance from Monday, January 11th at 09:00 to Tuesday, January 12th at 17:00. The head node, data node and all worker nodes will be patched and rebooted, so all jobs should be canceled…
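If you have jobs on hex, you can clear them ahead of the shutdown with the usual Torque commands; a minimal sketch (the job ID shown is illustrative):

    # list your own jobs on the cluster
    qstat -u $USER

    # cancel a single job by its ID
    qdel 12345

    # or cancel all of your jobs in one go
    qselect -u $USER | xargs qdel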
Co-design is a new approach to HPC infrastructure design, essentially the next level of performance and scale. In a nutshell, co-design is the ability to offload as many CPU-cycle operations as possible to a design whereby synergies are created between…
We have installed two high-memory Dell R820 servers into the 800 series. Each server has four sockets with six 2.4GHz cores each, for 24 cores per server. There are also thirty-two 32GB DIMMs making up 1TB of RAM, although these memory chips …
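Requesting one of these nodes would look roughly like the standard Torque request below. This is a sketch only: the usable memory ceiling (some RAM is reserved for the OS) and whether a dedicated queue applies should be confirmed with the HPC team first.

    # hypothetical request for a full high-memory node: all 24 cores and 900GB of RAM
    qsub -l nodes=1:ppn=24 -l mem=900gb bigmem_job.sh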
Our current cluster, hex, runs Torque with MAUI as the scheduler. While MAUI is GPU-aware, it does not allow GPUs to be scheduled. In other words, you can list the nodes with GPUs, but you cannot submit a job …
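To make the distinction concrete, here is roughly how this plays out on a Torque/MAUI setup (the output shown is illustrative):

    # Torque itself knows about the GPUs and will report them per node:
    pbsnodes srvslsgpu004 | grep gpus
    #   gpus = 2

    # ...but a GPU resource request like the one below, while valid Torque
    # syntax, is not honoured by MAUI, so the job is never placed on a GPU:
    qsub -l nodes=1:gpus=2 gpu_job.sh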
The HPC team is busy testing a prototype SLURM cluster. This will replace both the hex and hpc clusters, although it is likely that the hex infrastructure will be incorporated into the new cluster. The time frame for
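For those who have only used Torque, SLURM job submission will look roughly like this once the new cluster goes live; the partition name below is a placeholder, not a confirmed name:

    #!/bin/bash
    #SBATCH --job-name=hello
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00
    #SBATCH --partition=main   # placeholder; real partition names will be announced

    # trivial test payload: report which worker node we landed on
    hostname

You would submit this with sbatch hello.sh and monitor it with squeue -u $USER.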
We've installed four new 600 series worker nodes on hex. This increases the core count by 256 (64 cores per node), and we hope to add a few more to both the hex and hpc clusters shortly. If you notice anything odd, please let…