Memory issues

Jul 2, 2011 | hpc, operating system

We're currently investigating a memory issue with some of the worker nodes: memory does not appear to be freed after jobs complete.

UPDATE - 4 July: It turns out this is not a memory error. The problem is the way that net-snmp monitors and reports memory information. Unfortunately net-snmp lumps real and cached memory together, so over time it looks as if available memory drops to zero. See this post for more info. Some jiggery-pokery may be required to fix this.

UPDATE - 5 July: The issue is now resolved. By creating a custom SNMP OID backed by a Perl script that reads its information straight from /proc, we get a far better idea of the memory actually in use. This is displayed on the dashboard and also tracked in real time by our monitoring systems.

New memory display
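To illustrate the difference between the two readings, here is a minimal sketch of the calculation, written in Python rather than the Perl we actually used. It assumes a Linux /proc/meminfo with the usual MemTotal, MemFree, Buffers and Cached fields; the function names are made up for this example, and the wiring into snmpd is not shown.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def naive_in_use_kb(info):
    """Counts buffers and page cache as used, so it creeps towards MemTotal."""
    return info["MemTotal"] - info["MemFree"]


def memory_in_use_kb(info):
    """Memory held by processes, excluding reclaimable buffers and page cache."""
    return (info["MemTotal"] - info["MemFree"]
            - info.get("Buffers", 0) - info.get("Cached", 0))


if __name__ == "__main__":
    # Assumption: running on a Linux node where /proc/meminfo is readable.
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print("naive in use (kB):", naive_in_use_kb(info))
    print("actual in use (kB):", memory_in_use_kb(info))
```

On a node that has been busy for a while the naive figure sits close to total memory, while the second figure falls back once jobs finish, because the kernel gives up cached pages when applications need them.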


UCTHPC