We noticed a while back that several of our GPU cards retained high utilization even though no processes were running on them.
nvidia-smi
Fri Mar 23 13:52:55 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:04:00.0     Off |                    0 |
| N/A   43C    P0    79W / 235W |      0MiB / 11439MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 0000:82:00.0     Off |                  Off |
| N/A   46C    P0    77W / 235W |      0MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
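To spot this condition without scanning the full table, the same data can be pulled through nvidia-smi's query interface. The snippet below is only a sketch of how one might script the check, assuming the index, utilization.gpu, and ecc.mode.current query fields supported by this driver generation; it flags any GPU reporting utilization while the machine has no compute processes at all:
# Sketch only: flag GPUs that report nonzero utilization even though no
# compute processes are running anywhere on the box.
procs=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | grep -c .)
nvidia-smi --query-gpu=index,utilization.gpu,ecc.mode.current --format=csv,noheader,nounits |
while IFS=',' read -r idx util ecc; do
    if [ "$procs" -eq 0 ] && [ "$((util))" -gt 0 ]; then
        echo "GPU ${idx}: ${util}% utilization with no running processes (ECC mode:${ecc})"
    fi
done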
The fix for this is to disable ECC support on the cards:
nvidia-smi -i 0 -e 0
where -i selects the GPU by its ID and -e 0 turns ECC off. The server needs to be rebooted for the change to take effect. This fix is sticky across reboots, unlike the persistence-mode fix, which has to be reapplied in profile.sh.
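Because persistence mode has to be re-applied on every boot while the ECC setting does not, the two end up in different places. A rough sketch of what the boot-time portion might look like, assuming ECC should be turned off on every card rather than just GPU 0 (the loop below is illustrative, not necessarily what profile.sh contains):
# Sketch of a boot-time snippet (e.g. in profile.sh): persistence mode must be
# re-enabled after every reboot, while the ECC setting below is sticky and only
# matters the first time it is applied.
nvidia-smi -pm 1
for id in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
    nvidia-smi -i "$id" -e 0
done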