Last week we ran into a problem running a particular Fortran application on the HPC cluster. Submitting a job that runs the application through the Portable Batch System (PBS) produced the above error. However, running the application directly on a worker node as the same user, without going through job submission, worked perfectly well. This pointed to an environment mismatch. By “environment mismatch” I mean that something differs between the environment PBS is configured with and the standard Linux login environment of a user. So we set out to find what that difference was.
Our first clue in the above error was the word “segmentation”, which generally points to a memory issue. However, there was ample memory available on the worker node executing the process, so a simple shortage was not the cause. We then looked at the user limits set in the environment, and in particular the stack size. Executing “ulimit -a” lists the user limits that apply to the current user environment. The limits can also be set globally in “/etc/security/limits.conf”.
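To compare the two environments side by side, it helps to print the same limits from a login shell and from inside a batch job, then diff the output. The following is a minimal sketch in Python using the standard `resource` module. The choice of limits and the helper names `fmt` and `snapshot` are ours for illustration and are not part of PBS or of the application.

```python
import resource

# Limits worth comparing between a login shell and a batch job.
# Note: getrlimit reports the stack size in bytes, while "ulimit -s"
# in bash reports it in units of 1024 bytes.
LIMITS = {
    "stack": resource.RLIMIT_STACK,
    "data": resource.RLIMIT_DATA,
    "address space": resource.RLIMIT_AS,
    "open files": resource.RLIMIT_NOFILE,
}


def fmt(value):
    """Render a raw limit value; RLIM_INFINITY is shown as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def snapshot():
    """Return {limit name: (soft, hard)} with values as printable strings."""
    out = {}
    for name, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        out[name] = (fmt(soft), fmt(hard))
    return out


if __name__ == "__main__":
    for name, (soft, hard) in sorted(snapshot().items()):
        print(f"{name:15s} soft={soft:>12s} hard={hard:>12s}")
```

Running this once interactively on a worker node and once from a submitted job, with the output saved to two files, gives something that can be compared directly. Any line that differs is a candidate for the mismatch.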
It turned out that the stack size was set to unlimited in the PBS environment, whereas for standard users it was set to the default of “8192” (“ulimit -s” reports this value in units of 1024 bytes, so 8 MB). Having the stack size set to unlimited is actually recommended for most HPC applications. This Fortran application, though, did not cope with an unlimited stack size. We submitted the job via the batch system again, but this time included “ulimit -s 8192” in the job to set the stack size before the application ran, and it completed successfully without segmentation faults.
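The same workaround can be applied from a wrapper instead of from the shell. The sketch below, again in Python, lowers the soft stack limit in the child process just before the application is started, which is the effect “ulimit -s 8192” has in a job script. The function name `run_with_stack_limit` is hypothetical, and the sketch assumes a Linux node where the hard limit is not changed.

```python
import resource
import subprocess
import sys

STACK_KIB = 8192  # the value passed to "ulimit -s" in the job


def run_with_stack_limit(cmd, stack_kib=STACK_KIB):
    """Run cmd with its soft stack limit set to stack_kib KiB."""

    def set_limit():
        # Executed in the child process just before the program starts.
        # Only the soft limit is changed; it may not exceed the hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_STACK)
        soft = stack_kib * 1024  # setrlimit expects bytes
        if hard != resource.RLIM_INFINITY:
            soft = min(soft, hard)
        resource.setrlimit(resource.RLIMIT_STACK, (soft, hard))

    return subprocess.run(
        cmd, preexec_fn=set_limit, capture_output=True, text=True
    )


if __name__ == "__main__":
    # Pass the application and its arguments on the command line.
    result = run_with_stack_limit(sys.argv[1:])
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)
    sys.exit(result.returncode)
```

Saved under a name of your choosing, say `stack_wrapper.py` (a hypothetical file name), it would be called from the job as `python3 stack_wrapper.py ./my_fortran_app`, where `./my_fortran_app` stands in for the real binary. For a one-off job the single “ulimit -s 8192” line is simpler. A wrapper is only worth it when the limit should apply to one program and not to the rest of the job.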
We came to the conclusion that the Fortran application was probably written to expect a numeric value for the stack size from the user limits, and not the string “unlimited”. The application code should really be written to accept both. This is unfortunate, and the developer should be informed.