One of our users has been having difficulty migrating his code to our new cluster. We had installed a newer version of Python (2.7.3), and it appears that, in conjunction with mpi4py and OpenMPI, the --mca run-time tuning and debugging argument is ignored. His code calls fork(). MPI discourages this because of possible memory corruption, but that should have no bearing here, as the processing is done by a separately called executable. The exact same code runs fine on our Scientific Linux cluster with OpenMPI 1.4 but dies on our new cluster with OpenMPI 1.6. It also dies on the CHPC SUN cluster, which also uses OpenMPI 1.4, so the version of OpenMPI is not the issue. The job was dying with the following error as soon as fork() was called:
An MPI process has executed an operation involving a call to the "fork()" system call to create a child process. Open MPI is currently operating in a condition that could result in memory corruption or other system errors; your MPI job may hang, crash, or produce silent data corruption. The use of fork() (or system() or other calls that create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: srvslshpc600 (PID 38708)
MPI_COMM_WORLD rank: 0
If you are *absolutely sure* that your application will successfully and correctly survive a call to fork(), you may disable this warning by setting the mpi_warn_on_fork MCA parameter to 0.
Adding --mca mpi_warn_on_fork 0 to the mpirun command line resulted in:
mpiexec has exited due to process rank 0 with PID 47536 on node srvslshpc600 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be terminated by signals sent by mpiexec (as reported here).
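For reference, the mpi_warn_on_fork parameter named in the first error can be set in two ways: with --mca on the mpirun command line, as we did, or through an environment variable of the form OMPI_MCA_<parameter name>, which Open MPI reads at start-up. The Python sketch below only builds the command line and the environment for both variants and does not launch anything. The script name worker.py, the process count and the two helper functions are hypothetical placeholders, not part of the user's code. Note that this setting only silences the warning; it does not make fork() safe, and in our case it did not stop the job from dying.

```python
import os


def mpirun_command(script, nprocs, warn_on_fork=0):
    """Build an mpirun argument list that sets mpi_warn_on_fork via --mca.

    'script' and 'nprocs' are placeholders for the real job.
    """
    return [
        "mpirun",
        "-np", str(nprocs),
        "--mca", "mpi_warn_on_fork", str(warn_on_fork),
        "python", script,
    ]


def mca_environment(warn_on_fork=0, base_env=None):
    """Return a copy of the environment with the same parameter set
    through the OMPI_MCA_<name> environment variable convention."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMPI_MCA_mpi_warn_on_fork"] = str(warn_on_fork)
    return env


if __name__ == "__main__":
    # worker.py and the process count of 4 are hypothetical.
    print(" ".join(mpirun_command("worker.py", 4)))
    print(mca_environment(base_env={})["OMPI_MCA_mpi_warn_on_fork"])
```

The environment-variable form may be worth trying when the command-line argument seems to be ignored, since it does not depend on how mpirun passes its arguments along.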
While it would probably be best for him to recode his software from scratch, we found that the code runs successfully once Python 2.7.2 is installed in place of 2.7.3. We have no idea exactly why, but the answer would probably be obscure and annoying, so we'll leave it at that.