MPI socket vs core affinity
When running an MPI job you can bind your threads either to a socket or to individual cores. There is no single best choice; it depends entirely on your task and the way it uses memory. For instance, binding each thread to its own core can speed up a task by keeping that core's L1/L2 cache warm, while binding threads to a socket lets them share the socket's L3 cache. Doing this in MAUI (as installed on hex) and in SLURM requires different approaches.
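Before choosing a binding it can help to inspect the node's actual topology. The commands below are a generic sketch, assuming lscpu and hwloc's lstopo-no-graphics are available on the compute node (hwloc is bundled with recent Open MPI builds); they are not specific to either cluster.

# Sockets, cores per socket and cache sizes as reported by the OS
lscpu | grep -E 'Socket|Core|L1d|L2|L3'
# Text view of the socket -> core -> cache hierarchy
lstopo-no-graphics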
HEX
On the hex cluster, which runs Torque and MAUI, one calls mpirun directly from the job script. The default affinity is set in /usr/mpi/gcc/openmpi-1.6.5/etc/openmpi-mca-params.conf. On hex there is no affinity setting, so threads run wherever the scheduler decides to place them; however, a user can specify affinity with the mpirun command. In the example below a user requests 20 cores with core affinity. MPI and Torque assign 20 individual cores, which requires spanning two sockets since each socket has only 16 cores:
mpirun -machinefile host -np 20 --bind-to-core --report-bindings myprog.o
[srvslshpc610:57772] MCW rank 16 bound to socket 1[core 0]: [................][B...............][................][................]
[srvslshpc610:57772] MCW rank 17 bound to socket 1[core 1]: [................][.B..............][................][................]
[srvslshpc610:57772] MCW rank 18 bound to socket 1[core 2]: [................][..B.............][................][................]
[srvslshpc610:57772] MCW rank 19 bound to socket 1[core 3]: [................][...B............][................][................]
[srvslshpc610:57772] MCW rank 0 bound to socket 0[core 0]: [B...............][................][................][................]
[srvslshpc610:57772] MCW rank 1 bound to socket 0[core 1]: [.B..............][................][................][................]
[srvslshpc610:57772] MCW rank 2 bound to socket 0[core 2]: [..B.............][................][................][................]
[srvslshpc610:57772] MCW rank 3 bound to socket 0[core 3]: [...B............][................][................][................]
[srvslshpc610:57772] MCW rank 4 bound to socket 0[core 4]: [....B...........][................][................][................]
[srvslshpc610:57772] MCW rank 5 bound to socket 0[core 5]: [.....B..........][................][................][................]
[srvslshpc610:57772] MCW rank 6 bound to socket 0[core 6]: [......B.........][................][................][................]
[srvslshpc610:57772] MCW rank 7 bound to socket 0[core 7]: [.......B........][................][................][................]
[srvslshpc610:57772] MCW rank 8 bound to socket 0[core 8]: [........B.......][................][................][................]
[srvslshpc610:57772] MCW rank 9 bound to socket 0[core 9]: [.........B......][................][................][................]
[srvslshpc610:57772] MCW rank 10 bound to socket 0[core 10]: [..........B.....][................][................][................]
[srvslshpc610:57772] MCW rank 11 bound to socket 0[core 11]: [...........B....][................][................][................]
[srvslshpc610:57772] MCW rank 12 bound to socket 0[core 12]: [............B...][................][................][................]
[srvslshpc610:57772] MCW rank 13 bound to socket 0[core 13]: [.............B..][................][................][................]
[srvslshpc610:57772] MCW rank 14 bound to socket 0[core 14]: [..............B.][................][................][................]
[srvslshpc610:57772] MCW rank 15 bound to socket 0[core 15]: [...............B][................][................][................]
Or, more realistically, on a node that is already partly occupied by other processes:
[srvslshpc605:39122] MCW rank 4 bound to socket 0[core 4]: [....B.][..............][][]
[srvslshpc605:39122] MCW rank 5 bound to socket 0[core 5]: [.....B][..............][][]
[srvslshpc605:39122] MCW rank 6 bound to socket 1[core 0]: [......][B.............][][]
[srvslshpc605:39122] MCW rank 7 bound to socket 1[core 1]: [......][.B............][][]
[srvslshpc605:39122] MCW rank 8 bound to socket 1[core 2]: [......][..B...........][][]
[srvslshpc605:39122] MCW rank 9 bound to socket 1[core 3]: [......][...B..........][][]
[srvslshpc605:39122] MCW rank 10 bound to socket 1[core 4]: [......][....B.........][][]
[srvslshpc605:39122] MCW rank 11 bound to socket 1[core 5]: [......][.....B........][][]
[srvslshpc605:39122] MCW rank 12 bound to socket 1[core 6]: [......][......B.......][][]
[srvslshpc605:39122] MCW rank 13 bound to socket 1[core 7]: [......][.......B......][][]
[srvslshpc605:39122] MCW rank 14 bound to socket 1[core 8]: [......][........B.....][][]
[srvslshpc605:39122] MCW rank 15 bound to socket 1[core 9]: [......][.........B....][][]
[srvslshpc605:39122] MCW rank 16 bound to socket 1[core 10]: [......][..........B...][][]
[srvslshpc605:39122] MCW rank 17 bound to socket 1[core 11]: [......][...........B..][][]
[srvslshpc605:39122] MCW rank 18 bound to socket 1[core 12]: [......][............B.][][]
[srvslshpc605:39122] MCW rank 19 bound to socket 1[core 13]: [......][.............B][][]
[srvslshpc605:39122] MCW rank 0 bound to socket 0[core 0]: [B.....][..............][][]
[srvslshpc605:39122] MCW rank 1 bound to socket 0[core 1]: [.B....][..............][][]
[srvslshpc605:39122] MCW rank 2 bound to socket 0[core 2]: [..B...][..............][][]
[srvslshpc605:39122] MCW rank 3 bound to socket 0[core 3]: [...B..][..............][][]
In the following example a user requests 20 cores with socket affinity. MPI and Torque assign two sockets and share the threads across all 32 cores:
mpirun -machinefile host -np 20 --bind-to-socket --report-bindings myprog.o
[srvslshpc610:57878] MCW rank 16 bound to socket 1[core 0-15]: [................][BBBBBBBBBBBBBBBB][................][................]
[srvslshpc610:57878] MCW rank 17 bound to socket 1[core 0-15]: [................][BBBBBBBBBBBBBBBB][................][................]
[srvslshpc610:57878] MCW rank 18 bound to socket 1[core 0-15]: [................][BBBBBBBBBBBBBBBB][................][................]
[srvslshpc610:57878] MCW rank 19 bound to socket 1[core 0-15]: [................][BBBBBBBBBBBBBBBB][................][................]
[srvslshpc610:57878] MCW rank 0 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 1 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 2 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 3 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 4 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 5 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 6 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 7 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 8 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 9 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 10 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 11 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 12 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 13 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 14 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
[srvslshpc610:57878] MCW rank 15 bound to socket 0[core 0-15]: [BBBBBBBBBBBBBBBB][................][................][................]
This will fail if there are no free sockets: if the scheduler has already assigned even one thread to every socket, the MPI request cannot be satisfied.
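For reference, a minimal Torque submission script for hex along these lines might look like the sketch below. The job name and resource request are assumptions, and $PBS_NODEFILE stands in for the hand-written machinefile used in the examples above.

#!/bin/bash
#PBS -N bind-test
#PBS -l nodes=1:ppn=20
cd $PBS_O_WORKDIR
# Bind each of the 20 ranks to its own core and print the placement
mpirun -machinefile $PBS_NODEFILE -np 20 --bind-to-core --report-bindings ./myprog.o
# For socket affinity use --bind-to-socket instead of --bind-to-core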
There is a good Q&A on this topic on StackOverflow, although it refers to a newer version of Open MPI (1.7) than the 1.6.5 we are running on hex.
SLURM
On the SLURM cluster one launches MPI programs via the srun command rather than calling mpirun directly. The default affinity is set in /usr/local/etc/openmpi-mca-params.conf. As on hex there is no affinity setting, so threads run wherever the scheduler decides to place them; however, a user can specify affinity with the srun command. Because srun has no equivalent of --report-bindings, and SLURM's --cpu_bind=verbose output is a bit cryptic, we have used Open MPI extensions to enhance the reporting: the ompi_bound and current_binding strings are obtained from OMPI_Affinity_str(), which requires including mpi-ext.h, and Open MPI must have been compiled with extensions enabled (--enable-mpi-ext).
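A rough sketch of the build steps, assuming you are building Open MPI yourself and that myprog.c contains the OMPI_Affinity_str() reporting calls described above; the install prefix is a placeholder, and the unusual executable name myprog.o is kept only to match the examples below.

# Configure and build Open MPI with the MPI extensions enabled
./configure --prefix=$HOME/openmpi --enable-mpi-ext
make all install

# Compile the reporting program; mpi-ext.h is found via the mpicc wrapper
mpicc -o myprog.o myprog.c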
In the example below a user requests 6 cores with core affinity. SLURM assigns 6 individual cores in a cyclic fashion, alternating between the two sockets:
srun -A maths --partition=ucthi --time=500:00 --nodes=1 --ntasks=6 --cpu_bind=verbose --cpu_bind=core --distribution=block:cyclic myprog.o
cpu_bind=MASK - hpc400, task 0 0 [23267]: mask 0x1 set
cpu_bind=MASK - hpc400, task 1 1 [23268]: mask 0x2 set
cpu_bind=MASK - hpc400, task 3 3 [23270]: mask 0x8 set
cpu_bind=MASK - hpc400, task 2 2 [23269]: mask 0x4 set
cpu_bind=MASK - hpc400, task 5 5 [23272]: mask 0x20 set
cpu_bind=MASK - hpc400, task 4 4 [23271]: mask 0x10 set
==================================================================
rank 1: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 1[core 6[hwt 0]]
Process 1 on hpc400 out of 6.
rank 2: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 1[hwt 0]]
Process 2 on hpc400 out of 6.
rank 3: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 1[core 7[hwt 0]]
Process 3 on hpc400 out of 6.
rank 4: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 2[hwt 0]]
Process 4 on hpc400 out of 6.
rank 0: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]]
Process 0 on hpc400 out of 6.
rank 5: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 1[core 8[hwt 0]]
Process 5 on hpc400 out of 6.
The equivalent --report-bindings output would be:
rank 1 bound to socket 1[core 6]: [......][B.....]
rank 2 bound to socket 0[core 1]: [.B....][......]
rank 3 bound to socket 1[core 7]: [......][.B....]
rank 4 bound to socket 0[core 2]: [..B...][......]
rank 0 bound to socket 0[core 0]: [B.....][......]
rank 5 bound to socket 1[core 8]: [......][..B...]
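The same core-bound run can also be submitted as a batch job rather than an interactive srun. A minimal sketch, reusing the account, partition and binding flags from the example above (script and program names are placeholders):

#!/bin/bash
#SBATCH --account=maths
#SBATCH --partition=ucthi
#SBATCH --time=500:00
#SBATCH --nodes=1
#SBATCH --ntasks=6
# Bind each task to its own core, cycling across the two sockets
srun --cpu_bind=verbose --cpu_bind=core --distribution=block:cyclic ./myprog.o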
In the example below a user requests 6 cores with socket affinity. SLURM binds every task to the same set of 6 cores on a single socket:
srun -A maths --partition=ucthi --time=500:00 --nodes=1 --ntasks=6 --cpu_bind=verbose --cpu_bind=socket --distribution=block:block myprog.o
cpu_bind=MASK - hpc400, task 0 0 [23322]: mask 0x555 set
cpu_bind=MASK - hpc400, task 1 1 [23323]: mask 0x555 set
cpu_bind=MASK - hpc400, task 3 3 [23325]: mask 0x555 set
cpu_bind=MASK - hpc400, task 4 4 [23326]: mask 0x555 set
cpu_bind=MASK - hpc400, task 5 5 [23327]: mask 0x555 set
cpu_bind=MASK - hpc400, task 2 2 [23324]: mask 0x555 set
==================================================================
rank 1: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]
Process 1 on hpc400 out of 6.
rank 2: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]
Process 2 on hpc400 out of 6.
rank 3: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]
Process 3 on hpc400 out of 6.
rank 4: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]
Process 4 on hpc400 out of 6.
rank 5: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]
Process 5 on hpc400 out of 6.
rank 0: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]
Process 0 on hpc400 out of 6.
The equivalent --report-bindings output would be:
rank 1 bound to socket 0[core 0-5]: [BBBBBB][......]
rank 2 bound to socket 0[core 0-5]: [BBBBBB][......]
rank 3 bound to socket 0[core 0-5]: [BBBBBB][......]
rank 4 bound to socket 0[core 0-5]: [BBBBBB][......]
rank 5 bound to socket 0[core 0-5]: [BBBBBB][......]
rank 0 bound to socket 0[core 0-5]: [BBBBBB][......]
For distribution over multiple nodes, use the --ntasks-per-socket parameter to maintain symmetry across nodes.
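As a batch script, such a multi-node request might look like the following sketch (core binding shown; substitute --cpu_bind=socket for socket affinity; script and program names are placeholders):

#!/bin/bash
#SBATCH --account=maths
#SBATCH --partition=ucthi
#SBATCH --time=500:00
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-socket=2
# Two tasks per socket on each of the two nodes, each bound to its own core
srun --cpu_bind=verbose --cpu_bind=core ./myprog.o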
Below a user has requested 8 tasks over 2 nodes binding only to individual cores:
srun -A maths --partition=ucthi --time=500:00 --nodes=2 --ntasks-per-socket=2 --ntasks=8 --cpu_bind=verbose --cpu_bind=core myprog.o
cpu_bind=MASK - hpc401, task 7 3 [12991]: mask 0x8 set
cpu_bind=MASK - hpc401, task 4 0 [12988]: mask 0x1 set
cpu_bind=MASK - hpc401, task 6 2 [12990]: mask 0x4 set
cpu_bind=MASK - hpc401, task 5 1 [12989]: mask 0x2 set
cpu_bind=MASK - hpc400, task 0 0 [23970]: mask 0x1 set
cpu_bind=MASK - hpc400, task 1 1 [23971]: mask 0x2 set
cpu_bind=MASK - hpc400, task 2 2 [23972]: mask 0x4 set
cpu_bind=MASK - hpc400, task 3 3 [23973]: mask 0x8 set
==================================================================
rank 5: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 1[core 6[hwt 0]]
Process 5 on hpc401 out of 8.
rank 1: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 1[core 6[hwt 0]]
Process 1 on hpc400 out of 8.
rank 6: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 1[hwt 0]]
Process 6 on hpc401 out of 8.
rank 7: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 1[core 7[hwt 0]]
Process 7 on hpc401 out of 8.
rank 4: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]]
Process 4 on hpc401 out of 8.
rank 2: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 1[hwt 0]]
Process 2 on hpc400 out of 8.
rank 3: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 1[core 7[hwt 0]]
Process 3 on hpc400 out of 8.
rank 0: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]]
Process 0 on hpc400 out of 8.
The equivalent --report-bindings output would be (first pair of brackets = hpc400, second pair = hpc401):
rank 5 bound to socket 1[core 6]: [......][......] [......][B.....]
rank 1 bound to socket 1[core 6]: [......][B.....] [......][......]
rank 6 bound to socket 0[core 1]: [......][......] [.B....][......]
rank 7 bound to socket 1[core 7]: [......][......] [......][.B....]
rank 4 bound to socket 0[core 0]: [......][......] [B.....][......]
rank 2 bound to socket 0[core 1]: [.B....][......] [......][......]
rank 3 bound to socket 1[core 7]: [......][.B....] [......][......]
rank 0 bound to socket 0[core 0]: [B.....][......] [......][......]
Below a user has requested 8 tasks over 2 nodes binding only to sockets:
srun -A maths --partition=ucthi --time=500:00 --nodes=2 --ntasks-per-socket=2 --ntasks=8 --cpu_bind=verbose --cpu_bind=socket myprog.o
cpu_bind=MASK - hpc401, task 5 1 [13011]: mask 0xa set
cpu_bind=MASK - hpc401, task 4 0 [13010]: mask 0x5 set
cpu_bind=MASK - hpc400, task 0 0 [23994]: mask 0x5 set
cpu_bind=MASK - hpc400, task 3 3 [23997]: mask 0xa set
cpu_bind=MASK - hpc401, task 7 3 [13013]: mask 0xa set
cpu_bind=MASK - hpc401, task 6 2 [13012]: mask 0x5 set
cpu_bind=MASK - hpc400, task 2 2 [23996]: mask 0x5 set
cpu_bind=MASK - hpc400, task 1 1 [23995]: mask 0xa set
==================================================================
rank 5: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]
Process 5 on hpc401 out of 8.
rank 6: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]
Process 6 on hpc401 out of 8.
rank 1: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]
Process 1 on hpc400 out of 8.
rank 7: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]
Process 7 on hpc401 out of 8.
rank 4: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]
Process 4 on hpc401 out of 8.
rank 2: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]
Process 2 on hpc400 out of 8.
rank 3: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]
Process 3 on hpc400 out of 8.
rank 0: ompi_bound: Not bound (i.e., bound to all processors) current_binding: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]
Process 0 on hpc400 out of 8.
The equivalent --report-bindings output would be (first pair of brackets = hpc400, second pair = hpc401):
rank 5 bound to socket 1[core 6-7]: [......][......] [......][BB....]
rank 6 bound to socket 0[core 0-1]: [......][......] [BB....][......]
rank 1 bound to socket 1[core 6-7]: [......][BB....] [......][......]
rank 7 bound to socket 1[core 6-7]: [......][......] [......][BB....]
rank 4 bound to socket 0[core 0-1]: [......][......] [BB....][......]
rank 2 bound to socket 0[core 0-1]: [BB....][......] [......][......]
rank 3 bound to socket 1[core 6-7]: [......][BB....] [......][......]
rank 0 bound to socket 0[core 0-1]: [BB....][......] [......][......]
Additional useful documentation on SLURM affinity can be found at these two pages: