MPI

To run on a cluster or any machine with distributed memory, DiagHam relies on MPI. MPI support has to be enabled through the --enable-mpi option of configure when installing DiagHam. By default, DiagHam will look for mpiCC as the MPI C++ compiler. If your MPI C++ compiler has a different name (such as mpic++ when using OpenMPI on Ubuntu), you can set it by adding the --with-mpi-cxx="mympic++compiler" configure option.

Two modes are available when using MPI: an MPI only mode and a mixed MPI/SMP mode. Notice that running DiagHam on a single machine with multiple cores or CPUs does not require MPI; in that case, the SMP mode provides the parallel support.

Compiling with MPI

DiagHam has been extensively tested with OpenMPI. There should be no issue if you use this MPI implementation.


If you use another MPI library (such as the Intel MPI library), you can force the MPI compiler name using an option such as --with-mpi-cxx="mpiicpc" when configuring DiagHam. The Intel MPI compiler seems to be pretty picky, so if you get error messages like


   /afs/@cell/common/soft/intel/ics2013/impi/4.1.3/intel64/include/mpicxx.h(95): error: #error directive: "SEEK_SET is #defined but must not be for the C++ binding of MPI. Include mpi.h before stdio.h"
     #error "SEEK_SET is #defined but must not be for the C++ binding of MPI. Include mpi.h before stdio.h"

You should prepend CPPFLAGS="-DMPICH_IGNORE_CXX_SEEK" to the configure line. A full-blown example of a configure line (including support for MKL and ScaLAPACK) could look like

CPPFLAGS="-DMPICH_IGNORE_CXX_SEEK" ../configure --enable-fqhe --enable-fti --enable-spin --enable-intelmkl --enable-mpi --enable-debug CC="icc" CXX="icpc" --enable-lapack --with-intelmkl-libdir="/afs/@cell/common/soft/intel/ics2016.2/16.0/linux/mkl/lib/intel64" --with-blas-libdir="-L/afs/@cell/common/soft/intel/ics2016.2/16.0/linux/mkl/lib/intel64" --with-blas-libs="-lmkl_blas95_ilp64" --with-lapack-libs="-lmkl_lapack95_ilp64" --with-mpi-cxx="mpiicpc" --enable-scalapack --with-scalapack-libs="-lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64"

MPI only mode

The MPI only mode can be turned on using the --mpi option.
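As a minimal sketch (assuming an OpenMPI installation, a hypothetical hostfile name and placeholder program options), a run using four MPI processes could then be launched as

mpirun -np 4 --hostfile myhostfile ./FQHECheckerboardLatticeModel [program options] --mpi

where myhostfile lists the nodes to be used and FQHECheckerboardLatticeModel stands for any DiagHam program supporting MPI.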


MPI/SMP mode

The MPI/SMP mixed mode allows one to take advantage of SMP (i.e. shared memory) on each cluster node. In that case, only one MPI instance has to run per node. The MPI/SMP mode is enabled through the --mpi-smp option. This option requires a text file that describes the cluster.

Full cluster description

The simplest and most tunable approach is to give a precise description of the cluster. The typical file that you have to provide should look like:

   nostromo1 6 0.11 60000
   nostromo2 6 0.15 50000
   nostromo3 4 0.14 60000

The first column is the node hostname, the second column is the number of CPUs/cores that can be used on the node, and the fourth column is the amount of memory (in MB) that can be used. The fourth column is optional and is just a way to specify a per-node --memory option (like the one of FQHECheckerboardLatticeModel). The third column is also optional: it allows static load balancing to be performed by indicating the relative performance of a single core/CPU on the given node.


Since the MPI/SMP mode assumes only one MPI instance per node, the MPI hostfile has to look like

   nostromo1 slots=1
   nostromo2 slots=1 
   nostromo3 slots=1
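With this hostfile and the cluster description above stored, say, in a file named cluster.desc, a typical launch could then look like the following sketch (assuming the cluster description file name is passed directly as the argument of --mpi-smp, and with placeholder program options)

mpirun -np 3 --hostfile myhostfile ./FQHECheckerboardLatticeModel [program options] --mpi-smp cluster.desc

so that exactly one MPI process is started per node, the cores of each node being handled through SMP.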


Other possible cluster descriptions

Describing the cluster in such a way might not be the most convenient in several cases. For example, if your cluster uses a batch queue, it is quite difficult to know in advance which nodes will be used. Two other options are therefore provided. A simple cluster description can be

   master 1 0.1 100
   default 2 0.15 400

In that case, all nodes will have the same configuration (i.e. number of CPUs, performance index and memory) as the node described by the "default" hostname. An additional line can be provided to give a different configuration to the master node, using "master" as the hostname.

The second method allows a group of nodes sharing the same hostname prefix to be tuned at once. Assume the cluster has two types of machines, "nostromoX" and "discoveryY", where X and Y are unique for each node (they can be any string, integers, ...). Then the configuration can be set with a file such as

   nostromo* 4 0.2 10000
   discovery* 8 0.15 60000

There, each nostromoX node will have the configuration 4 cores, 10GB of memory and a performance index of 0.2, while each discoveryY node will have the configuration 8 cores, 60GB of memory and a performance index of 0.15. This second method can be mixed with the more general configuration file: for example, one can have a generic configuration for the nostromo nodes and a specific configuration for each discovery node.
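A hypothetical mixed file of this kind, with a generic line for the nostromo nodes and per-node lines for two discovery nodes, could look like

   nostromo* 4 0.2 10000
   discovery1 8 0.15 60000
   discovery2 8 0.17 50000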

Profiling

DiagHam can write a log file that describes how much time is spent in any operation that relies on MPI and/or SMP. The name of the log file has to be provided using the --cluster-profil option. Such a log file looks like

   number of nodes = 4.29181
   cluster description : 
       node 0 :  hostname=nostromo1  cpu=6  perf. index=0.148611 mem=60000Mb
       node 1 :  hostname=nostromo2  cpu=6  perf. index=0.21176 mem=50000Mb
       node 2 :  hostname=nostromo3  cpu=4  perf. index=0.222019 mem=60000Mb
    ---------------------------------------------
                   profiling                 
    ---------------------------------------------
   node 0: VectorHamiltonianMultiply core operation done in 3.002 seconds
   node 1: VectorHamiltonianMultiply core operation done in 7.273 seconds
   node 2: VectorHamiltonianMultiply core operation done in 8.863 seconds
   node 0: VectorHamiltonianMultiply sum operation done in 0.534 seconds

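For reference, such a log file can be obtained by simply appending the option to the command line of an MPI or MPI/SMP run, e.g. adding

--cluster-profil cluster.log

to a launch command like the ones sketched in the previous sections (cluster.log being an arbitrary file name, reused below).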

Dynamic load balancing is not yet available in DiagHam, but the information provided by such a log file makes it possible to optimize the static load balancing. The Perl script scripts/ClusterProfiling.pl can help to do that:

scripts/ClusterProfiling.pl cluster.log cluster.desc

where cluster.log is the profiling log file and cluster.desc the cluster description. The typical output is

   VectorHamiltonianMultiply sum operation :
     global total time = 469.393999999999s
     total CPU time = 469.393999999999s
     mean time = 0.399144557823129s
     number of calls = 1176
     total number of calls = 1176
     per node informations :
       node 0 : 
         total time = 469.393999999999s
         number of calls = 1176s
         mean time per call = 0.399144557823129+/-0.00547376435672348s
         min time per call = 0.387s
         max time per call = 0.534s
   ------------
   VectorHamiltonianMultiply core operation :
     global total time = 15968.545s
     total CPU time = 51840.2419999999s
     mean time = 8.81636768707482s
     number of calls = 1176
     total number of calls = 5880
     per node informations :
       node 0 : 
         total time = 3517.615s
         number of calls = 1176s
         mean time per call = 2.99116921768708+/-0.0516665897986875s
         min time per call = 2.941s
         max time per call = 3.587s
       node 1 : 
         total time = 8613.44299999998s
         number of calls = 1176s
         mean time per call = 7.32435629251699+/-0.181785287943176s
         min time per call = 7.164s
         max time per call = 9.276s
       node 2 : 
         total time = 10513.155s
         number of calls = 1176s
         mean time per call = 8.93975765306121+/-0.482395743143856s
         min time per call = 8.739s
         max time per call = 11.764s
   ------------
   ------------
   optimized performance index
   ------------
     node 0 :  perf. index=0.361140846848426
     node 1 :  perf. index=0.2101556663922
     node 2 :  perf. index=0.180522373455297
   nostromo1 6 0.361140846848426 60000
   nostromo2 6 0.2101556663922 50000
   nostromo3 4 0.180522373455297 60000

The final part below "optimized performance index" can be used as a new, optimized cluster description.