«FT-MPI: Fault Tolerant MPI, supporting dynamic applications in a dynamic world Graham E. Fagg and Jack J. Dongarra Department of Computer Science, ...»
Current MPI debuggers and visualization tools such as TotalView, Vampir and Upshot have no concept of MPI jobs that change their communicators on the fly, nor can they monitor a virtual machine. To help users understand such applications, the author has implemented two monitoring tools: Hostinfo, which displays the state of the virtual machine, and Cominfo, which displays processes and communicators in a colour-coded fashion so that users can see the state of an application's processes and communicators. Both tools are currently built using the X11 libraries but will be rebuilt using the Java Swing system to aid portability. Example displays during a SHRINK communicator rebuild operation are shown in figures 2 to 4.
Fig. 2. Cominfo display for a healthy three process MPI application. The colours of the inner boxes indicate the state of the processes and the outer box indicates the communicator state.
Fig. 3. Cominfo display for an application with an exited process. In this case the rank 1 process has exited. Note that the communicator is marked as having an error and that the number of processes and the size of the communicator are different.
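The state that Cominfo visualizes can be illustrated with a small toy model. The sketch below is not FT-MPI code and uses no MPI calls; the `Comm` class, its method names, and the process identifiers are hypothetical and chosen only for illustration. It assumes that a SHRINK rebuild drops exited processes and renumbers the survivors contiguously from rank 0, and it treats a communicator as being in error whenever its size differs from the number of live processes, as in the situation shown in Fig. 3.

```python
from dataclasses import dataclass

OK, EXITED = "ok", "exited"


@dataclass
class Comm:
    """Toy communicator model: members[rank] = (process_id, state)."""
    members: list

    def size(self):
        # Size as recorded in the communicator, including exited ranks.
        return len(self.members)

    def live_count(self):
        # Number of processes that are still running.
        return sum(1 for _, state in self.members if state == OK)

    def has_error(self):
        # Assumption: the communicator is in error when size and live count differ.
        return self.live_count() != self.size()

    def mark_exited(self, rank):
        pid, _ = self.members[rank]
        self.members[rank] = (pid, EXITED)

    def shrink(self):
        # Assumed SHRINK semantics: keep survivors only, ranks become 0..n-1.
        return Comm([m for m in self.members if m[1] == OK])


# Healthy three-process application, then rank 1 exits.
comm = Comm([(100, OK), (101, OK), (102, OK)])
comm.mark_exited(1)
print(comm.size(), comm.live_count(), comm.has_error())

# After the rebuild, the surviving processes occupy ranks 0 and 1.
rebuilt = comm.shrink()
print([(rank, pid) for rank, (pid, _) in enumerate(rebuilt.members)])
```

In this model the process that held rank 2 before the failure holds rank 1 after the rebuild, which is why an application using a shrinking rebuild cannot assume that ranks are stable across failures.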
7. Conclusions
FT-MPI is an attempt to provide application programmers with methods of dealing with failure within MPI applications other than checkpoint and restart. It is hoped that experimenting with FT-MPI will lead to new application methodologies and algorithms that provide both the high performance and the survivability required for the next generation of teraflop and beyond machines.
FT-MPI itself is already proving to be a useful vehicle for experimenting with self-tuning collective communications, distributed control algorithms and improved sparse data handling subsystems, as well as being the default MPI implementation for the HARNESS project.
8. References
1. Beck, Dongarra, Fagg, Geist, Gray, Kohl, Migliardi, K. Moore, T. Moore, P. Papadopoulous, S. Scott, V. Sunderam, "HARNESS: a next generation distributed virtual machine", Journal of Future Generation Computer Systems, (15), Elsevier Science B.V., 1999.
2. G. Stellner, “CoCheck: Checkpointing and Process Migration for MPI”, In Proceedings of the International Parallel Processing Symposium, pp 526-531, Honolulu, April 1996.
3. Adnan Agbaria and Roy Friedman, “Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations”, In the 8th IEEE International Symposium on High Performance Distributed Computing, 1999.
4. Graham E. Fagg, Keith Moore, Jack J. Dongarra, "Scalable networked information processing environment (SNIPE)", Journal of Future Generation Computer Systems, (15), pp. 571-582, Elsevier Science B.V., 1999.
5. Mauro Migliardi and Vaidy Sunderam, “PVM Emulation in the Harness MetaComputing