Small Projects:
---------------

* Build System:

  - Raja suggests a --with option to configure that will build all
    four Fortran symbol conventions in the fortran library(ies?).
    Hence, a vendor can do this in an RPM and thereby maximize the
    chance of client code working properly with it. It won't fix
    fortran-string-handling differences between different fortran
    compilers (for example), but it may expand the shelf-life of an
    RPM on a CD.

    ...or does it? If only the symbol convention is fixed and nothing
    else is, does this really work for multiple fortran compilers?
    Hrmm. Doubtful. See
    http://www.lam-mpi.org/Internal/llamas/msg01612.php3.

  - Make the lamtests module not suck. There are several things that
    could be done better:

    + eliminate much of the deep Makefile voodoo that makes the
      lamtests suite work; move it all to shell scripts or something.
    + add some f77 tests (e.g., MPI_STATUSES_IGNORE).
    + do a "lamboot" if it's not already there.
    + perhaps make lamtests be "make check" in the main LAM
      distribution so that it's not a separate package -- it could
      then piggyback off the main configure script, and wouldn't need
      to be a second download.

  - Make a configure test to see if the fortran compiler can accept -I
    or not; add -I in hf77.c. Will need to add some hf77 flag to not
    use -I if the user overrides the default fortran compiler (e.g.,
    with a compiler that does not support -I).

* The ability to specify which ports LAM should use.

* Implement mpiexec -- a command line version of MPI_Comm_spawn.

* Add a command line option to mpirun/lamexec to run each command in a
  separate xterm with the DISPLAY piped back to the invoking node
  (e.g., mpirun N -xterm gdb myprogram); check to ensure DISPLAY is
  set properly (i.e., check that it is not empty and does not start
  with ":" -- also give an option to not export DISPLAY -- e.g., for
  ssh).

* How about an mpirun option that makes Send/Rsend act as Ssend,
  Isend/Irsend act as Issend, and Send_init/Rsend_init act as
  Ssend_init?
  MPI_Init() sets a global flag for this, and the various entry-point
  routines check it and switch the LAM_RQ*SEND flag as needed.

* Improve NULL checking for choice arguments (e.g., MPI_Send). It's
  in lambuf.c:lam_bufinit().

* From Raja: I noticed that on both 10.01 and 10.20, wipe is much
  slower than in previous LAM versions, even for a 1-node cluster.
  Anything you know about or should this be investigated (low
  priority, doesn't hold the release)?

* Add implicit "." in PATH on remote nodes; per messages on LLAMAS and
  the LAM list, if "." is not in the PATH, "mpirun N foo" will not
  work, causing confusion for the user. Not having "." in the PATH is
  the default in some Linuxes, for example, and Linux is a big target
  OS for LAM.

* The haltd in the lamd has a sleep(1) in it to allow other things to
  happen before the lamd actually dies. This has the side effect of
  "lamhalt" returning before the lamd actually dies, and if you have a
  fast processor and are (for example) running a script, you may run
  into a race condition where there *should* be a lamd running, but
  there isn't because the delayed lamhalt/tkill got it. For example:

      while (forever)
        lamboot
        mpirun C foo
        lamhalt

  This can run into problems on fast processors. The fix: somehow
  make lamhalt wait until the local lamd actually dies before
  returning.

* Add environment variable LAM_MPI_RSH to do the same as LAM_RSH.
  Find all environment variables that we're using, have them all look
  for "LAM_MPI_" versions (in addition to the old names, of course),
  and make a list of them somewhere -- they're not clearly documented
  anywhere.

* Add support for MPI_LONG_LONG_INT in builtin reduction operations.
  See http://www.lam-mpi.org/Internal/llamas/msg01537.php3.

* Add another MPI_Info flag to MPI_Comm_spawn to do the equivalent of
  mpirun's "-s" to send executables to remote nodes.

* In a bootschema, if a different username is specified for the
  localhost, we should rsh to launch that one, not lam_few().
* Make recon check for writability in /tmp (since the lamd will need
  it), and have it fail with an appropriate error message if it can't
  write to /tmp/. Thanks to Alex Rhomberg for pointing this out
  (http://www.lam-mpi.org/MailArchives/lam/msg01599.php3).

* Reuse communicator IDs. Perhaps via bitmap (since we only have 12
  bits total). But this will change a bunch of the collective
  algorithms for getting new CIDs (e.g., MPI_COMM_DUP), because it
  breaks the simple "just get the max" assumption.

  We "sort of" reuse cids now, but not really (we will mark cids as
  "unused" when we are finished with them, but when we need a new cid,
  we still take the highest non-used cid. Not quite the same as
  reusing all cids).

* Update C++ bindings to include all implemented MPI-2 functions.

* Make a run-time flag (env variable?) to disable all parameter
  checking in MPI functions for added performance.

* Make large shmem copies incremental, a la Sun's MPI. This will
  decrease latency and increase bandwidth on SMPs.

* Fortran programs and C/C++ programs that effectively do
  MPI_Init(NULL, NULL) do not show nicely in mpitask. So use mpirun
  to pass argv[0] to the processes, possibly with the -e mechanism.
  Raja suggests this in
  http://www.lam-mpi.org/Internal/llamas/msg01137.php3 (how about
  using the global environ variable?)

* Add the C++ datatypes in the C++ package per the MPI-2 standard.
  --> Steal from the open-sourced Sun MPI. (can we do so?)

* Nick also suggests that we should allow "mpirun n0:4 n1:2 ..."
  command line arguments to specify how many copies to start on each
  node.


Long Term Projects
------------------
(in Andy's order of importance)

* Fault Tolerance / Reliability (currently worked on by brbarret)

  - what happens to mpirun if "tkill" (effectively) happens on a node
    during a run? i.e., the fault tolerant example problem. The real
    solution is that when a rank discovers that another rank is dead,
    it should notify mpirun somehow.
    Then mpirun will wait for one less process to finish (however,
    mpirun will have to keep track of which ranks it has been told not
    to wait for -- it is possible that multiple ranks will report to
    mpirun that rank 7 has died, so mpirun needs to register that rank
    7 is dead only once).

    Someone suggested having an option to mpirun to not abort an
    entire parallel app when one rank dies. This would entail fixing
    mpirun to not rpdoom when it gets a non-zero return status, as
    well as potentially not having the app itself do an rpdoom (i.e.,
    a new run-time flag).

* Benchmarking Infrastructure

  There are two parts to this project. First, we need an
  infrastructure for developing benchmark results. There are a number
  of projects that look at various metrics for an MPI implementation.
  The end goal is to have a "make benchmark" command that would run
  all the tests, save the results to a file, and make pretty graphs
  that give instant useful feedback to the development team.

  Second is to use the benchmarking framework and some analysis tools
  to find all the places where LAM/MPI's performance is lacking. For
  example, if we are slow on persistent sends, use instruction trace
  utilities to determine why we are slow on persistent sends. There
  have also been a number of changes to the communication
  infrastructure, so it would be very helpful to have a performance
  comparison to determine how the current development tree relates to
  previous releases.

* Windows Support

  Currently, LAM/MPI does not operate correctly in the Windows
  environment. The initial goal is to get LAM/MPI operating correctly
  under the Cygwin Unix portability layer and slowly add native
  Windows support as time goes on.

* TotalView Support

  Add support for the TotalView debugger, per
  http://www.mcs.anl.gov/~gropp/papers/pvmmpi99/eurompi-paper.ps.
  More specific information is in the MPICH distribution, in
  src/infoexport/*, including both the MPICH implementation as well as
  a detailed document with the required API and whatnot.
  See if we can scam a free TotalView license out of this. :-)

* Optimized Collectives

  Integrate MagPIe into LAM. The collectives are falling behind in
  LAM, still treating all processes as equidistant. Many of the
  collectives are heavily used by commercial codes. Thilo has made
  his code layerable; LAM could provide the routines to determine the
  cluster topology, using shmem-vs-tcp as the divide level. It could
  be done in two steps:

  - Provide a patch for 6.3b that adds these two routines, test the
    combo as is, and announce it (mailing list + web page).

  - For 6.4 slurp it into LAM, with Thilo's blessing, so users don't
    have to rely on PMPI and have it work out-of-the-box (or tarfile).
    If testing shows that MagPIe is not an across-the-board win, then
    decide where the big win is and cut-n-paste the parts needed.

  - You could check:

    + If Thilo is willing to help by being part of the LAM extended
      team.
    + If somebody on the LAM mailing list wants to help in testing and
      performance measurement on the variety of clusters out there.

    I don't know if somebody has enough info to provide the two
    LAM-specific routines; you may have to do that in-house.

  *** Followup: Thilo Kielmann says that it already works with LAM,
  and he would love to have it become an [optional] part of LAM. I
  told him it would probably take a while (IMPI first), but we would
  get to it someday. JMS

* Condor

  The Condor environment is a cycle-scavenging system developed at
  U. of Wisconsin. There are a number of issues regarding job
  startup, process migration, and checkpointing that would need to be
  examined in this project.

* VIA Support

  VIA is an OS-bypass communication interface implemented by many
  high-performance networks. LAM/MPI currently must communicate over
  the slower TCP/IP interface to utilize VIA hardware. We would like
  to use the native VIA interface to improve performance. This
  requires knowledge of the VIA interface and the LAM transport
  engine.
  While probably the most difficult code to write of all the projects
  listed here, it uses the best-documented interfaces in the LAM/MPI
  project.

* Web Site Updates:

  - better/more obvious navigation
  - put performance numbers (and some comparisons) right up front
    (Andy)
  - put Linux/BSD logos right up front for those who carry LAM,
    potentially with cross links to relevant posts on the mailing
    list, notes like "RedHat 6.2 users encouraged to upgrade to x.y.z"
    (Raja)

* Threads in MPI layer

  Make LAM MT-hot (allow the last level of MPI_INIT_THREAD --
  MPI_THREAD_MULTIPLE). Can probably do this by one of two methods:

  1. have a separate thread for the progress engine (see below), or
  2. allow the first user thread to go down and be the progress engine
     (this will mean that the first thread has to act as a true
     progress engine, though, so it's not much different than #1,
     meaning that it will have to selectively unblock other user
     threads when messages come in for them as it is waiting for its
     own message).

  Either method will require the selective unblocking of threads when
  a message comes in. #1 is probably less complicated, actually.

  - Have a thread to run the progress engine so that message passing
    can really occur in the background. Probably only
    necessary/advisable for MPI_THREAD_MULTIPLE since it will mandate
    the use of locks and whatnot. This would necessitate LAM being
    MT-hot (i.e., if none of the user threads go down into the
    progress engine at all, except perhaps by bypassing it in the
    fastsend/fastrecv). This will essentially separate an MPI process
    into 2 parts: the progress engine and the user code. See JMS PhD
    proposal.

  - Perhaps split the progress engine to have a separate thread for
    each destination (or some percentage of destinations to ensure
    that it scales). See JMS PhD proposal.

* IPv6 Support

  Add IPv6 support for All Things using TCP.
  Not quite as simple as it sounds -- although we have a small number
  of places that do IP name lookups and open sockets, there is a
  larger number of places that require hostname parsing that will need
  to change, and the internals of the lamd (which hold IPv4 address
  tables) and the lamboot protocols (which exchange IPv4 addresses)
  will need to be modified as well.

  Helpful URLs in learning about IPv6:

    http://playground.sun.com/pub/ipng/html/INET-IPng-Paper.html
    www.ipv6.org
    www.freenet6.org
    http://www.freenet6.net/

* Better error reporting in lamboot/recon when the remote lamd is not
  able to start up properly. i.e., hboot immediately severs the
  stdout/stderr to the lamd so that rsh/ssh can finish. Perhaps we
  should wait for the connection from the remote lamd before allowing
  the hboot to quit, and therefore not have to sever the stderr/stdout
  from the lamd immediately. Hence, we can have a "-d" (or whatever)
  debugging output flag to the lamd, which can then send error output
  messages to stdout/stderr, and rsh/hboot will funnel it back to
  recon/lamboot. Hence, hboot doesn't die until

  a) it receives an ACK from lambootagent saying "ok to die, close
     stdout/stderr" -- we'll also have to tell the lamd to close
     stdout/stderr, too, or
  b) lamd dies abnormally, at which point we nsend something back to
     lambootagent saying "oops... badness happened here".

  Perhaps hboot can block in nrecv waiting for a), and be set up to
  catch SIGCHLD to detect b)...? See
  http://www.lam-mpi.org/MailArchives/lam/msg01599.php3 and
  http://www.lam-mpi.org/MailArchives/lam/msg01600.php3.

* IMPI

  - Add attributes on IMPI communicators per section 2.5,
    IMPI_CLIENT_SIZE, IMPI_CLIENT_COLOR, IMPI_HOST_SIZE,
    IMPI_HOST_COLOR
  - Integrate Dog's server package into LAM, make mpirun -server do
    the Right Thing, use his .h file.
  - Do the collectives.
  - Rename all extern variables in impi.h to have a common naming
    scheme

* Optimize datatypes (see the MPICH datatype papers) so that we can
  avoid all the extra copying and whatnot.


Commendable Activities
----------------------

- Do enough of MPI_CANCEL so that we can claim that it is conformant.

- Add new RPI functions for basic collectives: probably only barrier
  and broadcast. (This has dubious value -- Raja is quite against it)

- In C2C mode, open sockets the first time they are needed -- don't
  just make a fully connected mesh upon MPI_INIT/MPI_COMM_SPAWN*.

- Make stdout/stderr between ranks scalable -- i.e., some kind of
  tree-based hierarchy of passing stdout/stderr back to mpirun. Also
  add "rank:" prefixes to each line of output (add a command line
  switch to mpirun to enable this). PROBLEMATIC: local ranks, for
  example, just write directly to mpirun's stdout/stderr -- there's no
  intervention with any LAM code; it just happens in unix. The rest
  may be modifiable with the lamd/iod...?
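
The "rank:" prefix idea in the stdout/stderr item above could be
sketched roughly as follows. This is a minimal Python illustration,
not LAM code: the RankPrefixer class, its feed/flush methods, and the
exact "N: " prefix format are all hypothetical. The point it shows is
that whoever forwards the output has to buffer partial lines per rank,
so that a prefix is only ever attached to a complete line.

```python
class RankPrefixer:
    """Hypothetical helper: buffer raw output chunks per rank and
    emit only complete lines, each prefixed with its rank."""

    def __init__(self):
        # rank -> text received so far that has no trailing newline yet
        self._partial = {}

    def feed(self, rank, chunk):
        """Accept one chunk of output from `rank` and return the list
        of lines it completed, each prefixed with "rank: "."""
        data = self._partial.get(rank, "") + chunk
        # Everything before the last newline is complete; the tail is not.
        *lines, rest = data.split("\n")
        self._partial[rank] = rest
        return [f"{rank}: {line}" for line in lines]

    def flush(self, rank):
        """Return any unterminated trailing text for `rank`, prefixed
        (e.g., when the rank exits without a final newline)."""
        rest = self._partial.pop(rank, "")
        return [f"{rank}: {rest}"] if rest else []


if __name__ == "__main__":
    p = RankPrefixer()
    # Chunks from different ranks may arrive interleaved and split mid-line.
    for rank, chunk in [(0, "hel"), (1, "world\n"), (0, "lo\nnext")]:
        for line in p.feed(rank, chunk):
            print(line)
    for line in p.flush(0):
        print(line)
```

Note that this does nothing for the PROBLEMATIC case above: ranks that
write straight to mpirun's stdout/stderr never pass through any
forwarding code, so there is no place to hook in such a filter.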