-*- text -*-

* RPI SSI (jsquyres)
  ------------------
- Re-enable GER in mpirun
- Make it possible to pass args back from SSI configure scripts
- Check SSI args (env vars) before launching via mpirun -- one error
  message instead of N [interlaced] messages
- write .ps kinds of docs for SSI authors
- update RPI docs for SSI
- convert the --with flags for the RPIs to use the standard prefix:
  e.g., --with-ssi-rpi-tcp-short=BYTES
- put the LAM_CONFIGURE_SETUP macro first in the RPI configure scripts
  (to adhere to the docs)
- figure out how to pass results of SSI configure scripts back up to
  top-level configure script (e.g., required LDFLAGS)
- convert gm RPI to SSI
- figure out a better way to handle module configure script failures --
  right now, modules must still get compiled even if they are not
  supported (e.g., compiling gm RPI on my laptop), and "lamboot -V"
  will show that they are compiled in.  It should be possible to skip
  compiling them altogether.
- Make env var detection code in lamd/tcp open calls
  --> DONE
- update mpirun docs for -ssi , LAM_MPI_=, and -lamd/-c2c news.
  --> DONE
- finish man page about LAM_SSI_ssi_verbose flag -- forgot to talk
  about all the options that it can take
  --> DONE

* gm RPI (jsquyres)
  -----------------
- do a general data pooling module
  - do data pooling on the cbuf stuff so that there's not a malloc for
    every insertion
  - do data pooling for the lists of requests (send and recv queues)
  - generalize it to allow rpi to supply a malloc/free function? -- yes
- redo the #include file ordering for rpi_c2c.h so that we don't have
  to malloc part of the _req for the gm rpi
- getting random heisenbugs in alltest -- seems to only happen when
  debugging is off.  Somehow, a buffer of 0x18 gets returned to the
  dreg pool from the short.c:short_send_body_unpin_callback function.
  Haven't tracked down how this is happening yet.  Try stepping
  through with debugger, and/or using bcheck...?
  --> Fixed
- fastsend/fastrecv
  - not implemented yet
- test heterogeneity
  - this is almost certainly broken...
- add stuff for tracing
- write a README in share/rpi/gm explaining what the gm RPI does and
  how it works
- moved gm configure.in stuff to share/rpi/gm
- Performance
- Tuning

* Quadrics RPI (TBD)
  ------------------
- Get hardware and API docs
- Funding would be good :)

* Mellanox InfiniBand RPI (TBD)
  -----------------------------
- Get hardware and API docs
- Funding would be good :)

* TCP RPI (brbarret)
  ------------------
- TCP RPI env variables to reset short/long size

* Make Multi-SSI goodness (brbarret)
  ----------------------------------

* mpirun help (brbarret)
  ----------------------
- Make safe for C/R
- Make safe for (remote) node failures

* LAMD Changes (brbarret)
  -----------------------

* RPI Docs (jsquyres)
  -------------------
- Update for all changes

* BPROC (brbarret)
  ----------------
- make SSI-able
- what can we do with VMADump?

* Condor Boot SSI (brbarret)
  --------------------------

* TM Boot SSI (brbarret)
  ----------------------
- Look at tm_info for getting hostname info
- add better parsing code
- can we factor code out?

* LBNL Checkpoint / Restart (sriram)
  -----------------------------------
- finish
- featurize
- Write paper

* Checkpoint / Restart SSI (sriram)
  ---------------------------------
- make sure we also support migration
- what do we need for Condor?

* PBS Integration (tm) (brbarret)
  -------------------------------
- Do lamshrink/lamgrow work in the PBS environment?  Do they need to
  be extended to pass the socket name, like lamboot was?
- Eric Roman idea: if you mpirun -pbs and there's no running lamd,
  fork/exec/wait a "lamboot -pbs" behind the scenes.  Hence, there's
  no need for a user to do lamboot or wipe in a PBS job (but they
  don't need to know that).
- Add usage of PBS_TMDIR

* Performance Testing (mini-llamas)
  ---------------------------------
- Get some

* IMPI (TBD)
  ----------
- Get Funding :)

* Windows Support (TBD)
  ---------------------
- Get running under Cygwin (brbarret)
- Get Windows guy

* Update to latest ROMIO (jsquyres)
  ---------------------------------