Hi Hernan,
I timed MPI-ROMS and SMS-ROMS and found that MPI-ROMS runs slower. For example, on our machine with 12 processors, 20 time steps of the baby bear benchmark (30-second run time overall) spent 4.5 seconds in halo updates with MPI-ROMS and about 2.6 seconds with SMS-ROMS.
I looked at the MPI-ROMS code and noticed that it does not attempt to aggregate the halo updates of multiple variables into a single message. This could hamper scalability, particularly on the high-latency IBM machines. Also, it appears that MPI-ROMS always updates the entire halo region on each side. When I implemented SMS-ROMS, I found that it was possible to update only the halo points that were actually needed in each case (maybe 1 on one side and 2 on the other). Without looking at the MPI-ROMS code in detail, I'm not sure whether this optimization would yield dividends, but I suspect it might.
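To illustrate the two points above, here is a rough Python sketch of a simple latency/bandwidth cost model for one halo exchange. It is not taken from MPI-ROMS or SMS-ROMS: the function name, its parameters, and every number in it are assumptions chosen only to show the trend, not measurements from any machine.

```python
def halo_exchange_time(n_vars, halo_width, edge_points, n_levels,
                       latency_s, bandwidth_bytes_per_s,
                       aggregate, bytes_per_value=8):
    """Estimate the time to send one side's halo for n_vars variables.

    Hypothetical model: each message costs a fixed latency plus its
    size divided by the bandwidth.
    """
    values_per_var = halo_width * edge_points * n_levels
    total_bytes = n_vars * values_per_var * bytes_per_value
    # Aggregation packs all variables into one message, so the
    # per-message latency is paid once instead of once per variable.
    n_messages = 1 if aggregate else n_vars
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s


# Assumed, illustrative values only (not measured on any real system).
LATENCY_S = 50e-6      # per-message latency
BANDWIDTH = 100e6      # bytes per second
N_VARS = 4             # variables updated together
EDGE_POINTS = 64       # grid points along the shared tile edge
N_LEVELS = 20          # vertical levels

# One message per variable, full halo of width 2.
separate = halo_exchange_time(N_VARS, 2, EDGE_POINTS, N_LEVELS,
                              LATENCY_S, BANDWIDTH, aggregate=False)
# All variables packed into one message, full halo of width 2.
aggregated = halo_exchange_time(N_VARS, 2, EDGE_POINTS, N_LEVELS,
                                LATENCY_S, BANDWIDTH, aggregate=True)
# Packed, and only the 1 halo point actually needed on this side.
narrow = halo_exchange_time(N_VARS, 1, EDGE_POINTS, N_LEVELS,
                            LATENCY_S, BANDWIDTH, aggregate=True)

print(f"separate:   {separate:.6e} s")
print(f"aggregated: {aggregated:.6e} s")
print(f"narrow:     {narrow:.6e} s")
```

In this model, aggregation saves (number of variables - 1) latencies per exchange, which matters most when latency is high, while sending only the needed halo points reduces the bandwidth term. Whether either effect is significant in practice depends on the real message sizes and machine characteristics.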
I imagine you're already aware of these issues, but I thought I'd mention them in case you are not. I'd be happy to give you more info if you are interested.
Dan Schaffer