MPICH/Myricom tiling problem

Report or discuss software problems and other woes

Moderators: arango, robertson

hankseidel
Posts: 11
Joined: Tue Sep 09, 2003 6:51 pm
Location: Texas A&M University

MPICH/Myricom tiling problem

#1 Unread post by hankseidel »

I have just built a 64-processor cluster using a Myrinet switch and MPICH1 for MPI, and I am getting an unusual and random problem. I realize this may not be a ROMS problem (it could be a switch, MPI, or compiler problem), but I wanted to see if any users have had similar experiences. I am running ROMS3 coupled to the CAM3 atmosphere. Other specifics:

Portland Group compilers
Myricom MX
MPICH1 (MPICH2 is in beta for Myrinet)

Now for the problem . . .

I am getting random missing tiles in the output; there can be several or none. The model runs fine, but the netCDF output is bad. This includes the restart files, so the model is not restartable. Since the model runs fine, the halo exchange through MPI must be working. I have placed a plot showing the problem with one missing tile here:

http://pod.tamu.edu/~seidel/results/ROM ... rob_01.gif

I am also reporting this to Myricom and the Portland Group. It is interesting that CAM3 does not have this problem.

Thanks,

Hank

arango
Site Admin
Posts: 1367
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

#2 Unread post by arango »

Very interesting problem. I have never observed such behavior. There is not enough information in your posting for me to have an idea of what is going on: resolution, partition, memory, etc. Does this happen only for output? Why not input? Is it only the restart file? What about the history file?

The output of ROMS is generic and managed by calls to the nf_fwrite2d, nf_fwrite3d, and nf_fwrite4d routines. These routines call mp_gather to collect data from all nodes into a one-dimensional array. There is a lot of error checking in mp_gather; if anything goes wrong, the error flag exit_flag is activated to terminate execution. The fact that this is not happening is very weird.
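
For reference, the logic is roughly like the following sketch. This is not the actual mp_gather code; the array names, tile size, and abort handling are illustrative, but it shows the gather-and-check pattern where a silent communication failure would leave holes in the global array that is passed to the netCDF write:

Code: Select all

program gather_sketch
!
! Minimal sketch (not ROMS code) of gathering tiled data onto the
! master node before a netCDF write, with the MPI error code checked.
!
  use mpi
  implicit none
  integer, parameter :: tile_size = 4        ! points per tile (illustrative)
  integer :: MyRank, Nnodes, MyError, i
  real, allocatable :: Atile(:), Aglobal(:)
  integer, allocatable :: counts(:), displs(:)

  call MPI_INIT (MyError)
  call MPI_COMM_RANK (MPI_COMM_WORLD, MyRank, MyError)
  call MPI_COMM_SIZE (MPI_COMM_WORLD, Nnodes, MyError)

  allocate ( Atile(tile_size), counts(Nnodes), displs(Nnodes) )
  allocate ( Aglobal(tile_size*Nnodes) )
  Atile  = REAL(MyRank)                      ! tag each tile with its rank
  counts = tile_size
  displs = (/ ((i-1)*tile_size, i=1,Nnodes) /)

! Collect every tile into one 1-D array on node 0.  If this call fails
! silently, Aglobal keeps whatever was already in memory and the written
! field shows a missing tile, so the return code must be checked.
  call MPI_GATHERV (Atile, tile_size, MPI_REAL,                        &
                    Aglobal, counts, displs, MPI_REAL,                 &
                    0, MPI_COMM_WORLD, MyError)
  if (MyError.ne.MPI_SUCCESS) then
    print *, 'gather failed on rank ', MyRank
    call MPI_ABORT (MPI_COMM_WORLD, 1, MyError)
  end if

  if (MyRank.eq.0) print *, Aglobal          ! this is what gets written out
  call MPI_FINALIZE (MyError)
end program gather_sketch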

It is possible that there is something wrong with your MPI library.

hankseidel
Posts: 11
Joined: Tue Sep 09, 2003 6:51 pm
Location: Texas A&M University

#3 Unread post by hankseidel »

The model is a regional Atlantic domain with a resolution of 1/4 degree in both latitude and longitude and 30 layers, so i=482, j=342, and k=30. I have run this model on other systems without this problem. I don't believe it is a ROMS problem at all, but I was wondering if any other users had seen it with their models. My compute nodes are four-processor Itaniums with 2 GB of memory each. I do not get this problem on a single node, so I suspect the Myricom switch. When I run on 8, 16, or 32 processors, I get this problem with ALL output files: restart, history, and average.
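
To be clear about what a "tile" means here: the grid is split into an NtileI by NtileJ partition, one tile per process, and each missing block in the plot is one of those tiles. The little sketch below shows the bookkeeping for a 32-processor case; the 8x4 split is just an example (not my actual NtileI/NtileJ), and the even division is only approximate, not the exact ROMS decomposition:

Code: Select all

program tile_layout
!
! Sketch of how a 482x342 grid splits into tiles.  The 8x4 partition for
! 32 processes is an assumption, and the even split below is only
! approximate bookkeeping, not the exact ROMS decomposition.
!
  implicit none
  integer, parameter :: Im = 482, Jm = 342       ! interior grid points
  integer, parameter :: NtileI = 8, NtileJ = 4   ! assumed partition (8*4 = 32)
  integer :: itile, jtile, Istr, Iend, Jstr, Jend

  do jtile = 0, NtileJ-1
    do itile = 0, NtileI-1
      Istr = itile*Im/NtileI + 1                 ! i range covered by this tile
      Iend = (itile+1)*Im/NtileI
      Jstr = jtile*Jm/NtileJ + 1                 ! j range covered by this tile
      Jend = (jtile+1)*Jm/NtileJ
      print '(a,i3,4(a,i4))', 'tile ', jtile*NtileI+itile,             &
            ':  i = ', Istr, ' to ', Iend, ',  j = ', Jstr, ' to ', Jend
    end do
  end do
end program tile_layout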

Output can be fine, or one or several tiles may be corrupted. What is interesting is that at record l=4 the history file may have missing tiles, but at l=5 it is fine. I suspected the netCDF libraries and rebuilt them, with no improvement.
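
In case it helps anyone checking their own output, this is the kind of quick scan I can run on a history record to spot a dropped tile. It is only a sketch: the file name and variable name ('zeta') are placeholders, and it assumes a bad tile shows up as a solid rectangular block of zeros:

Code: Select all

program find_missing_tiles
!
! Sketch only: scans one record of a 2-D field for long runs of zeros,
! which is how a dropped tile shows up in my output.  File name,
! variable name, and the zero-fill assumption are all placeholders.
!
  use netcdf
  implicit none
  integer, parameter :: Im = 482, Jm = 342      ! interior grid size of this setup
  integer :: ncid, varid, status, j, nzero
  real :: field(Im,Jm)

  status = nf90_open('ocean_his.nc', nf90_nowrite, ncid)
  if (status /= nf90_noerr) stop 'cannot open file'
  status = nf90_inq_varid(ncid, 'zeta', varid)
  if (status /= nf90_noerr) stop 'cannot find variable'
  status = nf90_get_var(ncid, varid, field,                            &
                        start=(/1,1,1/), count=(/Im,Jm,1/))
  if (status /= nf90_noerr) stop 'cannot read variable'
  status = nf90_close(ncid)

! Report rows with suspiciously many zeros; a missing tile produces a
! solid block of them spanning consecutive rows.
  do j = 1, Jm
    nzero = count(field(:,j) == 0.0)
    if (nzero > Im/4) print *, 'row ', j, ' has ', nzero, ' zero points'
  end do
end program find_missing_tiles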

Another interesting aspect is that I have had the model run fine for 10 years in a single run, so the halos must be getting passed OK. But on occasion the model will blow up randomly, and after restarting the run it passes the blowup point (nothing changed in the run). So perhaps this issue is affecting the halo passing on occasion. The restart file from the blowup is indicative of this.

Myricom has been working with me for the past three weeks. They are also baffled, but they seem to agree that it looks like a switch issue or an issue with the MPICH version they have modified for their hardware.

Thanks for your reply!

Hank

hankseidel
Posts: 11
Joined: Tue Sep 09, 2003 6:51 pm
Location: Texas A&M University

Problem Fixed

#4 Unread post by hankseidel »

I thought I would post an update, as we have solved this problem. We were on a new version of Linux (CentOS 5), and apparently the new kernel broke the Myricom drivers. Their support people told me this happens from time to time and that they try to stay on top of it. As it turned out, they provided excellent support and used my installation to fix their software. We could have installed an older kernel but opted to keep our installed version. All works fine now, and our coupled ROMS/CAM3 runs really well on our cluster.

Thanks for the help, and perhaps if someone has the same issue in the future, this posting will point them in the direction of a solution.
