Hi,
I'm trying to start a ROMS model run, but I run into a problem right after all the data have been read from the INI file. I think the code is about to read the BRY file. This is what I get in the log file:
NLM: GET_STATE - Reading state initial conditions, 2017-07-24 12:00:00.00
(Grid 01, t = 17371.5000, File: micro50m_dH80_ini_mod.nc, Rec=0001, Index=1)
- free-surface
(Min = 0.00000000E+00 Max = 1.80467248E-01)
- vertically integrated u-momentum component
(Min = -1.37338787E-01 Max = 2.01014299E-02)
- vertically integrated v-momentum component
(Min = -2.59839240E-02 Max = 3.77318333E-03)
- u-momentum component
(Min = -2.99569547E-01 Max = 2.72806704E-01)
- v-momentum component
(Min = -5.64891249E-02 Max = 5.14226630E-02)
- potential temperature
(Min = 0.00000000E+00 Max = 1.88401909E+01)
- salinity
(Min = 0.00000000E+00 Max = 3.44776344E+01)
[c5-3:63176] *** An error occurred in MPI_Bcast
[c5-3:63176] *** reported by process [2129264641,16]
[c5-3:63176] *** on communicator MPI_COMM_WORLD
[c5-3:63176] *** MPI_ERR_TRUNCATE: message truncated
[c5-3:63176] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c5-3:63176] *** and potentially your MPI job)
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libifcoremt.so.5 00002B3963777F3F for__signal_handl Unknown Unknown
libpthread-2.17.s 00002B39629425F0 Unknown Unknown Unknown
libuct.so.0.0.0 00002B3978065275 uct_rc_mlx5_iface Unknown Unknown
libucp.so.0.0.0 00002B3977AAF9F2 ucp_worker_progre Unknown Unknown
mca_pml_ucx.so 00002B397788FC74 mca_pml_ucx_progr Unknown Unknown
libopen-pal.so.20 00002B3967064581 opal_progress Unknown Unknown
libmpi.so.20.10.2 00002B396323BF70 ompi_request_defa Unknown Unknown
libmpi.so.20.10.2 00002B396326FB41 ompi_coll_base_bc Unknown Unknown
libmpi.so.20.10.2 00002B396326FFCE ompi_coll_base_bc Unknown Unknown
mca_coll_tuned.so 00002B397881C038 ompi_coll_tuned_b Unknown Unknown
libmpi.so.20.10.2 00002B396324E53A MPI_Bcast Unknown Unknown
libmpi_mpifh.so.2 00002B3962FD4024 pmpi_bcast Unknown Unknown
romsG 0000000001D86EE5 distribute_mod_mp 824 distribute.f90
romsG 0000000000CFD830 initial_ 219 initial.f90
romsG 0000000000405CFD ocean_control_mod 142 ocean_control.f90
romsG 00000000004043C4 MAIN__ 96 master.f90
romsG 000000000040400E Unknown Unknown Unknown
libc-2.17.so 00002B3965E8C505 __libc_start_main Unknown Unknown
romsG 0000000000403F09 Unknown Unknown Unknown
I have sent this error message to the support team at the computer center (NOTUR in Norway), and they say:
"MPI_ERR_TRUNCATE is caused by the receive buffer being too small to hold the incoming message. That means that the sender sends more data than the receiver declared he wants to receive. Consequently, this is most likely a bug in the code itself. If you implemented it, you have to make sure that the send/recv match. This can sometimes be unclear for multi-threaded codes.
If it is not your code I think you should contact the developers and send them the error message you showed us."
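In other words, a count mismatch at a broadcast like the one in the following minimal sketch (hypothetical standalone code, not taken from ROMS) produces exactly this error: the root sends more integers than the other ranks have declared room for, and the job aborts with MPI_ERR_TRUNCATE.

program bcast_mismatch
  use mpi
  implicit none
  integer :: ierr, rank
  integer :: nbuf(10)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  nbuf = 0
  if (rank == 0) then
     ! The root broadcasts 10 integers ...
     call MPI_Bcast(nbuf, 10, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  else
     ! ... but every other rank only declared room for 5, so the incoming
     ! message is truncated and MPI_ERRORS_ARE_FATAL aborts the job.
     call MPI_Bcast(nbuf, 5, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  end if
  call MPI_Finalize(ierr)
end program bcast_mismatch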
Any help would be appreciated.
Best regards, Andre
Re: MPI_ERR_TRUNCATE
What is on line 219 of your initial.f90? This would be in your scratch directory.
Re: MPI_ERR_TRUNCATE
Line 219 in initial.f90:
CALL mp_bcasti (ng, iNLM, exit_flag)
Re: MPI_ERR_TRUNCATE
Lines 824-825 in distribute.f90:
CALL mpi_bcast (A, Npts, MPI_INTEGER, MyMaster, OCN_COMM_WORLD, &
& MyError)
arango (Site Admin, DMCS, Rutgers University)
Re: MPI_ERR_TRUNCATE
I think that this error has nothing to do with ROMS itself but with the communication between your processes through the MPI library. Sometimes, when you are using a lot of processes, one or more of them do not arrive at the collective MPI synchronization point and the system either hangs or gives this type of error. I get such errors on computer clusters when the application is using more than one node. Usually, the processes that go bonkers are on nodes not containing the master rank, so it is associated with the inter-node communication hardware. I have had a lot of headaches with this problem because of its random behavior. Technically, one or more processes can exhaust the buffer space, which can cause failures when dropped messages are retransmitted. In your case, this may be happening before the simple call to mp_bcasti.
By the way, how many processes are you using?
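To make the "did not arrive at the collective synchronization point" failure mode concrete, here is a minimal standalone sketch (hypothetical code, not from ROMS) in which one rank skips a broadcast that the others perform. Its next, perfectly valid broadcast can then pair up with the earlier, larger message, and the MPI library reports a truncation error (or the job simply hangs) at a call that looks harmless by itself, much like the mp_bcasti call above.

program out_of_step
  use mpi
  implicit none
  integer :: ierr, rank
  integer :: big(100), small(1)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  big = 0
  small = 0
  if (rank /= 1) then
     ! Every rank except rank 1 takes part in this broadcast of 100 integers.
     call MPI_Bcast(big, 100, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  end if
  ! Rank 1 arrives here directly; its 1-integer receive can end up matched
  ! against the 100-integer broadcast above, so the MPI library may report
  ! MPI_ERR_TRUNCATE (or the job may simply hang), even though this call
  ! by itself is perfectly consistent.
  call MPI_Bcast(small, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  call MPI_Finalize(ierr)
end program out_of_step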