huge memory per node when upscaling application

Report or discuss software problems and other woes

Moderators: arango, robertson

konsole
Posts: 11
Joined: Wed May 04, 2016 12:51 pm
Location: Institute for Marine and Antarctic Science UTAS

huge memory per node when upscaling application

#1 Unread post by konsole »

Dear experienced ocean modellers,

When I scale my application up from 10 km to 2 km resolution, I have trouble getting the model to run with a sensible tiling. Below is a table of previous successful and failed runs; the heads of the corresponding error messages are at the end of the post:

Code:

Res (km) | Cells_i * Cells_j | Tile_i * Tile_j | ncpu | mem_req (GB) | mem_used (GB) | mem_req/cpu (GB) | ncpus/node | mem_req/node (GB) | status
      10 |  630 *  530       | 16 * 16         |  256 |       96     |   44.26       |      0.38        |     16     |        6.00       | ok
       4 | 1575 * 1325       | 48 * 48         | 2304 |     3072     | 2900.00       |      1.33        |     16     |       21.33       | ok
       2 | 3150 * 2650       | 96 * 96         | 9216 |    18432     |       ?       |      2.00        |     16     |       32.00       | error 1
       2 | 3151 * 2650       | 96 * 96         | 9216 |    36864     |       ?       |      4.00        |     16     |       64.00       | error 2
       2 | 3151 * 2650       | 64 * 64         | 4096 |    16384     |       ?       |      4.00        |     16     |       64.00       | error 2
       2 | 3152 * 2650       | 56 * 56         | 3136 |    14336     |       ?       |      4.57        |     28     |      128.00       | error 3
       2 | 3153 * 2650       | 56 * 28         | 1568 |    14336     | 4630.00       |      9.14        |     28     |      256.00       | ok
The only working combination (56x28 on the new NCI Broadwell nodes) does not reflect the almost square shape of the domain, and it leaves me waiting in the queue for days, since I am requesting about 20% of NCI's nodes of this type. The only reason I can spot (leaving the error messages aside) for the other 2 km setups failing is memory per CPU. NCI support has already pointed out that 9 GB per CPU seems ridiculously high. Does anyone have experience with ROMS applications this large, or has anyone seen similar behaviour?
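For a sense of scale, here is a crude back-of-envelope estimate of what a single global double-precision 3D field costs at 2 km resolution (my own arithmetic, not a profiler measurement):

Code:

# One global double-precision 3D field at 2 km (3150 x 2650 x 31 points, 8 bytes each)
echo "3150*2650*31*8/1024^3" | bc -l    # ~1.93 GiB

The per-tile share of the state arrays should be much smaller than this, so if a single rank has to hold several such global fields (e.g. for gathering I/O), that alone could push it towards the per-process numbers above.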

Thanks for any thoughts on this issue,

Ole


##########################################
error 1:
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[32568,0],0] on node r228
Remote daemon: [[32568,0],369] on node r1860

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002B3C8697B7E0 Unknown Unknown Unknown
libpthread-2.12.s 00002B3C86978490 pthread_spin_init Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002B2AB175F7E0 Unknown Unknown Unknown
libmlx4-rdmav2.so 00002B2AC888EB18 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002AC8D98087E0 Unknown Unknown Unknown
...

###################################################
error 2:
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002B8BEC4F77E0 Unknown Unknown Unknown
mca_btl_openib.so 00002B8C066437C2 Unknown Unknown Unknown
mca_btl_openib.so 00002B8C06643AFE Unknown Unknown Unknown
libopen-pal.so.40 00002B8BF0D9450C opal_progress Unknown Unknown
libmpi.so.40.10.0 00002B8BEBF52CD6 Unknown Unknown Unknown
libmpi.so.40.10.0 00002B8BEBF52D19 ompi_request_defa Unknown Unknown
libmpi.so.40.10.0 00002B8BEBFD5FBF ompi_coll_base_bc Unknown Unknown
libmpi.so.40.10.0 00002B8BEBFD67E2 ompi_coll_base_bc Unknown Unknown
mca_coll_tuned.so 00002B8C090DACB6 ompi_coll_tuned_b Unknown Unknown
libmpi.so.40.10.0 00002B8BEBF6FDAD MPI_Bcast Unknown Unknown
libmpi_mpifh.so.4 00002B8BEBCE5F0C Unknown Unknown Unknown
oceanM 000000000045AF65 Unknown Unknown Unknown
oceanM 00000000006D5BFA Unknown Unknown Unknown
oceanM 000000000068D83D Unknown Unknown Unknown
oceanM 000000000047A94B Unknown Unknown Unknown
oceanM 000000000040CE29 Unknown Unknown Unknown
oceanM 000000000040C478 Unknown Unknown Unknown
oceanM 000000000040C25E Unknown Unknown Unknown
libc-2.12.so 00002B8BEC723D1D __libc_start_main Unknown Unknown
oceanM 000000000040C169 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002AB99828B7E0 Unknown Unknown Unknown
libpthread-2.12.s 00002AB998288453 pthread_spin_lock Unknown Unknown
libmlx4-rdmav2.so 00002AB9AB5005D0 Unknown Unknown Unknown
...

##############################################
error 3:
------------------------------------------------------------------------
Job 8680424.r-man2 has exceeded memory allocation on node r3760
Process "orted", pid 22075, rss 28557312, vmem 401702912
Process "oceanM", pid 22119, rss 4919689216, vmem 5650546688
Process "oceanM", pid 22120, rss 4908826624, vmem 5648678912
Process "oceanM", pid 22121, rss 4914126848, vmem 5648465920
Process "oceanM", pid 22122, rss 4921999360, vmem 5648588800
Process "oceanM", pid 22123, rss 4911214592, vmem 5648367616
Process "oceanM", pid 22124, rss 4918145024, vmem 5648261120
Process "oceanM", pid 22125, rss 4916121600, vmem 5648224256
Process "oceanM", pid 22126, rss 4908433408, vmem 5648265216
Process "oceanM", pid 22127, rss 4912648192, vmem 5647982592
Process "oceanM", pid 22128, rss 4919373824, vmem 5648101376
Process "oceanM", pid 22129, rss 4914343936, vmem 5647953920
Process "oceanM", pid 22130, rss 4916404224, vmem 5648113664
Process "oceanM", pid 22131, rss 4930711552, vmem 5648183296
Process "oceanM", pid 22132, rss 4917096448, vmem 5648289792
Process "oceanM", pid 22133, rss 4921593856, vmem 5648162816
Process "oceanM", pid 22134, rss 4922822656, vmem 5648314368
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

HNP daemon : [[32568,0],0] on node r228
Remote daemon: [[32568,0],369] on node r1860

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002B3C8697B7E0 Unknown Unknown Unknown
libpthread-2.12.s 00002B3C86978490 pthread_spin_init Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002B2AB175F7E0 Unknown Unknown Unknown
libmlx4-rdmav2.so 00002B2AC888EB18 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanM 00000000007B57CE for__signal_handl Unknown Unknown
libpthread-2.12.s 00002AC8D98087E0 Unknown Unknown Unknown

Process "oceanM", pid 22135, rss 4921352192, vmem 5648011264
Process "oceanM", pid 22136, rss 4918321152, vmem 5648138240
Process "oceanM", pid 22137, rss 4927127552, vmem 5647982592
Process "oceanM", pid 22138, rss 4932517888, vmem 5648523264
Process "oceanM", pid 22139, rss 4924923904, vmem 5648207872
Process "oceanM", pid 22140, rss 4925964288, vmem 5648740352
Process "oceanM", pid 22141, rss 4935540736, vmem 5648572416
Process "oceanM", pid 22142, rss 4930662400, vmem 5648318464
Process "oceanM", pid 22143, rss 4917063680, vmem 5648359424
Process "oceanM", pid 22144, rss 4928651264, vmem 5648490496
Process "oceanM", pid 22145, rss 4922134528, vmem 5648486400
Process "oceanM", pid 22146, rss 4912766976, vmem 5635805184
------------------------------------------------------------------------
For more information visit https://opus.nci.org.au/x/SwGRAQ
------------------------------------------------------------------------

------------------------------------------------------------------------
Job 8680424.r-man2 has exceeded memory allocation on node r3758
Process "orted", pid 3379, rss 28434432, vmem 401707008
Process "oceanM", pid 3423, rss 4895670272, vmem 5635981312
Process "oceanM", pid 3424, rss 4907446272, vmem 5648654336
Process "oceanM", pid 3425, rss 4902199296, vmem 5648379904

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: huge memory per node when upscaling application

#2 Unread post by kate »

My jobs look a lot more like your 10 km run in size. This code was not designed for processor counts as large as you are trying. Specifically, how are you managing I/O? Are you trying some sort of parallel I/O? What I've heard recommended is something like NCAR's parallel I/O (PIO) library or the European XIOS, where roughly one process per node is dedicated to I/O. What we have now is one process doing all of the I/O as well as the computing for tile 0, unless you try the PARALLEL_IO flag, in which case all of the processes take part in both I/O and computing.

konsole
Posts: 11
Joined: Wed May 04, 2016 12:51 pm
Location: Institute for Marine and Antarctic Science UTAS

Re: huge memory per node when upscaling application

#3 Unread post by konsole »

Hey Kate,

thanks for your reply. I/O could perhaps explain my weird HPC errors. I wasn't aware of how ROMS manages I/O and did not have any parallel options active.
I tried activating PARALLEL_IO in combination with either the HDF5 or the PNETCDF flag. My activated makefile options are USE_MPI, USE_NETCDF4, and USE_MPIf90. During compilation I get an error that I found unsolved in an earlier forum post.

The error:

cd Build; /apps/openmpi/wrapper/fortran/mpif90 -c -heap-arrays -fp-model precise -ip -O3 output.f90
nf_fread2d.f90(306): error #6285: There is no matching specific subroutine for this generic subroutine call. [MP_COLLECT]
CALL mp_collect (ng, model, Npts, IniVal, Awrk)
-----------------^

The post:
viewtopic.php?f=17&t=4675.


Do you know if anyone has implemented NCAR's parallel I/O library or the European XIOS in ROMS?

Thanks,

ole

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: huge memory per node when upscaling application

#4 Unread post by kate »

The French ROMS (CROCO) comes with XIOS, but last I checked it was really only used for output.

I tried and failed to get the NCAR package to work with ROMS several years ago. I might still have that branch if you want to play with it. I got it to run with all the cores in the I/O pool, but not with one core per node.

arango
Site Admin
Posts: 1367
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

Re: huge memory per node when upscaling application

#5 Unread post by arango »

First, I don't think that ROMS will run with the PNETCDF library; it would need special function calls in ROMS, and I have never tried that third-party library. If you use PARALLEL_IO, you need to compile with a NetCDF-4 library built with parallel MPI I/O and HDF5 support, and your computer needs the hardware and communication fabric for parallel I/O.
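For example, you can query the installed NetCDF build for the required support with nc-config (assuming it is in your path; the --has-parallel query is only reported by newer versions):

Code:

% nc-config --version
% nc-config --has-hdf5        # must be "yes" for NetCDF-4/HDF5 files
% nc-config --has-pnetcdf     # classic-format parallel I/O via PnetCDF
% nc-config --has-parallel    # reported by newer nc-config versions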

This seems to be a large application, and you need to find the optimal tile partition for it. If you use an excessive number of CPUs, you will be penalized by excessive MPI communication and your application will run slower. One needs to profile and find the optimal parallel configuration. This is usually done by trying various partitions and profiling with and without I/O. There is no magic formula we can give you, since there are many things to consider. I always recommend fewer partitions in X and many more in Y, to facilitate vectorization and contiguous memory access in X. Of course, one also needs to make sure that the tiled arrays fit in the memory of each node.
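As an illustration of the trade-off, the sketch below lists the interior tile size each MPI process would own for a few partitions of the 2 km grid at a fixed 4096 processes (NtileI and NtileJ are the tile counts set in the ROMS input script; their product must equal the number of MPI processes, and the numbers here are only examples):

Code:

# Tile size per MPI process for a 3150 x 2650 interior grid,
# keeping NtileI * NtileJ = 4096 while changing the aspect ratio.
for p in "8 512" "16 256" "32 128" "64 64"; do
  set -- $p
  echo "NtileI=$1 NtileJ=$2  processes=$(( $1 * $2 ))  tile ~ $(( 3150 / $1 )) x $(( 2650 / $2 ))"
done

A smaller NtileI keeps longer contiguous rows in X within each tile, which is the vectorization point made above.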

konsole
Posts: 11
Joined: Wed May 04, 2016 12:51 pm
Location: Institute for Marine and Antarctic Science UTAS

Re: huge memory per node when upscaling application

#6 Unread post by konsole »

I've checked with the supercomputer support team and they are confident that we have set up the right environment for parallel I/O to work. When I activate PARALLEL_IO and HDF5, I get the following error message:


cd Build; /apps/openmpi/wrapper/fortran/mpif90 -c -heap-arrays -fp-model precise -ip -O3 nf_fread2d.f90
nf_fread2d.f90(306): error #6285: There is no matching specific subroutine for this generic subroutine call. [MP_COLLECT]
CALL mp_collect (ng, model, Npts, IniVal, Awrk)
-----------------^
compilation aborted for nf_fread2d.f90 (code 1)
make: *** [Build/nf_fread2d.o] Error 1


What seems to work instead is activating INLINE_2DIO, which, if I understand correctly, reads and writes 3D I/O fields level by level as 2D slices. I guess this makes I/O very slow.

Why should I favour X over Y? My domain is almost square (X is 18% larger than Y). Would you still recommend starting the profiling with something like 28 x 56?

Thanks

konsole
Posts: 11
Joined: Wed May 04, 2016 12:51 pm
Location: Institute for Marine and Antarctic Science UTAS

Re: huge memory per node when upscaling application

#7 Unread post by konsole »

The bug with mp_collect has been fixed in the latest version of ROMS. I've managed to compile after updating nf_fread2d.f90.

Now I get a different error during initialization:

INITIAL: Configuring and initializing forward nonlinear model ...
*******

NETCDF_OPEN - unable to open existing NetCDF file for parallel access:
waom10_grd.nc
call from: get_grid.F
[...]

ROMS/TOMS - Output NetCDF summary for Grid 01:

ROMS/TOMS - Output error ............ exit_flag: 3


ERROR: Abnormal termination: NetCDF OUTPUT.
REASON: NetCDF: Invalid argument

I've made sure all my input files are converted to the same NetCDF-4 version I'm running and compiling the application with, using ncdump infile.nc | ncgen -b -O infile.nc

Do you know what I might be missing here?

Thanks,

ole

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: huge memory per node when upscaling application

#8 Unread post by kate »

Yes, you need to use ncgen with a format_code or format_name that gives you a netCDF4 (HDF5) file, i.e., the "-k nc4" option or maybe "-k nc7".

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: huge memory per node when upscaling application

#9 Unread post by kate »

Also, you can use nccopy instead of ncdump | ncgen.

konsole
Posts: 11
Joined: Wed May 04, 2016 12:51 pm
Location: Institute for Marine and Antarctic Science UTAS

Re: huge memory per node when upscaling application

#10 Unread post by konsole »

I get the same error with both nc4 and nc7. Did it ever work for you, Kate? If so, which versions of netCDF-4, OpenMPI, and HDF5 did you use?

Thanks

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: huge memory per node when upscaling application

#11 Unread post by kate »

To be honest, I refused to convert all my input files, so I split PARALLEL_IO into PARALLEL_IN and PARALLEL_OUT in my ROMS code. It's been years since I tried any of these things.

As for the rest, I'm working towards switching from ROMS to MOM6 in part because it is designed to run on thousands of cores.

arango
Site Admin
Posts: 1367
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

Re: huge memory per node when upscaling application

#12 Unread post by arango »

PARALLEL_IO works for me. You need to use a recent version of the code. Also, make sure that all your input NetCDF files are NetCDF-4 format. You can check which format you have using ncdump:

Code:

% ncdump -k filename.nc
netCDF-4
To convert from NetCDF-3 to NetCDF-4, try:

Code:

% nccopy -k netCDF-4 nc3name.nc nc4name.nc
I always keep the nc3 and nc4 files in separate directories.
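For example, a small script like the sketch below would check and convert a whole set of input files in one go (only waom10_grd.nc appears in this thread; the other file names are placeholders):

Code:

# Report the on-disk format of each input file and write a NetCDF-4
# copy of any classic (NetCDF-3) file into a separate nc4/ directory.
mkdir -p nc4
for f in waom10_grd.nc waom10_ini.nc waom10_bry.nc; do
  fmt=$(ncdump -k "$f")
  echo "$f : $fmt"
  case "$fmt" in
    netCDF-4*) ;;                              # already HDF5-based
    *) nccopy -k netCDF-4 "$f" "nc4/$f" ;;     # convert classic files
  esac
done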

By the way, if you Google how to do this you will find detailed information.

konsole
Posts: 11
Joined: Wed May 04, 2016 12:51 pm
Location: Institute for Marine and Antarctic Science UTAS

Re: huge memory per node when upscaling application

#13 Unread post by konsole »

OK, I think we are a step further here. I've managed to run PARALLEL_IO successfully with the most recent ROMS version and the UPWELLING and INLET_TEST test cases. It also works for my application when I only write out 2D fields.

However, when I start writing out 3D fields, the run crashes with a mysterious segmentation fault. If I deactivate PARALLEL_IO, I get an error at the same point which might carry more information:

Code:

Basin information for Grid 01:

 Maximum grid stiffness ratios:  rx0 =   3.558075E-01 (Beckmann and Haidvogel)
                                 rx1 =   1.823968E+02 (Haney)

 Initial domain volumes:  TotVolume =  6.4754640632E+16 m3
                         MinCellVol =  5.6669863450E+07 m3
                         MaxCellVol =  4.6874977906E+10 m3
                            Max/Min =  8.2715882926E+02

 NL ROMS/TOMS: started time-stepping: (Grid: 01 TimeSteps: 000000000001 - 000000001440)


 TIME-STEP YYYY-MM-DD hh:mm:ss.ss  KINETIC_ENRG   POTEN_ENRG    TOTAL_ENRG    NET_VOLUME
                     C => (i,j,k)       Cu            Cv            Cw         Max Speed

         0 2007-01-01 00:00:00.00  0.000000E+00  1.930242E+04  1.930242E+04  6.742933E+16
                     (000,000,00)  0.000000E+00  0.000000E+00  0.000000E+00  0.000000E+00
      DEF_HIS     - creating  history      file, Grid 01: ocean_his.nc
[r84:19289:0:19289] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ffc56c123a8)
[r84:19290:0:19290] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fffc4fc3528)
[...]
==== backtrace ====
==== backtrace ====
 0 0x000000000047b79d nf_fwrite3d_mod_mp_nf_fwrite3d_()  ???:0
 1 0x000000000072a788 wrt_his_()  ???:0
 2 0x0000000000488b46 output_()  ???:0
 3 0x0000000000480928 main3d_()  ???:0
 4 0x000000000040c9c9 ocean_control_mod_mp_roms_run_()  ???:0
 5 0x000000000040c794 MAIN__()  ???:0
 6 0x000000000040c59e main()  ???:0
 7 0x000000000001ed1d __libc_start_main()  ???:0
 8 0x000000000040c4a9 _start()  ???:0
===================
==== backtrace ====
 0 0x000000000047b79d nf_fwrite3d_mod_mp_nf_fwrite3d_()  ???:0
 1 0x000000000072a788 wrt_his_()  ???:0
 2 0x0000000000488b46 output_()  ???:0
[...]
The weird thing is that it works when I don't write out any 3D fields, and also for the ROMS test cases mentioned above (with both 2D and 3D output). With some troubleshooting I've narrowed the source down to the grid file provided: the exact same application works when I supply inlet_test_grid.nc, but fails when I change the two fields in the grid file to the ones needed for my application. I've checked all the attributes and formats of the grid variables and hope I haven't overlooked a small detail. The only difference I can see is the size: my application is now 630x530x31 instead of the 75x70x16 of the inlet test case.
I compared the backtraced nf_fwrite3d.f90 from this ROMS version to the one I used before (2016), and indeed mp_gather3d has had some major updates. I've also tried with more memory (96 GB) to rule that end out, and all files are in NetCDF-4 format.

I'm a bit stuck here. Does anyone recognise the issue or have any ideas about what could cause it? The code seems fine, but I can't spot a mistake in my grid file either.
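In case anyone wants to double-check the grid files, a simple way to compare their structure (header only, no data; the output file names are arbitrary) is:

Code:

# Diff variable names, dimensions and attributes of the two grid files
ncdump -h inlet_test_grid.nc > inlet_hdr.txt
ncdump -h waom10_grd.nc      > waom_hdr.txt
diff -u inlet_hdr.txt waom_hdr.txt | less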

Thanks,

ole

arango
Site Admin
Posts: 1367
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

Re: huge memory per node when upscaling application

#14 Unread post by arango »

I usually put lots of information in trac when I update the code. We have noticed that this information is generally ignored and our emails are treated as spam. There is nothing we can do about it, and we cannot force users to update and/or read the detailed information provided. We have thousands of users, and we don't have the time to look for answers to questions in this forum that are incomplete and lack enough detail for us to diagnose the problem. If you want to get our attention and curiosity, you need to provide enough information.

Copying the error here doesn't tell us much. For example, you just said that two fields were changed in the grid NetCDF file, but you do not elaborate on the changes. What computer architecture are you using? What is the memory limit per node? Which compiler? And so on ...

I think that you need to read and understand the information provided in the following trac ticket about MPI collective communications and NetCDF-4 parallel I/O.

I said above that one always needs to find the optimal parallel partition and number of CPUs for an application. I mentioned the concept of vectorization in the X-direction, and you missed the point that I was trying to make: it has nothing to do with the shape of the grid, square or not. You should run the BENCHMARK application that comes with ROMS. It can be configured with 512x64x30, 1024x128x30, or 2048x256x30 points, and you may activate I/O or not. Try running the large case with 1x32, 1x64, and 1x128 partitions, and then try increasing the number of CPUs in X. Which one is faster? Then translate that concept to your large application. Be aware that too many tile partitions (or CPUs), if they are not optimal, will hurt performance in distributed-memory runs because of the excessive communication between nodes. In shared-memory runs you don't have this kind of problem, but all the state arrays are global, so the limiting factor is fitting the arrays in memory.
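A rough sketch of such a partition scan is shown below. It assumes a standard mpirun launcher and that the tile counts are edited into the BENCHMARK input script for each run; the file name ocean_benchmark3.in and the sed commands are only one example of how to do it:

Code:

# Time the large BENCHMARK case for several NtileI x NtileJ partitions.
for p in "1 32" "1 64" "1 128" "2 64" "4 32"; do
  set -- $p
  sed -e "s/NtileI ==.*/NtileI == $1/" \
      -e "s/NtileJ ==.*/NtileJ == $2/" \
      ocean_benchmark3.in > bench_${1}x${2}.in
  ( time mpirun -np $(( $1 * $2 )) ./oceanM bench_${1}x${2}.in ) \
      > bench_${1}x${2}.log 2>&1
done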

The fact that you need to activate INLINE_2DIO to run successfully tells me that you are running into memory limits. Is the memory demand coming from the stack or from the heap? Your application is blowing up because of memory corruption; that is what all those libpthread-2.12 errors at the top are saying.

konsole
Posts: 11
Joined: Wed May 04, 2016 12:51 pm
Location: Institute for Marine and Antarctic Science UTAS

Re: huge memory per node when upscaling application

#15 Unread post by konsole »

I'm working on a distributed-memory cluster that offers nodes with either 16 processors (32 or 64 GB of memory) or 28 processors (128 or 256 GB of memory). I use the following software to compile and run ROMS: intel-fc/2018.1.163, netcdf/4.6.1p (p stands for parallel), openmpi/3.1.0 and hdf5.

I've done the benchmarking as suggested and found an optimal configuration, which indeed uses fewer processors than I would have thought.
However, for this configuration to run I have to request exclusively the nodes with 256 GB of memory; if I request nodes with less memory, I get a memory error. Requesting only high-memory nodes results in a PBS queueing time of several days, since only a few nodes of this type are available. If I could use nodes with less memory (e.g. 128 GB per node), more nodes would be available and the queueing time would shorten.

We suspect that ROMS doing all its I/O through a single process is responsible for the high memory demand, so we hope to lower the memory requirement with PARALLEL_IO.

Following the trac ticket you pointed to, https://www.myroms.org/projects/src/ticket/747, I need to activate at least the following CPP options (see the build sketch after this list):

PARALLEL_IO
HDF5
COLLECT_ALLREDUCE

and make sure all input files are netCDF-4.
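A minimal sketch of how I plan to switch these on through the standard ROMS build script (assuming build.bash and its MY_CPP_FLAGS mechanism; please correct me if this is not the intended way):

Code:

# In build.bash, or exported before running it: keep the usual MPI/NetCDF4
# switches and append the CPP options from the trac ticket.
export USE_MPI=on
export USE_MPIF90=on
export USE_NETCDF4=on
export MY_CPP_FLAGS="${MY_CPP_FLAGS} -DPARALLEL_IO"
export MY_CPP_FLAGS="${MY_CPP_FLAGS} -DHDF5"
export MY_CPP_FLAGS="${MY_CPP_FLAGS} -DCOLLECT_ALLREDUCE"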

Can you confirm that using PARALLEL_IO will reduce the memory demand if I/O is indeed the issue? Is there a way to test this before I merge our version of ROMS with the latest official version to get PARALLEL_IO?

How do I find out whether the memory demand comes from the stack or from the heap?
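One thing I could try (I'm not sure whether this is the right way to check) is to look at the kernel's per-process accounting on a compute node while the job runs; I also notice our compile lines use -heap-arrays, which, as far as I understand, places Fortran automatic arrays on the heap:

Code:

# Inspect a running oceanM process (VmData is roughly heap, VmStk is stack)
pid=$(pgrep -n oceanM)
grep -E 'VmRSS|VmData|VmStk' /proc/$pid/status
ulimit -s     # current stack size limit, for comparison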

Thanks
