Parallel IO not scaling over Serial IO for ROMS
Hello All,
I am working on I/O scalability for the ROMS model on a Cray XC40 machine.
I have some observations and questions, as below:
1. When I am not writing any NetCDF output files, I observe that the model scales only up to 960 processors (PEs). Is there any reason for that?
(Strong scaling results -- varying PEs with a fixed problem size.)
[Reference -- Figure 1]
2. When I am writing output NetCDF files with the USE_PARALLEL_IO flag (screenshot attached) turned ON (parallel NetCDF) and OFF (serial NetCDF), I observe the following:
a. Both serial and parallel I/O scale only up to 240 PEs. Beyond that, the I/O time increases a lot. Is there any reason for that?
b. Parallel I/O takes more time than serial I/O. Can you please let me know which flags I need to turn ON for NetCDF parallel I/O?
(I have plotted the I/O time by calculating the difference in execution time between experiments with NetCDF output files and without any NetCDF output file.)
[Reference -- Figure 2]
--> I have checked with print statements that both the PARALLEL_IO and HDF5 macros are NOT turned ON even when the USE_PARALLEL_IO flag is ON.
--> For both serial and parallel I/O, only the DISTRIBUTE macro is ON.
--> The NetCDF file format is nf90_64bit_offset in both cases (a quick check for this is sketched at the end of this post).
3. As the model scales up to 960 PEs without I/O but only up to 240 PEs with I/O, is there any way I can select only a subset of the PEs for the I/O part?
[Reference -- Figure 1 and 2]
Can you please look into the above issues and let me know the likely reasons for these results, and possible solutions?
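For reference, this is how the output file format can be confirmed from the shell (just a sketch; ncdump comes with the standard netCDF utilities, and the file name is a placeholder):

# Print the format variant of a ROMS output file. "classic" or "64-bit offset"
# means the netCDF-3 code path was used; parallel I/O through HDF5 would show
# "netCDF-4" here.
ncdump -k ocean_his.nc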
Regards,
Koushik Sen
- Attachments
- Figure 1 - Execution Time without IO (No Output to NetCDF files).png
- Figure 2 - Serial vs Parallel IO time comparison.png
- Build.sh_setting.png
Re: Parallel IO not scaling over Serial IO for ROMS
Many years ago our supercomputer center bought a Cray and an IBM. The timings I did without I/O led me to believe that the Cray was the machine for me. However, my real jobs have I/O and that meant that the IBM was actually faster for my problems. At that time, I didn't have a vectorized netCDF library on the Cray and it was therefore quite slow. Your results do not surprise me.
As for parallel I/O, what code are you using? I tried splitting PARALLEL_IO into PARALLEL_IN and PARALLEL_OUT so that I wouldn't have to convert all my inputs to HDF5, yet still get the benefits of parallel output. It's been a while since I checked, but it didn't speed things up. I heard that at that time, there was slowness in the netCDF implementation of writing to HDF5.
NCAR has an annual workshop on these sorts of software engineering issues. The time I got to go, there was a presentation on writing files in parallel. Their suggestion was that for the Texas Stampede supercomputer, you want to have one core writing per 18-core node. If you can have your writers not also compute, that helps. Packages like NCAR's parallel_io library can manage splitting the writing from the computing, but I didn't get that to work for ROMS some years ago. The NCAR package is now on github: https://github.com/NCAR/ParallelIO and is undergoing active development, so it might be worth trying again.
- arango
Re: Parallel IO not scaling over Serial IO for ROMS
Great study. We are missing important information about your simulations. We need the grid size and the parallel domain decomposition. I have mentioned frequently in this forum that ROMS scalability is a function of the application and depends on the domain size and parallel partition. There is always an optimal number of PEs for an application in distributed-memory. As the tiles become very small because of the increasing number of PEs, scalability is heavily penalized by excessive MPI collective communications during the gathering/scattering operations (I/O) and the parallel exchanges between tiles. Usually, it is advantageous to have more partitions in the J-direction (NtileJ) than in the I-direction (NtileI) to facilitate vectorization. This follows from Fortran memory order: the I-index is the most rapidly varying index. We need to avoid excessive memory page faulting.
I recommend that you study the following trac ticket, which discusses how to optimize MPI communication and NetCDF-4 parallel I/O. I have never had access to a supercomputer with native parallel I/O software and hardware architecture to test and examine the scalability of the NetCDF-4/HDF5 library. My experience is similar to yours. The serial I/O is quite efficient, and the NetCDF-4 path is somewhat inefficient and very difficult to examine in the parallel TotalView debugger because it has lots of recursive calls. Of course, that was years ago; I don't know how much it has changed since then.
I don't know enough about the Cray XC40 to say what would happen if we do the computations in single precision (the SINGLE_PRECISION CPP option). We would need to have the proper kind definitions in ROMS/Modules/mod_kinds.F. Another CPP definition that one may turn on and off is OUT_DOUBLE, to write the output NetCDF variables in double or single precision. I am sure that will change your scalability figures.
It is clear from your figures what your optimal number of PEs is. I would pay close attention to the end of the ROMS standard output to check which operations are consuming the CPU.
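For example (just a sketch; this assumes the MY_CPP_FLAGS mechanism of the standard build script, and you can equally well add the definitions to your application header):

# Hypothetical sketch: pass the extra CPP definitions through the build script
# instead of editing the header (variable name as in the standard build.bash).
export MY_CPP_FLAGS="${MY_CPP_FLAGS} -DSINGLE_PRECISION"   # single-precision computations
export MY_CPP_FLAGS="${MY_CPP_FLAGS} -DOUT_DOUBLE"         # double-precision NetCDF output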
- jivica
Re: Parallel IO not scaling over Serial IO for ROMS
I'm using a Cray XC40 and must admit I am quite happy with its performance (dragonfly network topology, 24-core nodes).
I did some benchmarking, and it really depends on your application and grid size. Note that depending on the user load on scratch you can get totally different results (and you can tweak parallel write/read with lfs setstripe on Lustre), and it also depends on what optimisation you are using, etc. Have a look at the older post where I included a speedup plot:
viewtopic.php?f=16&t=4854&p=18873&hilit=CRAY#p18873
Cheers,
I.
Re: Parallel IO not scaling over Serial IO for ROMS
arango wrote: ↑Thu Dec 05, 2019 4:54 pm Great study. We are missing important information about your simulations. We need the grid size and the parallel domain decomposition. [...]
Thanks for the information and suggestions.
Please find below the parameters I have used for the experiments:
Ngrids = 1 !Number of nested grids
NestLayers = 1 !Number of grid nesting layers.
GridsInLayer = 1 !Number of grids in each nesting layer
Lm == 899 ! Number of I-direction INTERIOR RHO-points
Mm == 629 ! Number of J-direction INTERIOR RHO-points
N == 40 ! Number of vertical levels
1. Is there any method by which I can assign a subset of PEs to the I/O in the ROMS setup (let's say only one PE per node doing I/O, or something similar)?
2. Can you please let me know how I should set the SINGLE_PRECISION and OUT_DOUBLE flags?
3. On the portal it is mentioned that "In ROMS, the distributed memory I/O is all happening on the master process (0) unless you specifically ask it to use MPI-I/O, which requires both the NETCDF4 and PARALLEL_IO cpp flags to be defined."
Even after turning both of them ON in the build script, I checked by adding print statements that the PARALLEL_IO and HDF5 macros are not activated, and only the DISTRIBUTE macro is activated, for both serial and parallel NetCDF.
Can you give me some idea of how to debug this further?
4. Are there any flags other than USE_PARALLEL_IO and USE_NETCDF4 that need to be turned ON to use parallel NetCDF with ROMS? (A sketch of the build-script settings I believe are relevant is at the end of this post.)
5. Please let me know what the optimal values of these parameters are for serial NetCDF and parallel NetCDF:
! NetCDF-4/HDF5 compression parameters for output files.
NC_SHUFFLE = 1 ! if non-zero, turn on shuffle filter
NC_DEFLATE = 1 ! if non-zero, turn on deflate filter
NC_DLEVEL = 1 ! deflate level [0-9]
Please let me know if you need some other details regarding the setup.
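For reference, a minimal sketch of the build-script settings I understand to be relevant (based on what I believe are the standard build.bash variable names -- please correct me if more is needed; the nc-config path is a placeholder, and I understand the NetCDF/HDF5 libraries themselves must be built with MPI/parallel support):

# Build with MPI and the mpif90 wrapper
export USE_MPI=on
export USE_MPIF90=on
# Ask for NetCDF-4/HDF5 output plus parallel I/O; together these should be
# what defines the HDF5 and PARALLEL_IO macros at compile time
export USE_NETCDF4=on
export USE_PARALLEL_IO=on
# Point the build at a NetCDF installation built against parallel HDF5
export NC_CONFIG=/path/to/parallel-netcdf4/bin/nc-config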
Re: Parallel IO not scaling over Serial IO for ROMS
You didn't answer this question: As for parallel I/O, what code are you using? I tried splitting PARALLEL_IO into PARALLEL_IN and PARALLEL_OUT so that I wouldn't have to convert all my inputs to HDF5, yet still get the benefits of parallel output. It's been a while since I checked, but it didn't speed things up. I heard that at that time, there was slowness in the netCDF implementation of writing to HDF5.
Re: Parallel IO not scaling over Serial IO for ROMS
Hello,
jivica wrote: ↑Fri Dec 06, 2019 5:00 am I'm using CRAY XC40 and must admit, quite happy with performance (dragonfly net topology), 24 core nodes. [...]
Thanks for the information.
The one I am using is an Intel Haswell 2.5 GHz based CPU cluster with 1376 nodes; each node has 2 CPU sockets with 12 cores each and 128 GB RAM, connected using the Cray Aries interconnect in a dragonfly topology.
From this post (viewtopic.php?f=16&t=4854&p=18873&hilit=CRAY#p18873)
I see that the optimal tile size is about 20x20 points per core.
I have the following setup:
Lm == 899 ! Number of I-direction INTERIOR RHO-points
Mm == 629 ! Number of J-direction INTERIOR RHO-points
N == 40 ! Number of vertical levels
So that suggests approximately 45 x 32 = 1440 cores.
1. I think my results are consistent with your observations when I/O is turned OFF in my case. (For my experiment, it scales to somewhere between 1000 and 1400 PEs, as in Figure 1.)
But the scalability issue appears when I/O is turned ON (the same model scales only up to 240 PEs).
Can you please let me know what optimizations are possible for improving serial NetCDF or parallel NetCDF performance?
2. Is there an optimal value for the Lustre stripe count (lfs setstripe) based upon the file size and the grid size?
Re: Parallel IO not scaling over Serial IO for ROMS
Hello,
kate wrote: ↑Thu Dec 05, 2019 4:46 pm Many years ago our supercomputer center bought a Cray and an IBM. [...]
Thanks for the feedback.
1. I am using /ROMS/Modules/mod_netcdf.F for the NetCDF implementation.
2. How can PARALLEL_IN and PARALLEL_OUT be used separately?
Re: Parallel IO not scaling over Serial IO for ROMS
Where did you get your ROMS code from? That's what I was asking. Hernan's code has PARALLEL_IO, but you say turning that on isn't doing anything. My code does not recognize PARALLEL_IO but instead recognizes PARALLEL_IN and PARALLEL_OUT. A quick check of cppdefs.h should tell you what the options are in *your* code.
Re: Parallel IO not scaling over Serial IO for ROMS
kate wrote: ↑Fri Dec 06, 2019 7:11 pm Where did you get your ROMS code from? [...]
I have attached my cppdefs.h and build script for your reference.
I have the PARALLEL_IO option in my cppdefs.h file.
I downloaded the code using svn checkout https://www.myroms.org/svn/src/trunk.
Re: Parallel IO not scaling over Serial IO for ROMS
Please attach your gbplume.h file where your chosen CPP Definitions are set.
- arango
Re: Parallel IO not scaling over Serial IO for ROMS
An application of size 899x629x40 is TOO small to be run on 1350 PEs. No wonder it is not scaling beyond 240 PEs. You are killing scalability with excessive MPI collective communications and exchanges. A 20x20 tile size is too small, and it is killing vectorization. The good news is that you have found the optimal number of PEs and domain decomposition for your application. That's what I always do in all my applications. The fact that a supercomputer has thousands of PEs available doesn't mean that you need to use all of them to run your application faster and more efficiently.
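As a rough illustration (the decompositions below are hypothetical, not a tuned recommendation for your case), a quick shell check shows why 240 PEs keeps the tiles reasonably large while ~1350 PEs shrinks them to roughly 20x20 points:

# Approximate interior points per tile for two hypothetical decompositions
# of the 899x629 grid (integer division, so the numbers are rough):
echo "240 PEs  (NtileI=8,  NtileJ=30): ~$((899/8)) x $((629/30)) points per tile"    # ~112 x 20
echo "1350 PEs (NtileI=45, NtileJ=30): ~$((899/45)) x $((629/30)) points per tile"   # ~19 x 20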
- jivica
Re: Parallel IO not scaling over Serial IO for ROMS
First of all, you should read your log file (at the end), where you have the profiling of your run. There you can see what portion of the time is spent in I/O and what in the real computing kernel, etc.
In some cases, when I need frequent I/O, I get a bad ratio, but this is to be expected.
I remember Sasa once said he had a version with parallel writes from each thread to its node's local /tmp/, and then he had to pay the price of combining them all together afterwards (this is only for writing).
For example, WRF can set dedicated threads only for I/O.
I had a Cray specialist look at the ROMS code to find some low-hanging fruit for speedup, and without significant interventions in the code he couldn't do much.
Newer MPI has non-blocking options...
Question for Hernan:
How do you organize nested domains in a multi-grid constellation (tiles)?
For example, you have a donor grid (easy: NtileJ * NtileI) and then in layer 2 you have two receiver grids with a significantly smaller number of points. Are they run serially, one after another, or in parallel with a different number of tiles for each of them?
For striping you make the folder aware of it (lfs setstripe); usually I use 8-16 stripes, it doesn't cost anything, and it is a Lustre feature.
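A minimal sketch (the output path is a placeholder; new files created in the directory inherit the striping):

# Stripe the directory that will hold the ROMS output across 8 OSTs
lfs setstripe -c 8 /scratch/$USER/roms_output
# Verify the layout that was set
lfs getstripe /scratch/$USER/roms_output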
Using SINGLE_PRECISION does give speedup in my case.
I wouldn't bother with compression inside ROMS; what I do is use NCO and the newer --ppc (precision-preserving compression) option, which is amazing(!). The files come out smaller than using short (packed) variables, and you don't have to worry about the scaling factor in each file (if you want to ncrcat them afterwards)...
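For example (just a sketch; the file names are placeholders and the number of significant digits is your choice):

# Convert to netCDF-4, deflate at level 1, and keep ~4 significant digits
# for all floating-point variables
ncks -4 -L 1 --ppc default=4 ocean_his.nc ocean_his_small.nc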
Cheers
I
Re: Parallel IO not scaling over Serial IO for ROMS
Please find attached the gbplume.h file.
Please let me know if something needs to be added, as I am new to ROMS.
I think PARALLEL_IO is not being turned on properly, although I have turned on USE_PARALLEL_IO in the build script.
- Attachments
- gbplume.h
- jivica
Re: Parallel IO not scaling over Serial IO for ROMS
As a new user you should definitely read your log file (!) and pay some attention to the stuff inside.
There you can find all the options that you activated (and which are actually used by ROMS; note that you can misspell or use unsupported options during compilation, and they will simply not appear in the log and will not be used in reality). Timings of the parts of the code are at the end; you can see what percentage of time the code spends in I/O, 2D, turbulence, etc., what the maximum CFLs and currents are... a zillion things that you should be aware of.
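A rough sketch, assuming the run's standard output was redirected to roms.log (the exact section wording can differ between ROMS versions):

# Did PARALLEL_IO / HDF5 actually make it into the activated CPP options?
grep -iE "PARALLEL_IO|HDF5|DISTRIBUTE" roms.log
# The timing/profile summary is printed near the end of the log
tail -n 120 roms.log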
Good luck!
I