Hello,
I just started using ROMS. I have been reading the forum and found this viewtopic.php?f=17&t=2001 very helpful, but I still couldn't understand some issues.
I’m looking to optimise performance and would appreciate some insight into the following:
The configuration requires you to specify a number of nodes through the parameters NtileI and NtileJ. Each of these appears to initiate a separate process in memory. Currently we are running as many nodes as CPUs on the box: 8. I am using MPI (although I know that, because I'm using only one machine, OpenMP could be faster). My grid size is 260x440x30 and the tiling is 1x8 (I have tried different tilings such as 2x4, 2x16 and 1x32, as well as higher numbers, but the combination that seems to work best is 1x8). I am using an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz.
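To make the tiling comparison concrete, here is a rough sketch of how the per-process tile size and halo overhead change with NtileI x NtileJ. It is not ROMS's own decomposition code. It assumes the 260 points lie along I and the 440 along J, a halo width of 2 points, and non-periodic boundaries; `tile_summary` and its keys are hypothetical names used only for illustration.

```python
from math import ceil

def tile_summary(lm, mm, ntile_i, ntile_j, halo=2):
    """Approximate size and halo overhead of the largest tile in a 2D block split.

    lm, mm  : horizontal grid points in I and J (assumed 260 and 440 here)
    halo    : halo width in points (assumed value, not read from ROMS)
    """
    ni = ceil(lm / ntile_i)          # tile extent in I
    nj = ceil(mm / ntile_j)          # tile extent in J
    # Worst-case number of neighbouring tiles in each direction (0, 1 or 2).
    neighbours_i = min(2, ntile_i - 1)
    neighbours_j = min(2, ntile_j - 1)
    # Halo strips: an I-neighbour costs a strip nj long, a J-neighbour one ni long.
    halo_points = halo * (neighbours_i * nj + neighbours_j * ni)
    interior = ni * nj
    return {
        "tile": (ni, nj),
        "interior": interior,
        "halo": halo_points,
        "halo_ratio": halo_points / interior,
    }

if __name__ == "__main__":
    for ti, tj in [(1, 8), (2, 4), (4, 2), (8, 1)]:
        s = tile_summary(260, 440, ti, tj)
        print(f"{ti}x{tj}: tile={s['tile']}, halo/interior={s['halo_ratio']:.3f}")
```

Halo volume is only one factor. This sketch says nothing about memory access patterns or I/O, so it does not by itself predict which tiling runs fastest.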
Whilst running the application, we see top reports the following:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18853 leonor 20 0 328188 240988 9744 R 99,0 1,5 436:28.01 oceanM
18855 leonor 20 0 334856 249036 9412 R 98,7 1,5 436:02.14 oceanM
18860 leonor 20 0 329680 240732 6660 R 98,7 1,5 436:29.05 oceanM
18856 leonor 20 0 384296 297268 8076 R 96,7 1,8 436:01.04 oceanM
18857 leonor 20 0 335136 247100 9132 R 96,0 1,5 435:59.11 oceanM
18859 leonor 20 0 337452 248204 8660 R 96,0 1,5 436:05.72 oceanM
18854 leonor 20 0 334576 248000 8216 R 95,0 1,5 436:14.05 oceanM
18858 leonor 20 0 384296 296992 7624 R 88,4 1,8 436:12.84 oceanM
However, I believe the above represent CPU totals across the system, i.e. a sum of each CPU's usage. I would therefore expect the maximum to be potentially 800%. This is confirmed by toggling the calculation such that it is shown as the overall percentage in use. This in turn shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18853 leonor 20 0 328188 240992 9748 R 12,4 1,5 439:39.78 oceanM
18854 leonor 20 0 334576 248000 8216 R 12,4 1,5 439:27.22 oceanM
18856 leonor 20 0 384296 297272 8080 R 12,4 1,8 439:12.60 oceanM
18859 leonor 20 0 337452 248204 8660 R 12,2 1,5 439:19.38 oceanM
18858 leonor 20 0 384296 296924 7624 R 11,8 1,8 439:25.76 oceanM
18860 leonor 20 0 329680 240732 6660 R 11,7 1,5 439:42.28 oceanM
18857 leonor 20 0 335136 247100 9132 R 11,6 1,5 439:11.46 oceanM
18855 leonor 20 0 334856 249036 9412 R 11,3 1,5 439:15.54 oceanM
It therefore appears we are currently only utilising around 1/8 of our system CPU capacity. The above stats also suggest we aren’t memory-bound.
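As a cross-check on the arithmetic, here is a small sketch that adds up the %CPU columns from the two listings above. It assumes the second listing comes from top with its Irix mode toggled off, in which case each task's figure is divided by the number of CPUs. Under that assumption each of the eight processes shows its own share, and the machine-wide figure is the sum over all oceanM processes. The function name and the flag are hypothetical.

```python
# %CPU values copied from the two top listings (decimal commas written as points).
PER_CPU_VIEW = [99.0, 98.7, 98.7, 96.7, 96.0, 96.0, 95.0, 88.4]   # max 100 per process
OVERALL_VIEW = [12.4, 12.4, 12.4, 12.2, 11.8, 11.7, 11.6, 11.3]   # assumed already divided by 8
N_CPUS = 8

def machine_utilisation(per_process, n_cpus, already_scaled):
    """Return the percentage of the whole machine used by all listed processes.

    already_scaled: True if each value is already a share of the whole machine
                    (assumption for the second listing), False if each value is
                    a share of a single CPU.
    """
    total = sum(per_process)
    return total if already_scaled else total / n_cpus

if __name__ == "__main__":
    print(f"first listing:  {machine_utilisation(PER_CPU_VIEW, N_CPUS, False):.1f}% of the machine")
    print(f"second listing: {machine_utilisation(OVERALL_VIEW, N_CPUS, True):.1f}% of the machine")
```

If the assumption holds, both listings describe the same load, and the roughly 12% figure is one process's share, not the total for the run.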
A few questions:
1. Is there any way to make it run at a higher capacity of the CPUs? Is there anything I should have defined/allocated beforehand?
2. Each node appears to request around 300MB of virtual memory, utilising around 80% of this. Can this be increased? Any benefit of doing so?
Thanks very much,
Leonor
Optimise performance
Re: Optimise performance
I think more interesting is the report you can get out of ROMS directly. The first listing below is for a two-grid problem, and each grid reports separately.
Well, maybe that's a bad example. It is spending all the time communicating. The second listing below is one that's "more normal".
Now, maybe you aren't getting this report. The ROMS trunk code differs from mine by (at least) the diff at the end of this post.
For some reason, the trunk code is set to only give this report when you run with two processes.
Code:
Nonlinear ocean model elapsed time profile:
Allocation and array initialization .............. 112.384 ( 0.0017 %)
Ocean state initialization ....................... 1114.664 ( 0.0170 %)
Reading of input data ............................ 12428.550 ( 0.1894 %)
Processing of input data ......................... 2187.741 ( 0.0333 %)
Processing of output time averaged data .......... 9488.126 ( 0.1446 %)
Computation of vertical boundary conditions ...... 112.246 ( 0.0017 %)
Computation of global information integrals ...... 2999.621 ( 0.0457 %)
Writing of output data ........................... 13194.447 ( 0.2010 %)
Model 2D kernel .................................. 31972.037 ( 0.4872 %)
2D/3D coupling, vertical metrics ................. 6629.348 ( 0.1010 %)
Omega vertical velocity .......................... 3902.836 ( 0.0595 %)
Equation of state for seawater ................... 10266.417 ( 0.1564 %)
Atmosphere-Ocean bulk flux parameterization ...... 1529.677 ( 0.0233 %)
KPP vertical mixing parameterization ............. 61375.483 ( 0.9352 %)
3D equations right-side terms .................... 7992.865 ( 0.1218 %)
3D equations predictor step ...................... 24004.809 ( 0.3658 %)
Pressure gradient ................................ 8034.150 ( 0.1224 %)
Harmonic mixing of tracers, geopotentials ........ 8970.825 ( 0.1367 %)
Biharmonic mixing of tracers, geopotentials ...... 935.159 ( 0.0142 %)
Harmonic stress tensor, S-surfaces ............... 4870.250 ( 0.0742 %)
Corrector time-step for 3D momentum .............. 12262.839 ( 0.1868 %)
Corrector time-step for tracers .................. 12897.121 ( 0.1965 %)
Total: 237281.597 3.6154
Nonlinear model message Passage profile:
Message Passage: 2D halo exchanges ............... 21212.639 ( 0.3232 %)
Message Passage: 3D halo exchanges ............... 12521.340 ( 0.1908 %)
Message Passage: 4D halo exchanges ............... 6038.609 ( 0.0920 %)
Message Passage: data broadcast .................. 19180.649 ( 0.2923 %)
Message Passage: data reduction .................. 601.964 ( 0.0092 %)
Message Passage: data gathering .................. 5803.062 ( 0.0884 %)
Message Passage: data scattering.................. 269.526 ( 0.0041 %)
Message Passage: point data gathering ............ 3059755.879 (46.6212 %)
Total: 3125383.668 47.6211
All percentages are with respect to total time = 6563020.214
Nonlinear ocean model elapsed time profile:
Allocation and array initialization .............. 112.384 ( 0.0017 %)
Ocean state initialization ....................... 1105.705 ( 0.0168 %)
Reading of input data ............................ 4683.848 ( 0.0714 %)
Processing of input data ......................... 2304.851 ( 0.0351 %)
Processing of output time averaged data .......... 19134.439 ( 0.2915 %)
Computation of vertical boundary conditions ...... 235.436 ( 0.0036 %)
Computation of global information integrals ...... 6485.987 ( 0.0988 %)
Writing of output data ........................... 9959.015 ( 0.1517 %)
Model 2D kernel .................................. 49022.611 ( 0.7470 %)
2D/3D coupling, vertical metrics ................. 14726.980 ( 0.2244 %)
Omega vertical velocity .......................... 8130.777 ( 0.1239 %)
Equation of state for seawater ................... 22005.527 ( 0.3353 %)
Atmosphere-Ocean bulk flux parameterization ...... 3123.783 ( 0.0476 %)
KPP vertical mixing parameterization ............. 111743.172 ( 1.7026 %)
3D equations right-side terms .................... 14120.397 ( 0.2152 %)
3D equations predictor step ...................... 46904.361 ( 0.7147 %)
Pressure gradient ................................ 15586.708 ( 0.2375 %)
Harmonic mixing of tracers, geopotentials ........ 16389.330 ( 0.2497 %)
Biharmonic mixing of tracers, geopotentials ...... 2293.830 ( 0.0350 %)
Harmonic stress tensor, S-surfaces ............... 7354.066 ( 0.1121 %)
Corrector time-step for 3D momentum .............. 25211.450 ( 0.3841 %)
Corrector time-step for tracers .................. 23637.905 ( 0.3602 %)
Total: 404272.563 6.1599
Nonlinear model message Passage profile:
Message Passage: 2D halo exchanges ............... 27410.131 ( 0.4176 %)
Message Passage: 3D halo exchanges ............... 37022.628 ( 0.5641 %)
Message Passage: 4D halo exchanges ............... 20618.869 ( 0.3142 %)
Message Passage: data broadcast .................. 9122.282 ( 0.1390 %)
Message Passage: data reduction .................. 1420.559 ( 0.0216 %)
Message Passage: data gathering .................. 5002.919 ( 0.0762 %)
Message Passage: data scattering.................. 186.045 ( 0.0028 %)
Message Passage: point data gathering ............ 2627603.386 (40.0365 %)
Total: 2728386.819 41.5721
All percentages are with respect to total time = 6563021.672
Code:
Nonlinear ocean model elapsed time profile:
Allocation and array initialization .............. 60.432 ( 0.0014 %)
Ocean state initialization ....................... 2.421 ( 0.0001 %)
Reading of input data ............................ 69839.815 ( 1.5856 %)
Processing of input data ......................... 41372.578 ( 0.9393 %)
Processing of output time averaged data .......... 12087.382 ( 0.2744 %)
Computation of vertical boundary conditions ...... 28850.006 ( 0.6550 %)
Computation of global information integrals ...... 34279.083 ( 0.7782 %)
Writing of output data ........................... 575421.435 (13.0637 %)
Model 2D kernel .................................. 742172.451 (16.8494 %)
Tidal forcing .................................... 133766.955 ( 3.0369 %)
2D/3D coupling, vertical metrics ................. 88575.049 ( 2.0109 %)
Omega vertical velocity .......................... 102046.674 ( 2.3167 %)
Equation of state for seawater ................... 80365.537 ( 1.8245 %)
Atmosphere-Ocean bulk flux parameterization ...... 22071.669 ( 0.5011 %)
KPP vertical mixing parameterization ............. 947214.453 (21.5044 %)
3D equations right-side terms .................... 83269.923 ( 1.8905 %)
3D equations predictor step ...................... 222261.690 ( 5.0460 %)
Pressure gradient ................................ 52030.417 ( 1.1812 %)
Harmonic stress tensor, S-surfaces ............... 53737.699 ( 1.2200 %)
Corrector time-step for 3D momentum .............. 125120.118 ( 2.8406 %)
Corrector time-step for tracers .................. 122006.579 ( 2.7699 %)
Total: 3536552.365 80.2897
Nonlinear sea-ice model elapsed time profile:
Ice thermodynamics................................ 47892.375 ( 1.0873 %)
Ice rheology coefficients......................... 210882.943 ( 4.7876 %)
Iterative solver of ice dynamics.................. 532130.993 (12.0809 %)
Advection of ice tracers.......................... 8001.462 ( 0.1817 %)
Total: 798907.773 18.1374
Nonlinear model message Passage profile:
Message Passage: 2D halo exchanges ............... 1260343.623 (28.6133 %)
Message Passage: 3D halo exchanges ............... 469817.082 (10.6662 %)
Message Passage: 4D halo exchanges ............... 173284.373 ( 3.9340 %)
Message Passage: data broadcast .................. 370142.718 ( 8.4033 %)
Message Passage: data reduction .................. 42863.424 ( 0.9731 %)
Message Passage: data gathering .................. 176922.035 ( 4.0166 %)
Message Passage: data scattering.................. 110.137 ( 0.0025 %)
Message Passage: boundary data gathering ......... 41652.886 ( 0.9456 %)
Message Passage: point data gathering ............ 16230.007 ( 0.3685 %)
Total: 2551366.286 57.9232
All percentages are with respect to total time = 4404741.805
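To see at a glance where the time goes in a report like the one above, the lines can be parsed and ranked by percentage. This is only a sketch: the regular expression is an assumption based on the layout shown here (label, run of dots, seconds, percentage in parentheses), and `parse_profile` and `top_regions` are hypothetical helper names, not part of ROMS.

```python
import re

# label, dots, elapsed seconds, "(percent %)", as in the report lines above
LINE = re.compile(
    r"^(?P<label>.+?)\s*\.{2,}\s*(?P<seconds>[\d.]+)\s*\(\s*(?P<percent>[\d.]+)\s*%\)$"
)

# A few lines copied from the report above; the "Total:" line is skipped by the parser.
SAMPLE = """\
Writing of output data ........................... 575421.435 (13.0637 %)
Model 2D kernel .................................. 742172.451 (16.8494 %)
KPP vertical mixing parameterization ............. 947214.453 (21.5044 %)
Iterative solver of ice dynamics.................. 532130.993 (12.0809 %)
Message Passage: 2D halo exchanges ............... 1260343.623 (28.6133 %)
Total: 2551366.286 57.9232
"""

def parse_profile(text):
    """Return (label, seconds, percent) for every line that looks like a timing entry."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            rows.append((m["label"], float(m["seconds"]), float(m["percent"])))
    return rows

def top_regions(rows, n=3):
    """Return the n entries with the largest percentage, largest first."""
    return sorted(rows, key=lambda r: r[2], reverse=True)[:n]

if __name__ == "__main__":
    for label, seconds, percent in top_regions(parse_profile(SAMPLE)):
        print(f"{percent:8.4f} %  {label}")
```

The same two functions can be pointed at the full text of a log file instead of `SAMPLE`, provided the report lines keep this layout.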
Code:
--- ../roms/trunk/ROMS/Utility/timers.F 2015-03-26 10:56:13.819755622 -0800
+++ ROMS/Utility/timers.F 2015-06-25 10:04:26.946864845 -0800
@@ -176,7 +176,7 @@
END DO
END IF
END DO
- IF (thread_count.eq.numthreads) THEN
+! IF (thread_count.eq.numthreads) THEN
thread_count=0
#ifdef DISTRIBUTE
op_handle(0:Nregion)='SUM'
@@ -269,7 +302,7 @@
60 FORMAT (/,' All percentages are with respect to total time =',&
& 5x,f12.3)
#endif
- END IF
+! END IF
!$OMP END CRITICAL (FINALIZE_WCLOCK)
END IF
RETURN
Re: Optimise performance
Thanks for your quick reply.
No, in fact I don't have this report. This is very interesting, I will try to adapt this somehow.
I will try with OpenMP and also reduce the printed output, that might make it faster.
But I think my main question is: why is ROMS only using about 12% of the CPU capacity? I am not running anything else on this machine. Should I have predefined or allocated something when installing MPI? Is there a way to allocate more memory for ROMS to use? Or do you think this is just a problem with my machine?
Thank you!
Leonor
Re: Optimise performance
If you download a fresh ROMS-du-jour, it should have the report (better fix than mine). I don't know the answer to the rest.