Optimise performance

General scientific issues regarding ROMS

leonor
Posts: 2
Joined: Mon Apr 20, 2015 1:20 pm
Location: CENTEC

#1 Post by leonor

Hello,
I just started using ROMS. I have been reading the forum and found this thread, viewtopic.php?f=17&t=2001, very helpful, but there are still some issues I don't understand.
I’m looking to optimise performance and would appreciate some insight into the following:

The configuration requires you to specify a number of nodes through the parameters NtileI and NtileJ. Each of these appears to start a separate process in memory. Currently we are running as many nodes as CPUs on the box: 8. I am using MPI (although I know that, because I'm using only one machine, OpenMP could be faster). My grid size is 260x440x30 and the tiling is 1x8 (I have tried different tilings such as 2x4, 2x16 and 1x32, as well as higher numbers, but the combination that seems to work best is 1x8). I am using an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz.
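
To make the tilings mentioned above easier to compare, here is a minimal Python sketch that prints the tile size and a rough communication proxy for each NtileI x NtileJ choice. It assumes that 260 is the I dimension and 440 the J dimension (the post does not say which is which). The helper names `tile_shape` and `interface_length` are hypothetical and are not part of ROMS.

```python
import math

# Horizontal grid size from the post. Which dimension is I and which is J
# is an assumption made for this sketch (260 in I, 440 in J).
NX, NY = 260, 440


def tile_shape(nx, ny, ntile_i, ntile_j):
    """Largest tile size (points in I, points in J) for a given tiling."""
    return math.ceil(nx / ntile_i), math.ceil(ny / ntile_j)


def interface_length(nx, ny, ntile_i, ntile_j):
    """Total length, in grid points, of the internal tile boundaries.

    A rough proxy for halo-exchange volume per vertical level. It ignores
    halo width, periodic boundaries and the actual ROMS exchange pattern.
    """
    return (ntile_i - 1) * ny + (ntile_j - 1) * nx


for ntile_i, ntile_j in [(1, 8), (2, 4), (4, 2), (8, 1), (2, 16), (1, 32)]:
    ti, tj = tile_shape(NX, NY, ntile_i, ntile_j)
    print(f"{ntile_i}x{ntile_j}: {ntile_i * ntile_j:2d} processes, "
          f"tile <= {ti}x{tj}, internal boundary = "
          f"{interface_length(NX, NY, ntile_i, ntile_j)} points")
```

Note that the 2x16 and 1x32 tilings need 32 processes, which is more than the 8 CPUs mentioned in the post. The boundary length alone does not predict which tiling runs fastest, since inner-loop length and cache behaviour also matter.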

Whilst running the application, we see top reports the following:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18853 leonor 20 0 328188 240988 9744 R 99,0 1,5 436:28.01 oceanM
18855 leonor 20 0 334856 249036 9412 R 98,7 1,5 436:02.14 oceanM
18860 leonor 20 0 329680 240732 6660 R 98,7 1,5 436:29.05 oceanM
18856 leonor 20 0 384296 297268 8076 R 96,7 1,8 436:01.04 oceanM
18857 leonor 20 0 335136 247100 9132 R 96,0 1,5 435:59.11 oceanM
18859 leonor 20 0 337452 248204 8660 R 96,0 1,5 436:05.72 oceanM
18854 leonor 20 0 334576 248000 8216 R 95,0 1,5 436:14.05 oceanM
18858 leonor 20 0 384296 296992 7624 R 88,4 1,8 436:12.84 oceanM

However, I believe the above represents CPU totals across the system, i.e. a sum of each CPU's usage. I would therefore expect the maximum to be potentially 800%. This is confirmed by toggling the calculation so that it is shown as the overall percentage in use. This in turn shows:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18853 leonor 20 0 328188 240992 9748 R 12,4 1,5 439:39.78 oceanM
18854 leonor 20 0 334576 248000 8216 R 12,4 1,5 439:27.22 oceanM
18856 leonor 20 0 384296 297272 8080 R 12,4 1,8 439:12.60 oceanM
18859 leonor 20 0 337452 248204 8660 R 12,2 1,5 439:19.38 oceanM
18858 leonor 20 0 384296 296924 7624 R 11,8 1,8 439:25.76 oceanM
18860 leonor 20 0 329680 240732 6660 R 11,7 1,5 439:42.28 oceanM
18857 leonor 20 0 335136 247100 9132 R 11,6 1,5 439:11.46 oceanM
18855 leonor 20 0 334856 249036 9412 R 11,3 1,5 439:15.54 oceanM

It therefore appears we are currently only utilising around 1/8 of our system CPU capacity. The above stats also suggest we aren’t memory bound.
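
As a quick arithmetic check on the two listings, the following Python sketch adds up the %CPU columns. It assumes the toggle used was top's Irix/Solaris mode switch, where the second mode divides each task's CPU usage by the total number of CPUs. The post does not name the toggle, so this is an assumption. The function names are hypothetical.

```python
# %CPU columns copied from the two top listings in the post
# (decimal commas converted to decimal points).
per_cpu_mode = [99.0, 98.7, 98.7, 96.7, 96.0, 96.0, 95.0, 88.4]
whole_system_mode = [12.4, 12.4, 12.4, 12.2, 11.8, 11.7, 11.6, 11.3]

N_CPUS = 8  # number of CPUs stated in the post


def total_from_per_cpu(values, n_cpus):
    """Each value is a share of ONE CPU, so average over all CPUs."""
    return sum(values) / n_cpus


def total_from_whole_system(values):
    """Each value is already a share of the WHOLE machine, so add them."""
    return sum(values)


first = total_from_per_cpu(per_cpu_mode, N_CPUS)
second = total_from_whole_system(whole_system_mode)
print(f"first listing : {first:.1f}% of the machine")
print(f"second listing: {second:.1f}% of the machine")
```

Under that assumption both listings come out at roughly 96% of the machine, so the eight oceanM processes together would already be using nearly all of the CPU capacity, with about 12% contributed by each process.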

A few questions:

1. Is there any way to make it run at a higher share of the CPUs' capacity? Is there anything I should have defined or allocated beforehand?
2. Each node appears to request around 300MB of virtual memory, utilising around 80% of this. Can this be increased? Is there any benefit in doing so?

Thanks very much,
Leonor

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Optimise performance

#2 Post by kate

I think more interesting is the report you can get out of ROMS directly. This is for a two-grid problem and each grid reports separately:

Code:

 Nonlinear ocean model elapsed time profile:

  Allocation and array initialization ..............       112.384  ( 0.0017 %)
  Ocean state initialization .......................      1114.664  ( 0.0170 %)
  Reading of input data ............................     12428.550  ( 0.1894 %)
  Processing of input data .........................      2187.741  ( 0.0333 %)
  Processing of output time averaged data ..........      9488.126  ( 0.1446 %)
  Computation of vertical boundary conditions ......       112.246  ( 0.0017 %)
  Computation of global information integrals ......      2999.621  ( 0.0457 %)
  Writing of output data ...........................     13194.447  ( 0.2010 %)
  Model 2D kernel ..................................     31972.037  ( 0.4872 %)
  2D/3D coupling, vertical metrics .................      6629.348  ( 0.1010 %)
  Omega vertical velocity ..........................      3902.836  ( 0.0595 %)
  Equation of state for seawater ...................     10266.417  ( 0.1564 %)
  Atmosphere-Ocean bulk flux parameterization ......      1529.677  ( 0.0233 %)
  KPP vertical mixing parameterization .............     61375.483  ( 0.9352 %)
  3D equations right-side terms ....................      7992.865  ( 0.1218 %)
  3D equations predictor step ......................     24004.809  ( 0.3658 %)
  Pressure gradient ................................      8034.150  ( 0.1224 %)
  Harmonic mixing of tracers, geopotentials ........      8970.825  ( 0.1367 %)
  Biharmonic mixing of tracers, geopotentials ......       935.159  ( 0.0142 %)
  Harmonic stress tensor, S-surfaces ...............      4870.250  ( 0.0742 %)
  Corrector time-step for 3D momentum ..............     12262.839  ( 0.1868 %)
  Corrector time-step for tracers ..................     12897.121  ( 0.1965 %)
                                              Total:    237281.597    3.6154
 Nonlinear model message Passage profile:

  Message Passage: 2D halo exchanges ...............     21212.639  ( 0.3232 %)
  Message Passage: 3D halo exchanges ...............     12521.340  ( 0.1908 %)
  Message Passage: 4D halo exchanges ...............      6038.609  ( 0.0920 %)
  Message Passage: data broadcast ..................     19180.649  ( 0.2923 %)
  Message Passage: data reduction ..................       601.964  ( 0.0092 %)
  Message Passage: data gathering ..................      5803.062  ( 0.0884 %)
  Message Passage: data scattering..................       269.526  ( 0.0041 %)
  Message Passage: point data gathering ............   3059755.879  (46.6212 %)
                                              Total:   3125383.668   47.6211

 All percentages are with respect to total time =      6563020.214

 Nonlinear ocean model elapsed time profile:

  Allocation and array initialization ..............       112.384  ( 0.0017 %)
  Ocean state initialization .......................      1105.705  ( 0.0168 %)
  Reading of input data ............................      4683.848  ( 0.0714 %)
  Processing of input data .........................      2304.851  ( 0.0351 %)
  Processing of output time averaged data ..........     19134.439  ( 0.2915 %)
  Computation of vertical boundary conditions ......       235.436  ( 0.0036 %)
  Computation of global information integrals ......      6485.987  ( 0.0988 %)
  Writing of output data ...........................      9959.015  ( 0.1517 %)
  Model 2D kernel ..................................     49022.611  ( 0.7470 %)
  2D/3D coupling, vertical metrics .................     14726.980  ( 0.2244 %)
  Omega vertical velocity ..........................      8130.777  ( 0.1239 %)
  Equation of state for seawater ...................     22005.527  ( 0.3353 %)
  Atmosphere-Ocean bulk flux parameterization ......      3123.783  ( 0.0476 %)
  KPP vertical mixing parameterization .............    111743.172  ( 1.7026 %)
  3D equations right-side terms ....................     14120.397  ( 0.2152 %)
  3D equations predictor step ......................     46904.361  ( 0.7147 %)
  Pressure gradient ................................     15586.708  ( 0.2375 %)
  Harmonic mixing of tracers, geopotentials ........     16389.330  ( 0.2497 %)
  Biharmonic mixing of tracers, geopotentials ......      2293.830  ( 0.0350 %)
  Harmonic stress tensor, S-surfaces ...............      7354.066  ( 0.1121 %)
  Corrector time-step for 3D momentum ..............     25211.450  ( 0.3841 %)
  Corrector time-step for tracers ..................     23637.905  ( 0.3602 %)
                                              Total:    404272.563    6.1599

 Nonlinear model message Passage profile:

  Message Passage: 2D halo exchanges ...............     27410.131  ( 0.4176 %)
  Message Passage: 3D halo exchanges ...............     37022.628  ( 0.5641 %)
  Message Passage: 4D halo exchanges ...............     20618.869  ( 0.3142 %)
  Message Passage: data broadcast ..................      9122.282  ( 0.1390 %)
  Message Passage: data reduction ..................      1420.559  ( 0.0216 %)
  Message Passage: data gathering ..................      5002.919  ( 0.0762 %)
  Message Passage: data scattering..................       186.045  ( 0.0028 %)
  Message Passage: point data gathering ............   2627603.386  (40.0365 %)
                                              Total:   2728386.819   41.5721

 All percentages are with respect to total time =      6563021.672

Well, maybe that's a bad example. It is spending all the time communicating. Here's one that's "more normal":

Code:

 Nonlinear ocean model elapsed time profile:

  Allocation and array initialization ..............        60.432  ( 0.0014 %)
  Ocean state initialization .......................         2.421  ( 0.0001 %)
  Reading of input data ............................     69839.815  ( 1.5856 %)
  Processing of input data .........................     41372.578  ( 0.9393 %)
  Processing of output time averaged data ..........     12087.382  ( 0.2744 %)
  Computation of vertical boundary conditions ......     28850.006  ( 0.6550 %)
  Computation of global information integrals ......     34279.083  ( 0.7782 %)
  Writing of output data ...........................    575421.435  (13.0637 %)
  Model 2D kernel ..................................    742172.451  (16.8494 %)
  Tidal forcing ....................................    133766.955  ( 3.0369 %)
  2D/3D coupling, vertical metrics .................     88575.049  ( 2.0109 %)
  Omega vertical velocity ..........................    102046.674  ( 2.3167 %)
  Equation of state for seawater ...................     80365.537  ( 1.8245 %)
  Atmosphere-Ocean bulk flux parameterization ......     22071.669  ( 0.5011 %)
  KPP vertical mixing parameterization .............    947214.453  (21.5044 %)
  3D equations right-side terms ....................     83269.923  ( 1.8905 %)
  3D equations predictor step ......................    222261.690  ( 5.0460 %)
  Pressure gradient ................................     52030.417  ( 1.1812 %)
  Harmonic stress tensor, S-surfaces ...............     53737.699  ( 1.2200 %)
  Corrector time-step for 3D momentum ..............    125120.118  ( 2.8406 %)
  Corrector time-step for tracers ..................    122006.579  ( 2.7699 %)
                                              Total:   3536552.365   80.2897

 Nonlinear sea-ice model elapsed time profile:

  Ice thermodynamics................................     47892.375  ( 1.0873 %)
  Ice rheology coefficients.........................    210882.943  ( 4.7876 %)
  Iterative solver of ice dynamics..................    532130.993  (12.0809 %)
  Advection of ice tracers..........................      8001.462  ( 0.1817 %)
                                              Total:    798907.773   18.1374

 Nonlinear model message Passage profile:

  Message Passage: 2D halo exchanges ...............   1260343.623  (28.6133 %)
  Message Passage: 3D halo exchanges ...............    469817.082  (10.6662 %)
  Message Passage: 4D halo exchanges ...............    173284.373  ( 3.9340 %)
  Message Passage: data broadcast ..................    370142.718  ( 8.4033 %)
  Message Passage: data reduction ..................     42863.424  ( 0.9731 %)
  Message Passage: data gathering ..................    176922.035  ( 4.0166 %)
  Message Passage: data scattering..................       110.137  ( 0.0025 %)
  Message Passage: boundary data gathering .........     41652.886  ( 0.9456 %)
  Message Passage: point data gathering ............     16230.007  ( 0.3685 %)
                                              Total:   2551366.286   57.9232

 All percentages are with respect to total time = 4404741.805

Now, maybe you aren't getting this report. The ROMS trunk code differs from mine by (at least):

Code:

--- ../roms/trunk/ROMS/Utility/timers.F	2015-03-26 10:56:13.819755622 -0800
+++ ROMS/Utility/timers.F	2015-06-25 10:04:26.946864845 -0800
@@ -176,7 +176,7 @@
             END DO
           END IF
         END DO
-        IF (thread_count.eq.numthreads) THEN
+!        IF (thread_count.eq.numthreads) THEN
           thread_count=0
 #ifdef DISTRIBUTE
           op_handle(0:Nregion)='SUM'
@@ -269,7 +302,7 @@
   60      FORMAT (/,' All percentages are with respect to total time =',&
      &            5x,f12.3)
 #endif
-        END IF
+!        END IF
 !$OMP END CRITICAL (FINALIZE_WCLOCK)
       END IF
       RETURN
For some reason, the trunk code is set to only give this report when you run with two processes.
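
For anyone who wants to rank the entries in a profile report like the ones above, here is a minimal Python sketch that parses the report lines and sorts them by percentage. The regular expression is inferred from the lines shown in this thread only and may need adjusting for other ROMS versions. The function names `parse_profile` and `top_regions` are hypothetical.

```python
import re

# Matches report lines such as
#   "  Model 2D kernel ........     31972.037  ( 0.4872 %)"
# and skips headers and the "Total:" lines.
LINE = re.compile(r"^\s*(.+?)\s*\.{2,}\s*([\d.]+)\s*\(\s*([\d.]+)\s*%\)\s*$")


def parse_profile(text):
    """Return a list of (label, seconds, percent) tuples from a report."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            rows.append((m.group(1), float(m.group(2)), float(m.group(3))))
    return rows


def top_regions(text, n=3):
    """Return the n entries with the largest share of total time."""
    return sorted(parse_profile(text), key=lambda r: r[2], reverse=True)[:n]


# A few lines copied from the first report in this thread.
sample = """\
  Model 2D kernel ..................................     31972.037  ( 0.4872 %)
  KPP vertical mixing parameterization .............     61375.483  ( 0.9352 %)
  Message Passage: data scattering..................       269.526  ( 0.0041 %)
  Message Passage: point data gathering ............   3059755.879  (46.6212 %)
                                              Total:   3125383.668   47.6211
"""

for label, seconds, percent in top_regions(sample):
    print(f"{percent:8.4f} %  {seconds:14.3f}  {label}")
```

On the sample lines this puts "Message Passage: point data gathering" first, which matches the remark that the first run spends its time communicating.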

leonor
Posts: 2
Joined: Mon Apr 20, 2015 1:20 pm
Location: CENTEC

Re: Optimise performance

#3 Post by leonor

Thanks for your quick reply.

No, in fact I don't have this report. This is very interesting, I will try to adapt this somehow.
I will try with OpenMP and also reduce the printed output, that might make it faster.

But I think my main question is: why is ROMS only using about 12% of the CPU capacity? I am not running anything else on this machine. Should I have predefined or allocated something when installing MPI? Is there a way to allocate more memory for ROMS to use? Or do you think this is just a problem with my machine?

Thank you!
Leonor

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Optimise performance

#4 Post by kate

If you download a fresh ROMS-du-jour, it should have the report (better fix than mine). I don't know the answer to the rest.
