Further to my earlier email, I have found that there appears to be an array size limitation when using ifort to compile and run ROMS/TOMS 2.1. For example, if I compile and run the UPWELLING model problem, it runs successfully with the default array sizes of Lm=41, Mm=80, N=16 (in mod_param.F). If, however, I increase these parameters to Lm=252, Mm=296, N=30 (the size of one of my ROMS/TOMS applications), I get a "Segmentation fault" error when running the code. It occurs in the pre_step3d_tile routine, in particular at the statement:

Code:
real(r8), dimension(PRIVATE_2D_SCRATCH_ARRAY,0:N(ng)) :: swdk
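Assuming the usual scratch padding of a few halo points, swdk on a single tile at these sizes occupies roughly (Lm+3) × (Mm+3) × (N+1) × 8 bytes ≈ 255 × 299 × 31 × 8 ≈ 19 MB, which by itself is well over the common 8 MiB default stack limit on Linux.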
I did not get such an error with ifc, and I have successfully compiled and run my application with Lm=252, Mm=296, N=30.
Therefore, there appears to be a problem specific to ifort. I compile and run on a Dell machine using Red Hat Linux (Fedora Core 2) and the ifort 8.1 compiler. I am sure that my machine can handle ROMS/TOMS applications this large, if not larger, without any problems.
Does anybody know why this happens, and is there a possible solution? I wonder whether others are able to compile (with ifort) and run the ROMS/TOMS UPWELLING model problem with Lm=252, Mm=296, N=30? Please let me know what you think.
Thank you.
Solutions to the ROMS/TOMS 2.1 array size limitation problem
Dear Lanerolle,
I encountered exactly the same problem. I managed to pinpoint the bug to the following minimal program:
Code:
PROGRAM BUGSEARCH
  ! Passing a large extent makes var1 a large automatic array below.
  CALL gls_corstep_tile(1044574)
CONTAINS
  SUBROUTINE gls_corstep_tile(Jend)
    integer, intent(in) :: Jend
    integer, parameter :: r8 = selected_real_kind(12,300)
    real(r8), dimension(Jend) :: var1   ! automatic array; ifort places it on the stack
    Print *, 'The program has finished'
  END SUBROUTINE gls_corstep_tile
END PROGRAM
The bug happens with ifort 9.0 and ifort 8.1, but not with ifc 7.1. It does not happen if 1044574 is replaced by a smaller value, and it does not happen if Jend is replaced by its literal value 1044574 in the declaration of the variable var1. (Note that 1044574 elements of 8 bytes is just under 8 MiB, so the array essentially fills the default 8 MiB stack by itself.)
Asking on the Intel Fortran compiler forums, I was told that the problem is the stack size, and the proposed solution consists of replacing the line
Code:
real(r8), dimension(Jend) :: var1
with the following pair of lines:

Code:
real(r8), allocatable, dimension(:) :: var1   ! allocated on the heap rather than the stack
allocate (var1(Jend))                         ! automatically deallocated when the subroutine returns
Another solution, which avoids this otherwise huge code rewrite, consists of putting the following in one's .zshrc:
Code:
unlimit stacksize
m.hadfield (NIWA):
As Mathieu has pointed out, Intel Fortran is a heavy user of stack space. Short of re-coding ROMS (which I advise against), there are two things you can do (and I suggest doing both):
- Increase the stack size limit imposed by the shell or OS. Details of how to do this differ between systems, but under the bash shell on Linux the command is "ulimit -S -s 65536". This increases the stack size limit from the default of 8192 KiB (8 MiB) to 65536 KiB (64 MiB). The "-S" switch makes this a soft limit, which can be raised again by subsequent calls to ulimit. Run this command before starting ROMS; once you are happy with it, put it in a startup script.
- Increase the values of the ROMS input variables NtileI and NtileJ. Automatic arrays (the ones declared inside subroutines and allocated memory automatically at run time) tend to have dimensions proportional to the tile size, so reducing the tile size reduces the demand on stack space. It also tends to speed ROMS up. On Intel CPUs, ROMS tends to run fastest when NtileJ > NtileI, i.e. when the tiles are wide; see the sketch below.
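For reference, the tiling parameters are set in the ROMS standard input script (commonly ocean.in); a minimal sketch of the relevant lines, with illustrative values only, is:

Code:
! More, smaller tiles shrink each tile-sized automatic array and so
! reduce the per-thread stack demand (the values here are examples).
NtileI == 2        ! I-direction partition
NtileJ == 8        ! J-direction partition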
Dear Mark and everybody,
The problem, as already pointed out, is related to the stack size limit, and in many cases it can be fixed by increasing or even unlimiting the stack size. This, however, is not as innocent as it might sound.
In older days (ROMS 1.9 and earlier), scratch arrays were pre-allocated in [THREADPRIVATE] common blocks and passed as arguments into the physical routines. This eliminates the use of automatic arrays completely and rules out situations like this one. It also saves some time, because allocation and deallocation take time and, depending on the compiler and operating system, may cause noticeable performance degradation.
In the 2.x codes this mechanism was abandoned, for reasons unknown.
Starting with version 8.0, the Intel compilers use a different mechanism for handling the allocation of automatic arrays, which gives better performance but also runs into limitations. This cannot be illustrated with the 2.x codes without a significant rewrite, but it is easy to show with the 1.9 codes, where the scratch arrays are passed as arguments. Passing them is mainly an optimization and is not necessary from a mathematical point of view: in almost all cases you can comment out these arguments and the model produces exactly the same result, except that the arrays are now automatic. If you use the 7.1 Intel compiler on a 2.4.x Linux kernel, you may observe the code running 30% slower than when the arrays were passed as arguments. If you switch to the 8.1 compiler, the performance degradation is much smaller, down to not noticeable at all.
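To make the contrast concrete, here is a minimal sketch (hypothetical code, not taken from ROMS) of the two styles: a scratch array pre-allocated in a THREADPRIVATE common block and passed in as an argument, versus the same workspace declared as an automatic array:

Code:
! Hypothetical sketch of the two scratch-array styles (not ROMS source).

subroutine step_passed (N, wrk)
  integer, intent(in) :: N
  real, intent(inout) :: wrk(N)   ! 1.9 style: workspace supplied by the caller
  wrk = 0.0
end subroutine step_passed

subroutine step_automatic (N)
  integer, intent(in) :: N
  real :: wrk(N)                  ! 2.x style: automatic array, placed on the stack
  wrk = 0.0                       ! a large N can overflow the stack here
end subroutine step_automatic

program scratch_demo
  integer, parameter :: N = 3000000      ! 12 MB of workspace
  real :: scratch(N)
  common /scratch_blk/ scratch           ! pre-allocated once, not on the stack
!$OMP THREADPRIVATE(/scratch_blk/)
  call step_passed (N, scratch)          ! safe regardless of the stack limit
  call step_automatic (N)                ! may crash under an 8 MiB default stack
end program scratch_demo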