version 340 stop before reading in surface forcing
Hello, ROMS Modelers,
We are developing an operational forecast system using ROMS in our environment.
I tried to run two different versions of ROMS on the same supercomputer, an IBM Power 6 (AIX OS), with exactly the same CPP options and the same forcing files. The two versions are:
svn: $LastChangedRevision: 90 $
svn: $LastChangedDate: 2007-07-30 13:36:53 -0400 (Mon, 30 Jul 2007) $
svn: $Id: Version 90 2007-07-30 17:36:53Z arango $
svn: $LastChangedRevision: 340 $
svn: $LastChangedDate: 2009-03-26 01:57:53 +0000 (Thu, 26 Mar 2009) $
svn: $Id: Version 340 2009-03-26 01:57:53Z arango $
The older version runs well, but the new version stops somewhere. There is no error message in the ROMS standard output file (see attachment TBOFS_nowcast_run_version340.out), but there is an error message "Segmentation fault" (see attachment TBOFS_nowcast_run_version340.err).
From the ROMS output, I think the river forcing is read in successfully, but the message about reading the surface forcing did not print out. Any guidance is greatly appreciated.
Re: version 340 stop before reading in surface forcing
Can't seem to find your attachment log files.
Re: version 340 stop before reading in surface forcing
I cannot attach my files; it says "the extension err is not allowed" when I do the following:
1. browse the file I tried to attach
2. click on add the file
3. submit
I used to be able to do it, but I forget how now since it's been a while since I last used it.
Thanks.
Re: version 340 stop before reading in surface forcing
Make the extension txt or log
Re: version 340 stop before reading in surface forcing
Do you define the river forcing with a netCDF file or are you using the ANA_PSOURCE routine?
Re: version 340 stop before reading in surface forcing
Yes. We are developing operational forecast systems (OFS), and all OFS have river forcing, tidal forcing, open boundary forcing, and atmospheric forcing from NetCDF files. For the same forcing files, the older version (v90) of ROMS runs well, but the newer version (v340) stops after reading the river forcing. The ROMS log files are attached.
Thanks for any comment.
AJ
Attachments:
- TBOFS_nowcast_run_version340.out.log (46.35 KiB)
- TBOFS_nowcast_run_version340.err.log (3.61 KiB)
- TBOFS_nowcast_run_version90.out.log (397.6 KiB)
- TBOFS_nowcast_run_version90.err.log (142 Bytes)
arango (Site Admin, DMCS, Rutgers University)
Re: version 340 stop before reading in surface forcing
The error message indicates that the master thread (task 0) had a segmentation fault during a parallel communication, which terminated all nodes. Your model didn't die while reading the river forcing but while reading the u-wind component from file TBOFS_met_nowcast_2009072812.nc. The fact that we get all the statistics for reading river runoff salinity tells you that ROMS finished reading the rivers correctly.
Newer versions of ROMS are very strict about CF compliance in input NetCDF files. We were not that strict in version 90, which is kind of dangerous because incorrect assumptions may be made. I would check that wind NetCDF file for NaNs and CF compliance. Check the following post for details.
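Something along these lines can do that check quickly (a rough sketch using Python's netCDF4 library; the file name comes from this thread, everything else is illustrative and not part of ROMS):

Code:
from netCDF4 import Dataset
import numpy as np

# Quick scan of a forcing file: flag NaNs and print the time units so they
# can be compared against the CF form "days since YYYY-MM-DD hh:mm:ss".
ds = Dataset("TBOFS_met_nowcast_2009072812.nc")

for name, var in ds.variables.items():
    data = var[:]
    if data.dtype.kind == "f" and np.isnan(np.asarray(data)).any():
        print(f"{name}: contains NaNs")
    if "time" in name and "units" in var.ncattrs():
        print(f"{name}: units = {var.units!r}")

ds.close()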
Now, your application is very small, 174x288x11, and you are running on 96 nodes with a 6x16 partition. This is complete overkill and not a wise use of computer resources. The overhead in parallel communication is such that I bet this application runs much faster with far fewer parallel nodes. Personally, I would not run this application on more than 16 nodes.
If this is an operational application, I would spend some time testing the optimal number of nodes and which parallel partition is fastest. Always remember that there is an optimal combination of nodes and partition for efficient computations in ROMS. Also, if you have too many nodes you are at greater risk that one of them will die or misbehave, and this may kill your run.
I have noticed this in the past: users that have a lot of computer nodes available tend to use them all on very small applications. I would use 96 nodes or more when the number of grid points is over one thousand.
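As a rough illustration of why a 6x16 partition hurts at this grid size (back-of-the-envelope Python only; the 174x288 interior size is from this thread, the 2-point halo width is just an assumption for the sketch, not the actual ROMS cost model):

Code:
# Estimate how much of each tile is halo exchange versus interior work.
Lm, Mm = 174, 288      # interior grid points (from this thread)
halo = 2               # assumed ghost-point width, for illustration only

for ni, nj in [(1, 1), (2, 2), (2, 4), (4, 4), (6, 16)]:
    ix, jy = Lm / ni, Mm / nj                   # interior points per tile
    interior = ix * jy
    padded = (ix + 2 * halo) * (jy + 2 * halo)  # interior plus halo points
    print(f"{ni}x{nj} tiles: ~{ix:.0f} x {jy:.0f} interior, "
          f"halo overhead ~{100 * (padded - interior) / interior:.0f}%")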
Re: version 340 stop before reading in surface forcing
Hernan,
Thanks a lot for your comments. There is no error message written directly by ROMS except the "task 0: Segmentation fault" message in the IBM system runtime error file, so I do not know what's wrong with my surface forcing NetCDF file. The surface forcing file is attached; could you please take a look to see whether there is any problem with it? Again, it works with version 90. I would like to know what the difference is between the two versions.
Since we are developing operational forecast systems on the IBM supercomputer at NCEP/NOAA, all forcing files are created automatically for every forecast cycle. Therefore, the forcing files created by the preprocessing programs should work for the current and later versions of ROMS if the forecast system is updated.
In terms of the nodes we are using, 96 processors might not be optimal for our applications. We have not spent much time on benchmark runs so far since we are still in the development phase. We definitely need to conduct some test runs to find the optimal number of nodes for each of our operational forecast systems.
Thanks again for your help
AJ
Attachments:
- TBOFS_met_nowcast_2009072812.nc.txt (17.72 MiB)
Re: version 340 stop before reading in surface forcing
Hernan,
I compared my atmospheric forcing file with frc_bulk.cdl in the Data/ROMS/CDL directory. I found that there are additional attributes (missing_value, field) on each variable in my forcing file. Could this have caused the problem?
Thanks
AJ
arango (Site Admin, DMCS, Rutgers University)
Re: version 340 stop before reading in surface forcing
I checked your NetCDF file and noticed that you have an interpolation problem in the bottom left corner in variables Uwind and Vwind. However, this is not your problem. It is always a good idea to plot the fields if there is a problem reading input NetCDF files. This should be the first step in diagnosing and debugging a problem.
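A quick plot is usually enough to spot that kind of corner problem (a minimal sketch with Python's netCDF4 and matplotlib; the variable names Uwind/Vwind and the file name come from this thread, the time record index is arbitrary):

Code:
from netCDF4 import Dataset
import matplotlib.pyplot as plt

# Plot the first time record of both wind components to eyeball bad values,
# e.g. a suspect interpolation in the bottom-left corner.
ds = Dataset("TBOFS_met_nowcast_2009072812.nc")
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, name in zip(axes, ("Uwind", "Vwind")):
    pm = ax.pcolormesh(ds.variables[name][0, :, :])
    ax.set_title(name)
    fig.colorbar(pm, ax=ax)
fig.savefig("wind_check.png")
ds.close()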
The only thing that I noticed is your units attribute for frc_time:

    frc_time:units = "days since 2009 1 1 0" ;

The correct specification of the standard is:

    frc_time:units = "days since 2009-01-01 00:00:00" ;

You have a missing value of -99999. Since this file is already on the ROMS grid, it must not have data with this missing value. ROMS only checks for the _FillValue attribute and sets those values to zero during reading. The missing_value attribute is problematic because it allows a data type different from the variable's data type. It is no longer recommended, and this attribute will become deprecated in the future.
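If one wanted to bring the file in line with the above (a hedged sketch using Python's netCDF4; this is just one way to edit the attributes, not something ROMS requires beyond what is described here):

Code:
from netCDF4 import Dataset

# Rewrite the time units string in CF form and drop the problematic
# missing_value attribute, keeping only _FillValue (which ROMS checks).
ds = Dataset("TBOFS_met_nowcast_2009072812.nc", "r+")

ds.variables["frc_time"].setncattr("units", "days since 2009-01-01 00:00:00")

for var in ds.variables.values():
    if "missing_value" in var.ncattrs():
        var.delncattr("missing_value")

ds.close()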
You mention that 96 processors might not be optimal and that you are still in the development phase. Well, this is a horrible starting choice. If you are still in development mode you should start simple: this application is so small that you can first run it serially with a 1x1 partition. That will take MPI communications out of the problem and allow you to focus on and analyze your problem.

I noticed that you only have forcing data for days 207.125 to 208.625, so you are running for just 1.5 days. I can run this grid size and time period on my laptop; I would never run it on 96 processors. This absolutely does not make sense. I bet the amount of overhead is such that a simple desktop with 4 CPUs will run this grid and time window faster.

Use a large number of CPUs for large applications and/or long time periods. I recommend that you first run serially and then try 2x2 and 2x4 partitions.
Re: version 340 stop before reading in surface forcing
Hernan, I agree with you about using computer resources efficiently. However, operational forecasting is different from research: we normally require that a nowcast/forecast cycle be completed as soon as possible. A cycle includes (1) pre-processing (system checks, grabbing and processing all real-time observational data, and generating model forcing files); (2) the model simulation; and (3) post-processing (model output, archives, graphics, and the web site). On the other hand, we have to develop our OFS on NCEP's IBM computer, and normally I develop our OFS in a semi-operational environment.

I modified my atmospheric forcing file as Hernan suggested. However, the same problem remains. I added some print statements to ROMS and found that the problem occurs in subroutine "get_cycle", called by "get_2dfld", while reading Uwind from the met forcing file TBOFS_met_nowcast_2009080418.nc (see attachment). It looks like there is a problem finding the time variable in the met file. Any idea about this? I searched the ROMS forum and found that there are some issues regarding get_cycle while reading the river forcing file.
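One way to see exactly what is attached to that time variable, and whether anything unexpected stands out, is to dump its dimensions and attributes (a small sketch with Python's netCDF4; the file and variable names are taken from this thread):

Code:
from netCDF4 import Dataset

# List the dimensions and every attribute of the forcing-time variable.
ds = Dataset("TBOFS_met_nowcast_2009080418.nc")
t = ds.variables["frc_time"]

print("dimensions:", t.dimensions, "shape:", t.shape)
for att in t.ncattrs():
    value = t.getncattr(att)
    print(f"  {att} ({type(value).__name__}): {value!r}")

ds.close()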
Thanks so much for your help.
Attachments:
- TBOFS_met_nowcast_2009080418.nc.txt (12.27 MiB)
arango (Site Admin, DMCS, Rutgers University)
Re: version 340 stop before reading in surface forcing
I gave you several suggestions to track down and solve the problem you are having, but you refuse to follow them. You don't seem to understand the steps one needs to take to fix this kind of problem. First, we need to run it serially to make sure that it runs; if it doesn't run serially, it will fail in parallel. I give up. There is nothing we can do to help you since you are not cooperating.
Many users are running this version and you are the only one having problems with it.
Re: version 340 stop before reading in surface forcing
Hernan, I understand what you are talking about. The fact that I did not post on this site doesn't mean I am not doing it, just as the fact that nobody reports a bug doesn't mean there is no bug in ROMS. I tried to solve my problem more quickly by posting it on the ROMS forum, since I thought some ROMS experts could easily tell me what's wrong with my forcing files in terms of ROMS version changes.

As you suggested, I ran ROMS serially and added many print statements. Finally I found where ROMS stopped, and I figured out what causes this problem for the newer version of ROMS.

For the time variables, such as "frc_time" in the met forcing file and "zeta_time" in the open boundary forcing file, I had put an attribute called "base_date" on some of them, and there is an error in subroutine "netcdf_inq_var" while processing this kind of time variable.

I took the "base_date" attribute out of the time variables in all forcing files, and now ROMS runs well. But I do not know why this attribute causes an error for the newer version of ROMS while it works with version 90.
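For anyone hitting the same thing, the workaround described above can be scripted (a sketch using Python's netCDF4; the file list here is only an example, adapt it to your own forcing files):

Code:
from netCDF4 import Dataset

# Strip the "base_date" attribute from the time variables (e.g. frc_time,
# zeta_time) so the newer netcdf_inq_var no longer errors on it.
forcing_files = ["TBOFS_met_nowcast_2009080418.nc"]   # replace with your files
time_vars = ["frc_time", "zeta_time"]

for path in forcing_files:
    ds = Dataset(path, "r+")
    for name in time_vars:
        var = ds.variables.get(name)
        if var is not None and "base_date" in var.ncattrs():
            var.delncattr("base_date")
    ds.close()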
Thanks