[Ncep.list.fv3-announce] Upcoming NEMS Commit

Samuel Trahan - NOAA Affiliate samuel.trahan at noaa.gov
Mon Feb 11 15:43:39 UTC 2019


Hello all,

The aforementioned commit will happen between about noon and 1 PM Eastern
time today.

As a reminder, this commit will only affect you if:

1. You are running the multi-machine, multi-app regression test system
that we use for huge NEMS commits and nightly tests, or

2. You are annoyed by the spurious error messages from the "make clean" of
FMS in NEMSfv3gfs's compile.sh.

Sincerely,
Sam Trahan

On Fri, 8 Feb 2019 at 15:11, Samuel Trahan - NOAA Affiliate <
samuel.trahan at noaa.gov> wrote:

> Hello all,
>
> Update based on feedback and testing:
>
> 1. I have updated the SLURM vs. Torque logic so you will not have to
> specify your desired target on Theia.  The code now assumes SLURM if
> "sbatch" is in your environment, and Moab/Torque otherwise (sketched
> below).  As long as the default environment on Jet and Theia doesn't
> change, this should be sufficient.
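>
> A minimal sketch of that detection, assuming a simple PATH test in shell
> (the real NEMSCompsetRun logic may differ; the platform names are the
> ones described in the notes below):
>
>     # Assume SLURM when sbatch is on PATH; fall back to Moab/Torque.
>     if command -v sbatch >/dev/null 2>&1; then
>         platform=theia.slurm.intel
>     else
>         platform=theia.intel
>     fi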
>
> 2. On Theia, all compsets match their Moab/Torque baselines when run
> under SLURM.
> 3. On Jet, all compsets match their Moab/Torque baselines when run under
> SLURM -- except fv3_wrtGauss_nemsio_c768, which hangs.  With that one,
> the FV3 prints nothing, hangs forever, and an error message in the system
> logs suggests a Mellanox firmware bug.  I submitted a ticket a few months
> ago and never heard back from the admins.  Until this is fixed, that
> compset is disabled for uJet SLURM.  It is possible this problem is
> specific to uJet, not to SLURM.  The Moab/Torque tests are run on tJet
> right now, because all of uJet is reserved for SLURM.  While the t and u
> Jets are supposed to be identical, that isn't necessarily the case.
> Soon, parts of xJet will be available to SLURM, and we may find the
> answer.  Note that this compset is the closest one to the operational
> configuration; it differs only in the physics selection.
>
> Sincerely,
> Sam Trahan
>
>
>
> On Fri, 8 Feb 2019 at 12:23, Samuel Trahan - NOAA Affiliate <
> samuel.trahan at noaa.gov> wrote:
>
>> Hi all,
>>
>> A NEMS master commit is coming in soon; this is a purely technical one.
>> The NCEPLIBS-pyprodutil master and supported apps' masters will be
>> updated as well.  The relevant branch is called "slurm" in NEMS and
>> NEMSfv3gfs, and "slurm-v2" in NCEPLIBS-pyprodutil.
>>
>> 1. SLURM support for the NEMSfv3gfs app's NEMSCompsetRun on uJet and
>> Theia (see notes below).  Results match the Moab/Torque baselines.
>>
>> 2. A bug fix from Dusan Jovic to eliminate error messages when cleaning
>> FMS and to remove one temporary file created during the cleaning process
>> (see the sketch below).
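>>
>> One common way to make a clean step quiet, shown as an illustrative
>> shell sketch only (not necessarily the fix that was committed):
>>
>>     # "rm -f" exits quietly when the files are already gone, and
>>     # "|| true" keeps a noisy sub-make from aborting the clean step.
>>     rm -f ./*.o ./*.mod
>>     make -C FMS clean 2>/dev/null || true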
>>
>> 3. A major bug fix to the multi-app test system to allow multiple
>> multi-app tests to happen at the same time.  This bug was causing the
>> nightly test website to incorrectly report some branch-specific tests
>> people were running as the nightly test results.  The change adds a
>> "test id" that is passed around; the nightly test uses "ngt" (sketch
>> below).
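>>
>> The idea behind the fix, as a hedged sketch (the directory layout here
>> is hypothetical):
>>
>>     # Key each run's workspace by a test id so concurrent multi-app
>>     # tests cannot clobber one another; the nightly test passes "ngt".
>>     TEST_ID=${1:-ngt}
>>     RESULT_DIR="$HOME/regression-results/$TEST_ID"  # hypothetical path
>>     mkdir -p "$RESULT_DIR"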
>>
>>
>> SLURM porting details:
>>
>>
>> 1. From now on, when running NEMSfv3gfs NEMSCompsetRun on Theia, you
>> will have to specify whether you want a Moab/Torque or SLURM test.
>> NEMSCompsetRun will complain if you don't.
>>
>> To run with Moab/Torque: NEMSCompsetRun --platform theia.intel ...
>> To run with SLURM: NEMSCompsetRun --platform theia.slurm.intel ...
>>
>> Once Moab/Torque is gone, the theia.slurm.intel target will be removed,
>> and theia.intel will use SLURM.
>>
>> 2. On Jet, only uJet has SLURM.  We're expecting parts of xJet to be
>> SLURMified soon, at which point we can add that target.
>>
>> 3. On Theia, SLURM is misconfigured to think there are only 12 cores
>> per node instead of 24 when task geometries are requested.  I've
>> compensated by telling the nightly tests that there are only 12 cores
>> per node, which doubles the number of nodes we use (see the sketch
>> below).  To avoid pounding the machine TOO hard, the Theia SLURM
>> "nightly" tests will only run once a week.  This can be changed once the
>> admins fix the SLURM misconfiguration.
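>>
>> To illustrate the doubling with made-up numbers: a 96-task job on real
>> 24-core nodes needs 4 nodes, but telling SLURM there are only 12 cores
>> per node forces 8.
>>
>>     #SBATCH --ntasks=96          # hypothetical job size
>>     #SBATCH --ntasks-per-node=12 # workaround: pretend 12 cores/node
>>     # 96 tasks / 12 per node = 8 nodes (instead of 96 / 24 = 4)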
>>
>> 4. For now, we're putting the GAEA SLURM port on hold, because GAEA's
>> SLURM configuration may be undergoing a major change in the near future.
>> It presently has a very non-standard configuration that would require
>> extra effort to support.  The new configuration may require very
>> different extra effort, and we don't want to do that work twice.
>>
>> Sincerely,
>> Sam Trahan
>>
>