[Ncep.list.fv3-announce] Upcoming NEMS Commit

Samuel Trahan - NOAA Affiliate samuel.trahan at noaa.gov
Mon Feb 11 18:48:14 UTC 2019


Hello all,

The commit is in.  Please let me know if you find any problems.

The compset logs will not reflect that fv3_lheatstrg has been run because I
ran it in a separate test, just before my prior email.

Sincerely,
Sam Trahan

On Mon, 11 Feb 2019 at 12:54, Samuel Trahan - NOAA Affiliate <
samuel.trahan at noaa.gov> wrote:

> Hello all,
>
> My commit is taking slightly longer than anticipated.  I'm predicting I'll
> finish by 2 PM.  All repositories except NEMSfv3gfs have been updated; I
> neglected to test the fv3_lheatstrg compset for that app with SLURM.  In
> the unlikely event that it fails, I'll disable that compset for SLURM and
> commit.
>
> Sincerely,
> Sam Trahan
>
> On Mon, 11 Feb 2019 at 10:43, Samuel Trahan - NOAA Affiliate <
> samuel.trahan at noaa.gov> wrote:
>
>> Hello all,
>>
>> The aforementioned commit will happen between about noon and 1 PM Eastern
>> time today.
>>
>> As a reminder, this commit will only affect you if:
>>
>> 1. You are running the multi-machine, multi-app regression test system
>> that we use for huge NEMS commits and nightly tests, or
>>
>> 2. You are annoyed by the spurious error messages from the "make clean"
>> of FMS in NEMSfv3gfs's compile.sh.
>>
>> Sincerely,
>> Sam Trahan
>>
>> On Fri, 8 Feb 2019 at 15:11, Samuel Trahan - NOAA Affiliate <
>> samuel.trahan at noaa.gov> wrote:
>>
>>> Hello all,
>>>
>>> Update based on feedback and testing:
>>>
>>> 1. I have updated the SLURM vs. Torque logic so you will not have to
>>> specify your desired target on Theia.  The code now assumes you are using
>>> SLURM if "sbatch" is in your environment, and will assume you want
>>> Moab/Torque otherwise.  As long as the default environment on Jet and Theia
>>> doesn't change, this should be sufficient.
>>>
>>> 2. On Theia, all Moab/Torque compsets match baselines when run in SLURM.
>>>
>>> 3. On Jet, all Moab/Torque compsets match baselines when run in SLURM,
>>> except fv3_wrtGauss_nemsio_c768, which hangs.  With that one, FV3
>>> prints nothing and hangs forever, and an error message in the
>>> system logs suggests a Mellanox firmware bug.  I submitted a ticket a few
>>> months ago and never heard back from the admins.  Until this is fixed, that
>>> compset is disabled on uJet SLURM.  It is possible this problem is specific
>>> to uJet, not to SLURM.  The Moab/Torque tests are run on tJet right now,
>>> because all of uJet is reserved for SLURM.  While the t and u Jets are
>>> supposed to be identical, that isn't necessarily the case.  Soon, parts of
>>> xJet will be available to SLURM, and we may find the answer.  Note that
>>> this compset is the closest one to the operational configuration; it
>>> differs just in the physics selection.
>>>
>>> Sincerely,
>>> Sam Trahan
>>>
>>>
>>>
>>> On Fri, 8 Feb 2019 at 12:23, Samuel Trahan - NOAA Affiliate <
>>> samuel.trahan at noaa.gov> wrote:
>>>
>>>> Hi all,
>>>>
>>>> A NEMS master commit is coming in soon; this is a purely technical
>>>> one.  The NCEPLIBS-pyprodutil master and the supported apps' masters will be
>>>> updated as well.  The relevant branch is called "slurm" in NEMS and
>>>> NEMSfv3gfs, and "slurm-v2" in NCEPLIBS-pyprodutil.
>>>>
>>>> 1. SLURM support for NEMSfv3gfs app's NEMSCompsetRun on uJet and Theia
>>>> (see notes below).  Results match the Moab/Torque baselines.
>>>>
>>>> 2. Bug fix from Dusan Jovic to eliminate error messages when cleaning
>>>> FMS, and remove one temporary file created during the cleaning process.
>>>>
>>>> 3. Major bug fix to the multi-app test system to allow multiple
>>>> multi-app tests to happen at the same time.  This bug was causing the
>>>> nightly test website to incorrectly report some branch-specific tests
>>>> people were doing as the nightly test results.  The change adds a "test id"
>>>> that is passed around; the nightly test uses "ngt".
>>>>
>>>>
>>>> SLURM porting details:
>>>>
>>>>
>>>> 1. From now on, when running NEMSfv3gfs NEMSCompsetRun on Theia, you
>>>> will have to specify whether you want a Moab/Torque or SLURM test.  The
>>>> NEMSCompsetRun will complain if you don't.
>>>>
>>>> To run with Moab/Torque: NEMSCompsetRun --platform theia.intel ...
>>>> To run with SLURM: NEMSCompsetRun --platform theia.slurm.intel ...
>>>>
>>>> Once Moab/Torque are gone, theia.slurm.intel will be removed, and
>>>> theia.intel will use SLURM.
>>>>
>>>> 2. On Jet, only uJet has SLURM.  We're expecting parts of xJet to be
>>>> SLURMified soon, at which point we can add that target.
>>>>
>>>> 3. On Theia, SLURM is misconfigured to think there are only 12
>>>> cores per node instead of 24 when task geometries are requested.  I've
>>>> compensated by telling the nightly tests that there are only 12 cores per
>>>> node, which doubles the number of nodes we use.  To avoid pounding the
>>>> machine TOO hard, the Theia SLURM "nightly" tests will only run once a
>>>> week.  This can be changed once the admins fix the SLURM misconfiguration.
>>>>
>>>> 4. For now, we're putting the GAEA SLURM port on hold.  This is because
>>>> GAEA's SLURM configuration may be undergoing a major change in the near
>>>> future.  Presently it has a very non-standard configuration which would
>>>> require extra effort to support.  The new configuration may require very
>>>> different extra effort, and we don't want to do that twice.
>>>>
>>>> Sincerely,
>>>> Sam Trahan
>>>>
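To illustrate item 3 of the SLURM porting details: telling the tests that a node has 12 cores instead of 24 doubles the node count for any job whose tasks fill whole nodes. The following is a minimal Python sketch of that arithmetic only. The function name, the one-task-per-core placement, and the 96-task job size are assumptions made for illustration and are not taken from the NEMS test system.

```python
import math

def nodes_needed(tasks, cores_per_node):
    """Whole nodes required to place `tasks` ranks, assuming one rank per core."""
    return math.ceil(tasks / cores_per_node)

# Hypothetical job size, chosen only for illustration.
tasks = 96

# 24 is the real core count per Theia node according to the message;
# 12 is the value the nightly tests are told, to match the SLURM misconfiguration.
actual = nodes_needed(tasks, 24)
workaround = nodes_needed(tasks, 12)

print(actual, workaround)  # prints "4 8"
```

The doubling is exact only when the task count is a multiple of 24; smaller or uneven jobs can need fewer than twice as many nodes.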
>>>
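The scheduler selection described in the 8 Feb update (assume SLURM if "sbatch" is in your environment, otherwise Moab/Torque) can be sketched roughly as below. This is not the NEMS implementation: the function name and return strings are hypothetical, and reading "in your environment" as "an sbatch executable is found on PATH" is an assumption.

```python
import shutil

def detect_scheduler(which=shutil.which):
    """Return "slurm" if an sbatch executable is found, else "moab-torque".

    `which` is injectable so the logic can be exercised without a real cluster.
    Assumption: "sbatch is in your environment" means it is on PATH.
    """
    return "slurm" if which("sbatch") is not None else "moab-torque"

# On a machine without SLURM loaded this reports "moab-torque".
print(detect_scheduler())
```

Because the choice depends only on the current environment, loading or unloading a SLURM module before running the tests would change the result, which is why the message notes that the default environment on Jet and Theia must not change.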