[Ncep.list.fv3-announce] Upcoming NEMS Commit

Samuel Trahan - NOAA Affiliate samuel.trahan at noaa.gov
Fri Feb 8 20:11:13 UTC 2019


Hello all,

Update based on feedback and testing:

1. I have updated the SLURM vs. Torque logic so you will not have to
specify your desired target on Theia.  The code now assumes you are using
SLURM if "sbatch" is in your environment, and assumes you want
Moab/Torque otherwise.  As long as the default environment on Jet and Theia
doesn't change, this should be sufficient.

2. On Theia, all Moab/Torque compsets match baselines when run in SLURM.

3. On Jet, all Moab/Torque compsets match baselines when run in SLURM,
except fv3_wrtGauss_nemsio_c768, which hangs.  In that compset, the FV3
prints nothing and hangs forever, and an error message in the system logs
suggests a Mellanox firmware bug.  I submitted a ticket a few months ago and
never heard back from the admins.  Until this is fixed, that compset is
disabled on uJet SLURM.  It is possible this problem is specific to uJet,
not to SLURM.  The Moab/Torque tests are run on tJet right now, because all
of uJet is reserved for SLURM.  While tJet and uJet are supposed to be
identical, that isn't necessarily the case.  Soon, parts of xJet will be
available to SLURM, and we may find the answer.  Note that this compset is
the closest one to the operational configuration; it differs only in the
physics selection.
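
For anyone curious, the scheduler detection in item 1 can be sketched roughly
as below.  This is only an illustration, not the code in the commit: the
function name detect_scheduler and its return strings are made up here, and it
assumes that "in your environment" means an "sbatch" executable can be found
on the search path.

```python
import shutil


def detect_scheduler(path=None):
    """Guess the batch system from whether an sbatch executable is found.

    Hypothetical helper: the name and return values are illustrative only.
    `path` is a search path string; None means use the PATH variable.
    """
    # Assumption: "sbatch is in your environment" == sbatch is on the path.
    if shutil.which("sbatch", path=path) is not None:
        return "slurm"
    # No sbatch found, so fall back to Moab/Torque.
    return "moab-torque"


if __name__ == "__main__":
    print(detect_scheduler())
```

A check like this depends only on the login environment, which is why it keeps
working only as long as the default environment on Jet and Theia stays the
same.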

Sincerely,
Sam Trahan



On Fri, 8 Feb 2019 at 12:23, Samuel Trahan - NOAA Affiliate <
samuel.trahan at noaa.gov> wrote:

> Hi all,
>
> A NEMS master commit is coming in soon; this is a purely technical one.
> The NCEPLIBS-pyprodutil master and Supported apps' masters will be updated
> as well.  The relevant branch is called "slurm" in NEMS and NEMSfv3gfs, and
> "slurm-v2" in NCEPLIBS-pyprodutil.
>
> 1. SLURM support for NEMSfv3gfs app's NEMSCompsetRun on uJet and Theia
> (see notes below).  Results match the Moab/Torque baselines.
>
> 2. Bug fix from Dusan Jovic to eliminate error messages when cleaning FMS,
> and remove one temporary file created during the cleaning process.
>
> 3. Major bug fix to the multi-app test system to allow multiple multi-app
> tests to run at the same time.  This bug was causing the nightly test
> website to incorrectly report some people's branch-specific tests as the
> nightly test results.  The change adds a "test id" that is passed around;
> the nightly test uses "ngt".
>
>
> SLURM porting details:
>
>
> 1. From now on, when running NEMSfv3gfs NEMSCompsetRun on Theia, you will
> have to specify whether you want a Moab/Torque or SLURM test.  NEMSCompsetRun
> will complain if you don't.
>
> To run with Moab/Torque: NEMSCompsetRun --platform theia.intel ...
> To run with SLURM: NEMSCompsetRun --platform theia.slurm.intel ...
>
> Once Moab/Torque are gone, the theia.slurm.intel platform will be removed,
> and theia.intel will use SLURM.
>
> 2. On Jet, only uJet has SLURM.  We're expecting parts of xJet to be
> SLURMified soon, at which point we can add that target.
>
> 3. On Theia, SLURM is misconfigured to think there are only 12 cores
> per node instead of 24 when task geometries are requested.  I've
> compensated by telling the nightly tests that there are only 12 cores per
> node, which doubles the number of nodes we use.  To avoid pounding the
> machine TOO hard, the Theia SLURM "nightly" tests will only run once a
> week.  This can be changed once the admins fix the SLURM misconfiguration.
>
> 4. For now, we're putting the GAEA SLURM port on hold.  This is because
> GAEA's SLURM configuration may be undergoing a major change in the near
> future.  Presently it has a very non-standard configuration which would
> require extra effort to support.  The new configuration may require very
> different extra effort, and we don't want to do that twice.
>
> Sincerely,
> Sam Trahan
>
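
The node doubling described in item 3 of the SLURM porting details in the
quoted message is simple arithmetic, sketched below.  This is an illustration
only: the function nodes_needed and the 96-task job size are hypothetical, and
it assumes one MPI task per core with no threading.

```python
import math


def nodes_needed(mpi_tasks, cores_per_node):
    """Whole nodes required to place mpi_tasks at one task per core.

    Hypothetical helper for illustration; not part of NEMS.
    """
    if mpi_tasks < 1 or cores_per_node < 1:
        raise ValueError("task and core counts must be positive")
    return math.ceil(mpi_tasks / cores_per_node)


if __name__ == "__main__":
    tasks = 96  # hypothetical job size, not taken from any compset
    print(nodes_needed(tasks, 24))  # real Theia nodes: 4
    print(nodes_needed(tasks, 12))  # what the tests are told: 8
```

The doubling is exact when the task count is a multiple of 24; for other sizes
the rounding up to whole nodes can make the ratio slightly smaller.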