[Ncep.hmon] Fix for Rocoto's temporarily "unavailable" jobs

Samuel Trahan - NOAA Affiliate samuel.trahan at noaa.gov
Thu Apr 25 20:55:23 UTC 2019


Raghu,

I just submitted a ticket: RDHPCS#2019042554000248.

Sincerely,
Sam Trahan

On Thu, 25 Apr 2019 at 16:52, <raghu.reddy at noaa.gov> wrote:

> Hi Sam,
>
>
>
> Thank you for this information!
>
>
>
> Can you please let me know the exact command used by Rocoto that is
> causing this timeout?
>
>
>
> Is it “scontrol show job …”?
>
>
>
> It would be useful to create standalone tests (which you may already have).
>
>
>
> Thanks!
>
> Raghu
>
>
>
> *From:* Samuel Trahan - NOAA Affiliate <samuel.trahan at noaa.gov>
> *Sent:* Thursday, April 25, 2019 4:39 PM
> *To:* NCEP.EMC.hwrf <NCEP.hwrf at noaa.gov>; _Ncep.hmon <ncep.hmon at noaa.gov>
> *Cc:* Ghassan Alaka - NOAA Affiliate <ghassan.alaka at noaa.gov>; Guoqing Ge
> - NOAA Affiliate <guoqing.ge at noaa.gov>; Christopher Harrop <
> Christopher.W.Harrop at noaa.gov>; Raghu Reddy <raghu.reddy at noaa.gov>
> *Subject:* Fix for Rocoto's temporarily "unavailable" jobs
>
>
>
> HWRF/HMON people,
>
>
>
> Recently, scontrol has sporadically taken longer to run than Rocoto's
> built-in limit of 30 seconds.  When that happens, jobs stay in an
> "unavailable" state until scontrol speeds up.  I have a modified version
> of Rocoto that has an 80-second timeout.  This fix is on top of the one
> that detects jobs in the "OUT_OF_MEMORY" state.
>
>
>
> Please let us know if this fixes the problems:
>
>
>
> module use /mnt/lfs3/projects/hwrf-vd/soft/modulefiles/
>
> For RC4:     module load rocoto/1.3.0-RC4-morestates-longtimeout
>
> For RC3:     module load rocoto/1.3.0-RC3-morestates-longtimeout
>
>
>
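
A standalone test of the kind Raghu asks about could be sketched roughly as
below. It only measures how long an external command takes against the two
limits mentioned in the thread (30 and 80 seconds). The helper names
time_command and check_limits are hypothetical, and the thread does not
confirm which scontrol invocation Rocoto actually runs, so the command is
left as a parameter rather than hard-coded.

```python
import subprocess
import time


def time_command(cmd, timeout_s):
    """Run cmd once and return (elapsed_seconds, timed_out, returncode).

    returncode is None when the command was killed for exceeding timeout_s.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return time.monotonic() - start, False, proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return time.monotonic() - start, True, None


def check_limits(cmd, limits=(30, 80)):
    """Time cmd once per limit (in seconds) and collect the results.

    The defaults are the two timeouts named in the thread: Rocoto's
    built-in 30 seconds and the modified build's 80 seconds.
    """
    results = []
    for limit in limits:
        elapsed, timed_out, returncode = time_command(cmd, limit)
        results.append(
            {
                "limit_s": limit,
                "elapsed_s": elapsed,
                "timed_out": timed_out,
                "returncode": returncode,
            }
        )
    return results
```

On a system with Slurm, something like
check_limits(["scontrol", "show", "job", "<jobid>"]) would be one way to try
it, assuming that is the slow call. Running it repeatedly over a few hours
should show whether the slowdowns are sporadic, as described above.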