[Ncep.hmon] Fix for Rocoto's temporarily "unavailable" jobs
raghu.reddy at noaa.gov
Thu Apr 25 20:52:51 UTC 2019
Hi Sam,
Thank you for this information!
Can you please let me know the exact command Rocoto runs that is causing this timeout?
Is it “scontrol show job …”?
It would be useful to create standalone tests (which you may already have).
Thanks!
Raghu
From: Samuel Trahan - NOAA Affiliate <samuel.trahan at noaa.gov>
Sent: Thursday, April 25, 2019 4:39 PM
To: NCEP.EMC.hwrf <NCEP.hwrf at noaa.gov>; _Ncep.hmon <ncep.hmon at noaa.gov>
Cc: Ghassan Alaka - NOAA Affiliate <ghassan.alaka at noaa.gov>; Guoqing Ge - NOAA Affiliate <guoqing.ge at noaa.gov>; Christopher Harrop <Christopher.W.Harrop at noaa.gov>; Raghu Reddy <raghu.reddy at noaa.gov>
Subject: Fix for Rocoto's temporarily "unavailable" jobs
HWRF/HMON people,
Recently, scontrol has sporadically taken longer to run than Rocoto's built-in limit of 30 seconds. When that happens, jobs stay in an "unavailable" state until scontrol speeds up. I have a modified version of Rocoto with an 80-second timeout. This fix is applied on top of the one that detects jobs in the "OUT_OF_MEMORY" state.
Please let us know if this fixes the problems:
module use /mnt/lfs3/projects/hwrf-vd/soft/modulefiles/
For RC4: module load rocoto/1.3.0-RC4-morestates-longtimeout
For RC3: module load rocoto/1.3.0-RC3-morestates-longtimeout
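A standalone timing check along the following lines might help confirm whether scontrol is exceeding these limits. This is only a minimal Python sketch, not Rocoto's own code. The helper name time_command is hypothetical, and since this thread does not state which scontrol command Rocoto runs, the command to time is taken from the command line; the job ID in the usage comment is a placeholder. The 30 and 80 second limits are the ones mentioned above.

```python
import subprocess
import sys
import time

OLD_LIMIT_S = 30  # Rocoto's built-in limit, per the message above
NEW_LIMIT_S = 80  # limit in the modified Rocoto build


def time_command(cmd, timeout_s):
    """Run cmd and return (finished, elapsed_seconds).

    finished is False if the command was killed after timeout_s seconds.
    """
    start = time.monotonic()
    try:
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=timeout_s)
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.monotonic() - start


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage (12345 is a placeholder job ID):
    #   python time_cmd.py scontrol show job 12345
    finished, elapsed = time_command(sys.argv[1:], NEW_LIMIT_S)
    if not finished:
        print(f"did not finish within {NEW_LIMIT_S} s")
    elif elapsed > OLD_LIMIT_S:
        print(f"took {elapsed:.1f} s: over {OLD_LIMIT_S} s, within {NEW_LIMIT_S} s")
    else:
        print(f"took {elapsed:.1f} s: within {OLD_LIMIT_S} s")
```

Running this repeatedly (for example from cron) and logging the output would show how often the command falls between the two limits, which is the case the longer timeout is meant to cover.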