[Ncep.hmon] Fix for Rocoto's temporarily "unavailable" jobs

raghu.reddy at noaa.gov raghu.reddy at noaa.gov
Thu Apr 25 20:52:51 UTC 2019


Hi Sam,

 

Thank you for this information!  

 

Can you please let me know the exact command Rocoto uses that is causing this timeout?

 

Is it “scontrol show job …”?

 

It would be useful to create standalone tests (which you may already have).

 

Thanks!

Raghu

 

From: Samuel Trahan - NOAA Affiliate <samuel.trahan at noaa.gov> 
Sent: Thursday, April 25, 2019 4:39 PM
To: NCEP.EMC.hwrf <NCEP.hwrf at noaa.gov>; _Ncep.hmon <ncep.hmon at noaa.gov>
Cc: Ghassan Alaka - NOAA Affiliate <ghassan.alaka at noaa.gov>; Guoqing Ge - NOAA Affiliate <guoqing.ge at noaa.gov>; Christopher Harrop <Christopher.W.Harrop at noaa.gov>; Raghu Reddy <raghu.reddy at noaa.gov>
Subject: Fix for Rocoto's temporarily "unavailable" jobs

 

HWRF/HMON people,

 

Recently, scontrol has sporadically taken longer to run than Rocoto's built-in limit of 30 seconds.  That leaves jobs in an "unavailable" state until scontrol speeds up.  I have a modified version of Rocoto with an 80-second timeout.  This fix is on top of the one that detects jobs in the "OUT_OF_MEMORY" state.
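
To illustrate the behavior described above, here is a minimal Python sketch of a status query that is abandoned after a time limit. It is not Rocoto's code. The function names `query_job_state` and `compare_limits`, the "available" label, and the idea of treating any failed or timed-out query as "unavailable" are assumptions made for illustration. Only the 30 and 80 second limits come from this message.

```python
import subprocess
import time


def query_job_state(cmd, timeout_s):
    """Run a status command and return (state, elapsed_seconds).

    state is "available" if the command exits with code 0 within
    timeout_s seconds, and "unavailable" if it times out, exits with a
    nonzero code, or cannot be started at all. These labels are
    illustrative assumptions, not Rocoto's internal names.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed once the limit passes
        )
        state = "available" if result.returncode == 0 else "unavailable"
    except (subprocess.TimeoutExpired, OSError):
        state = "unavailable"
    return state, time.monotonic() - start


def compare_limits(cmd, limits=(30, 80)):
    """Run cmd once per limit and return {limit: (state, elapsed_seconds)}."""
    return {limit: query_job_state(cmd, limit) for limit in limits}
```

As a standalone timing check, one could call `compare_limits(["scontrol", "show", "job", "12345"])`. The job ID is a placeholder, and whether Rocoto issues this particular scontrol command is an assumption. A query that takes between 30 and 80 seconds would come back "unavailable" under the first limit and "available" under the second.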

 

Please let us know if this fixes the problems:

 

module use /mnt/lfs3/projects/hwrf-vd/soft/modulefiles/

For RC4:     module load rocoto/1.3.0-RC4-morestates-longtimeout

For RC3:     module load rocoto/1.3.0-RC3-morestates-longtimeout

 
