[Ncep.hmon] Fix for Rocoto's temporarily "unavailable" jobs

Zhan Zhang - NOAA Affiliate zhan.zhang at noaa.gov
Fri Apr 26 14:20:37 UTC 2019


Sam,

I am testing Slurm on Jet and have encountered two problems:
1. Each time I submit a Rocoto job (except the first time), I get the
following prompt:
"Rocoto cycles: <cycledef>201809010000 201809010000 06:00:00</cycledef>
ALERT!
/mnt/lfs3/projects/hwrf-vd/Zhan.Zhang/trunk_slurm/rocoto/hwrf-trunk_slurm-06L-2018090100.xml:
XML file exists.  Overwrite (y/n)?"

I tried both "rocoto/1.3.0-RC4" and
"rocoto/1.3.0-RC4-morestates-longtimeout"; both behave the same way.

2. The "qac" command has responded very slowly (>30 sec) since the system
was switched to Slurm.
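
A stand-alone timing check along the following lines might help show
whether a scheduler query runs past the 30-second limit mentioned in this
thread. It is only a sketch in Python: the helper name time_command is
hypothetical, and "scontrol show job" is a guess at the slow query, since
the thread does not confirm which command Rocoto actually runs.

```python
import subprocess
import time


def time_command(argv, timeout_s=30.0):
    """Run argv and return (status, elapsed_seconds).

    status is "ok", "failed" (non-zero exit code), "timeout"
    (killed after timeout_s seconds), or "missing" (command not found).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timeout"
    except FileNotFoundError:
        status = "missing"
    return status, time.monotonic() - start


if __name__ == "__main__":
    # Assumption: "scontrol show job" stands in for the query that is slow.
    status, elapsed = time_command(["scontrol", "show", "job"], timeout_s=30.0)
    print(f"status={status} elapsed={elapsed:.1f}s")
```

Running the same check with timeout_s=80.0 would show whether the longer
limit is enough to cover the slow responses.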

Thanks.

-Zhan

On Thu, Apr 25, 2019 at 4:56 PM Samuel Trahan - NOAA Affiliate <
samuel.trahan at noaa.gov> wrote:

> Raghu,
>
> I just submitted a ticket, RDHPCS#2019042554000248
>
> Sincerely,
> Sam Trahan
>
> On Thu, 25 Apr 2019 at 16:52, <raghu.reddy at noaa.gov> wrote:
>
>> Hi Sam,
>>
>> Thank you for this information!
>>
>> Can you please let me know the exact command used by Rocoto that is
>> causing this timeout?
>>
>> Is it “scontrol show job …”?
>>
>> It would be useful to create stand-alone tests (which you may already
>> have).
>>
>> Thanks!
>>
>> Raghu
>>
>> *From:* Samuel Trahan - NOAA Affiliate <samuel.trahan at noaa.gov>
>> *Sent:* Thursday, April 25, 2019 4:39 PM
>> *To:* NCEP.EMC.hwrf <NCEP.hwrf at noaa.gov>; _Ncep.hmon <ncep.hmon at noaa.gov>
>> *Cc:* Ghassan Alaka - NOAA Affiliate <ghassan.alaka at noaa.gov>; Guoqing
>> Ge - NOAA Affiliate <guoqing.ge at noaa.gov>; Christopher Harrop <
>> Christopher.W.Harrop at noaa.gov>; Raghu Reddy <raghu.reddy at noaa.gov>
>> *Subject:* Fix for Rocoto's temporarily "unavailable" jobs
>>
>> HWRF/HMON people,
>>
>> Recently, scontrol has sporadically taken longer than Rocoto's built-in
>> limit of 30 seconds to run.  That leads to jobs being in an "unavailable"
>> state until scontrol speeds up.  I have a modified version of Rocoto that
>> has an 80 second timeout.  This fix is on top of the one that detects the
>> "OUT_OF_MEMORY" state jobs.
>>
>> Please let us know if this fixes the problems:
>>
>> module use /mnt/lfs3/projects/hwrf-vd/soft/modulefiles/
>>
>> For RC4:     module load rocoto/1.3.0-RC4-morestates-longtimeout
>>
>> For RC3:     module load rocoto/1.3.0-RC3-morestates-longtimeout
>>
>