[Ncep.hmon] Fix for Rocoto's temporarily "unavailable" jobs

Zhan Zhang - NOAA Affiliate zhan.zhang at noaa.gov
Fri Apr 26 18:07:09 UTC 2019


Sam,

Please ignore my first question; I made a mistake in my cron job. However,
"qac" is still very slow.
Thanks.

-Zhan

On Fri, Apr 26, 2019 at 10:20 AM Zhan Zhang - NOAA Affiliate <
zhan.zhang at noaa.gov> wrote:

> Sam,
>
> I am testing Slurm on jets and have encountered two problems:
> 1. Each time I submit a Rocoto job (except the first time), I get the
> following prompt:
> "Rocoto cycles: <cycledef>201809010000 201809010000 06:00:00</cycledef>
> ALERT!
> /mnt/lfs3/projects/hwrf-vd/Zhan.Zhang/trunk_slurm/rocoto/hwrf-trunk_slurm-06L-2018090100.xml:
> XML file exists.  Overwrite (y/n)?"
>
> I tried both "rocoto/1.3.0-RC4" and
> "rocoto/1.3.0-RC4-morestates-longtimeout"; they behave the same way.
>
> 2. The "qac" command has responded very slowly (>30 sec) since the system
> was switched to Slurm.
>
> Thanks.
>
> -Zhan
>
> On Thu, Apr 25, 2019 at 4:56 PM Samuel Trahan - NOAA Affiliate <
> samuel.trahan at noaa.gov> wrote:
>
>> Raghu,
>>
>> I just submitted a ticket, RDHPCS#2019042554000248
>>
>> Sincerely,
>> Sam Trahan
>>
>> On Thu, 25 Apr 2019 at 16:52, <raghu.reddy at noaa.gov> wrote:
>>
>>> Hi Sam,
>>>
>>>
>>>
>>> Thank you for this information!
>>>
>>>
>>>
>>> Can you please let me know the exact command used by
>>> Rocoto that is causing this timeout?
>>>
>>>
>>>
>>> Is it “scontrol show job …”?
>>>
>>>
>>>
>>> It would be useful to create standalone tests (which you may already
>>> have).
>>>
>>>
>>>
>>> Thanks!
>>>
>>> Raghu
>>>
>>>
>>>
>>> *From:* Samuel Trahan - NOAA Affiliate <samuel.trahan at noaa.gov>
>>> *Sent:* Thursday, April 25, 2019 4:39 PM
>>> *To:* NCEP.EMC.hwrf <NCEP.hwrf at noaa.gov>; _Ncep.hmon <ncep.hmon at noaa.gov>
>>> *Cc:* Ghassan Alaka - NOAA Affiliate <ghassan.alaka at noaa.gov>; Guoqing
>>> Ge - NOAA Affiliate <guoqing.ge at noaa.gov>; Christopher Harrop <
>>> Christopher.W.Harrop at noaa.gov>; Raghu Reddy <raghu.reddy at noaa.gov>
>>> *Subject:* Fix for Rocoto's temporarily "unavailable" jobs
>>>
>>>
>>>
>>> HWRF/HMON people,
>>>
>>>
>>>
>>> Recently, scontrol has sporadically taken longer than Rocoto's built-in
>>> limit of 30 seconds to run.  That leads to jobs being in an "unavailable"
>>> state until scontrol speeds up.  I have a modified version of Rocoto that
>>> has an 80-second timeout.  This fix is on top of the one that detects
>>> jobs in the "OUT_OF_MEMORY" state.
>>>
>>>
>>>
>>> Please let us know if this fixes the problems:
>>>
>>>
>>>
>>> module use /mnt/lfs3/projects/hwrf-vd/soft/modulefiles/
>>>
>>> For RC4:     module load rocoto/1.3.0-RC4-morestates-longtimeout
>>>
>>> For RC3:     module load rocoto/1.3.0-RC3-morestates-longtimeout
>>>
>>>
>>>
>>
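
For the standalone test Raghu mentions, a minimal Python sketch might look
like the following. It assumes the slow call is "scontrol show job <id>",
which the thread asks about but does not confirm. The function name
timed_run and its status labels are illustrative only and are not part of
Rocoto.

```python
import subprocess
import time


def timed_run(cmd, timeout_s):
    """Run cmd (a list of strings) with a time limit.

    Returns (status, elapsed_seconds, stdout), where status is one of
    "ok", "error", "timeout", or "missing" (hypothetical labels).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The command ran past the limit; per the thread, this is the
        # situation in which Rocoto reports a job as "unavailable".
        return "timeout", time.monotonic() - start, ""
    except FileNotFoundError:
        # The executable is not on PATH (e.g. not on a Slurm system).
        return "missing", time.monotonic() - start, ""
    status = "ok" if proc.returncode == 0 else "error"
    return status, time.monotonic() - start, proc.stdout
```

Calling timed_run(["scontrol", "show", "job", job_id], 30) and then again
with a limit of 80 would show whether a given query fits within the
original or the extended timeout. Here job_id is a placeholder for a real
Slurm job ID.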