[Ncep.hmon] Fix for Rocoto's temporarily "unavailable" jobs

Samuel Trahan - NOAA Affiliate samuel.trahan at noaa.gov
Fri Apr 26 18:08:41 UTC 2019


Zhan,

I haven't updated the "qac" suite of tools for squeue yet.  I was focused
on getting Rocoto working because all workflows will break without that.

Sincerely,
Sam Trahan

On Fri, 26 Apr 2019 at 14:07, Zhan Zhang - NOAA Affiliate <
zhan.zhang at noaa.gov> wrote:

> Sam,
>
> Please ignore my first question; I made a mistake in my cron job. However,
> "qac" is still very slow.
> Thanks.
>
> -Zhan
>
> On Fri, Apr 26, 2019 at 10:20 AM Zhan Zhang - NOAA Affiliate <
> zhan.zhang at noaa.gov> wrote:
>
>> Sam,
>>
>> I am testing Slurm on Jet and have encountered two problems:
>> 1. Each time I submit a Rocoto job (except the first time), I get the
>> following prompt:
>> "Rocoto cycles: <cycledef>201809010000 201809010000 06:00:00</cycledef>
>> ALERT!
>> /mnt/lfs3/projects/hwrf-vd/Zhan.Zhang/trunk_slurm/rocoto/hwrf-trunk_slurm-06L-2018090100.xml:
>> XML file exists.  Overwrite (y/n)?"
>>
>> I tried both "rocoto/1.3.0-RC4" and
>> "rocoto/1.3.0-RC4-morestates-longtimeout"; both behave the same way.
>>
>> 2. The "qac" command has responded very slowly (>30 sec) since the system
>> was switched to Slurm.
>>
>> Thanks.
>>
>> -Zhan
>>
>> On Thu, Apr 25, 2019 at 4:56 PM Samuel Trahan - NOAA Affiliate <
>> samuel.trahan at noaa.gov> wrote:
>>
>>> Raghu,
>>>
>>> I just submitted a ticket, RDHPCS#2019042554000248
>>>
>>> Sincerely,
>>> Sam Trahan
>>>
>>> On Thu, 25 Apr 2019 at 16:52, <raghu.reddy at noaa.gov> wrote:
>>>
>>>> Hi Sam,
>>>>
>>>>
>>>>
>>>> Thank you for this information!
>>>>
>>>>
>>>>
>>>> Can you please let me know the exact command Rocoto runs that is
>>>> causing this timeout?
>>>>
>>>>
>>>>
>>>> Is it “scontrol show job …”?
>>>>
>>>>
>>>>
>>>> It would be useful to create standalone tests (which you may already
>>>> have).
>>>>
>>>>
>>>>
>>>> Thanks!
>>>>
>>>> Raghu
>>>>
>>>>
>>>>
>>>> *From:* Samuel Trahan - NOAA Affiliate <samuel.trahan at noaa.gov>
>>>> *Sent:* Thursday, April 25, 2019 4:39 PM
>>>> *To:* NCEP.EMC.hwrf <NCEP.hwrf at noaa.gov>; _Ncep.hmon <
>>>> ncep.hmon at noaa.gov>
>>>> *Cc:* Ghassan Alaka - NOAA Affiliate <ghassan.alaka at noaa.gov>; Guoqing
>>>> Ge - NOAA Affiliate <guoqing.ge at noaa.gov>; Christopher Harrop <
>>>> Christopher.W.Harrop at noaa.gov>; Raghu Reddy <raghu.reddy at noaa.gov>
>>>> *Subject:* Fix for Rocoto's temporarily "unavailable" jobs
>>>>
>>>>
>>>>
>>>> HWRF/HMON people,
>>>>
>>>>
>>>>
>>>> Recently, scontrol has sporadically taken longer than Rocoto's built-in
>>>> limit of 30 seconds to run.  That leads to jobs being in an "unavailable"
>>>> state until scontrol speeds up.  I have a modified version of Rocoto that
>>>> has an 80-second timeout.  This fix is built on top of the one that detects
>>>> jobs in the "OUT_OF_MEMORY" state.
>>>>
>>>>
>>>>
>>>> Please let us know if this fixes the problems:
>>>>
>>>>
>>>>
>>>> module use /mnt/lfs3/projects/hwrf-vd/soft/modulefiles/
>>>>
>>>> For RC4:     module load rocoto/1.3.0-RC4-morestates-longtimeout
>>>>
>>>> For RC3:     module load rocoto/1.3.0-RC3-morestates-longtimeout
>>>>
>>>>
>>>>
>>>
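The standalone test suggested in the thread could be sketched roughly as follows. This is a minimal illustration, not Rocoto's own code: the helper name time_command is hypothetical, and the thread does not confirm which scontrol invocation Rocoto runs, so the command is left for the caller to supply. The sketch runs a command under a timeout and reports how long it took and whether the limit was hit, so that the 30-second and 80-second limits mentioned above can be compared.

```python
import subprocess
import sys
import time


def time_command(cmd, timeout_s):
    """Run cmd (a list of arguments) with a timeout.

    Returns (elapsed_seconds, timed_out). The name and interface are
    hypothetical; this is not how Rocoto itself is implemented.
    """
    start = time.monotonic()
    try:
        # Discard output; only the elapsed time and timeout status matter here.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        timed_out = True
    return time.monotonic() - start, timed_out


def report(cmd, limits=(30, 80)):
    """Time cmd once against the largest limit and print which limits it met."""
    elapsed, timed_out = time_command(cmd, max(limits))
    for limit in limits:
        ok = (not timed_out) and elapsed <= limit
        print(f"limit {limit:>3}s: {'ok' if ok else 'exceeded'}")
    print(f"elapsed: {elapsed:.1f}s")
    return elapsed, timed_out
```

For example, report(["scontrol", "show", "job"]) would time one such query on a system where scontrol is installed; whether that is the exact command Rocoto uses is an assumption. Any other command list, such as [sys.executable, "-c", "pass"], can be used to try the helper elsewhere.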