[Ncep.list.emc.fv3gfs_tickets] FV3GFS Ticket #12: Changes for nests

Wed Apr 5 12:11:54 UTC 2017

#12: Changes for nests
--------------------------+----------------------------
  Reporter:  tom.black@…  |      Owner:  george.gayno@…
      Type:  enhancement  |     Status:  accepted
  Priority:  major        |  Milestone:
 Component:  component1   |    Version:
Resolution:               |   Keywords:  nest
--------------------------+----------------------------

Comment (by george.gayno@…):

 The orography code is compiled with "-O0" on
 [https://svnemc.ncep.noaa.gov/trac/fv3gfs/browser/branches/black/fv3gfs/global_shared.v15.0.0/sorc/orog.fd/makefile.sh_cray?rev=90792
 Cray] and with "-O3" on
 [https://svnemc.ncep.noaa.gov/trac/fv3gfs/browser/branches/black/fv3gfs/global_shared.v15.0.0/sorc/orog.fd/makefile.sh_theia?rev=90792
 Theia].  I asked Jim about this difference and here is his response:

 {{{
 We left out the -O3 compilation option for the Cray by mistake. It does run
 the code quite a bit faster. The answers do change but very little. Tom has
 checked the orography and it looks fine. I was going to suggest updating the
 trunk in a couple of weeks to add the -O3 option on the Cray and also to put
 in files necessary to build and run the codes on phase1/2.
 }}}

 Jim also explained what changes he made to the code and provided some
 timing results:

 {{{
 I thought you might like to know the changes I made to the orography code.
 Compiling the original code and the optimized code with -O0 (how it is set
 in the current trunk) yields identical answers. I profiled the code and also
 wrote a simple timing routine to find where the code spent its time. There
 was some very expensive duplicate code:

      angle = spherical_angle(pnt0, pnt2, pnt1)
      anglesum = anglesum + spherical_angle(pnt0, pnt2, pnt1)

 The function spherical_angle is very expensive, so I replaced the above code
 with:

      angle = spherical_angle(pnt0, pnt2, pnt1)
      anglesum = anglesum + angle

 Then, I threaded loops in 3 routines: MAKEMT2, MAKEPC2, MAKEOA2.

 Finally, I changed -O0 to -O3. This was the only time the answers changed,
 and the change is very small. The speed-up is very nice.

 Here are timings for the orography code for a C96, uniform case:

 tile   orig,-O0   opt,-O0   opt,-O0,6 threads   opt,-O3,6 threads
 1      386        311       93                  43
 2      381        307       97                  48
 3      1160       927       316                 173
 4      390        310       106                 43
 5      391        311       110                 45
 6      1159       917       306                 173

 Here are results from C768, uniform. For the regional work, we will need to
 generate a 7th tile and experiment with nest boundaries to get the nest
 situated where we want it. I wrote a special driver wrapper script that runs
 the orography for tiles 1-4 simultaneously and then tiles 5-7
 simultaneously. This greatly reduces our wall time. The following results
 illustrate this for the C768 uniform.

 tile   orig,-O0   opt,-O3,6 threads
 1      874        77
 2      873        73
 3      2146       249
 4      895        76
 5      892        77
 6      1786       236

 So the original code using the default -O0 with no threading ran in 7466
 seconds (with the 7th tile add another 800 seconds or so). The optimized
 code compiled at -O3 with threading (run with 6 threads) using the wrapper
 script finished in 485 seconds. Thus a speed-up of about 15x.
 }}}
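
 The duplicate-call fix Jim describes can be illustrated with a small Python
 stand-in for the Fortran. This is only a sketch: the spherical_angle body
 below is placeholder arithmetic rather than the real formula, and the
 surrounding loop, the helper names and the sample points are assumptions made
 for illustration. It shows that reusing the stored value halves the number of
 expensive calls and leaves the accumulated sum unchanged, which matches the
 report of identical answers at -O0.

```python
import math

call_count = 0  # number of times the expensive function has been evaluated


def spherical_angle(p0, p1, p2):
    """Hypothetical stand-in for the expensive Fortran function."""
    global call_count
    call_count += 1
    # Placeholder arithmetic only; this is NOT the real spherical-angle formula.
    return math.fsum(a * b * c for a, b, c in zip(p0, p1, p2))


def sum_angles_original(triples):
    """Mirrors the original code: the function is evaluated twice per pass."""
    anglesum = 0.0
    for pnt0, pnt2, pnt1 in triples:
        angle = spherical_angle(pnt0, pnt2, pnt1)
        anglesum = anglesum + spherical_angle(pnt0, pnt2, pnt1)
    return anglesum


def sum_angles_optimized(triples):
    """Mirrors the fix: evaluate once, then reuse the stored value."""
    anglesum = 0.0
    for pnt0, pnt2, pnt1 in triples:
        angle = spherical_angle(pnt0, pnt2, pnt1)
        anglesum = anglesum + angle
    return anglesum


def run_and_count(fn, triples):
    """Run fn and report its result together with the number of calls made."""
    global call_count
    call_count = 0
    result = fn(triples)
    return result, call_count


# Made-up sample points, repeated to imitate a loop over many grid cells.
triples = [((1.0, 2.0, 3.0), (0.5, 0.25, 0.125), (2.0, 4.0, 8.0))] * 1000

orig_sum, orig_calls = run_and_count(sum_angles_original, triples)
opt_sum, opt_calls = run_and_count(sum_angles_optimized, triples)

print("original :", orig_calls, "calls")
print("optimized:", opt_calls, "calls")
print("same sum :", orig_sum == opt_sum)
```

 The saving only holds because the function is deterministic and has no side
 effects, so the second evaluation could never have returned anything new.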
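
 The 485-second figure is consistent with the wrapper batching Jim describes.
 With only six tiles in the uniform case, the slowest tile in the first batch
 (tile 3, 249 s) plus the slowest tile in the second (tile 6, 236 s) gives
 485 s. The Python sketch below reproduces that arithmetic from the C768 table.
 It assumes that each batch finishes when its slowest tile does and that the
 second batch holds just tiles 5 and 6. The names are illustrative and nothing
 here is taken from the actual wrapper script.

```python
# Per-tile timings in seconds for the C768 uniform case, copied from the table.
ORIG_O0 = {1: 874, 2: 873, 3: 2146, 4: 895, 5: 892, 6: 1786}
OPT_O3_6THREADS = {1: 77, 2: 73, 3: 249, 4: 76, 5: 77, 6: 236}

# Assumed batching for the six-tile uniform case: tiles 1-4, then tiles 5-6.
BATCHES = [(1, 2, 3, 4), (5, 6)]


def serial_time(timings):
    """Wall time when the tiles run one after another."""
    return sum(timings.values())


def batched_time(timings, batches):
    """Wall time when each batch runs concurrently and batches run in sequence.

    Assumes a batch is finished when its slowest tile is finished.
    """
    return sum(max(timings[tile] for tile in batch) for batch in batches)


serial = serial_time(ORIG_O0)
batched = batched_time(OPT_O3_6THREADS, BATCHES)
speedup = serial / batched

print(f"serial, original -O0:    {serial} s")
print(f"batched, optimized -O3:  {batched} s")
print(f"speed-up:                {speedup:.1f}x")
```

 Under these assumptions the wall time is set by tiles 3 and 6, which take
 roughly three times as long as the others in the optimized runs.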

--
Ticket URL: <https://svnemc.ncep.noaa.gov/trac/fv3gfs/ticket/12#comment:8>
fv3gfs <https://svnemc.ncep.noaa.gov/trac/fv3gfs>
NGGPS FV3GFS Development