#12: Changes for nests
Keywords:  nest

Comment (by george.gayno@…):

 The orography code is compiled with "-O0" on Cray and with "-O3" on
 Theia.  I asked Jim about this difference and here is his response:

 We left out the -O3 compilation option for the Cray by mistake. It does
 run the code quite a bit faster. The answers do change, but very little.
 Tom has the orography and it looks fine. I was going to suggest updating
 the trunk in a couple of weeks to add the -O3 option on the Cray and also
 to put in the changes necessary to build and run the codes on phase1/2.

 Jim also explained what changes he made to the code and provided some
 timing results:

 I thought you might like to know the changes I made to the orography code.
 Compiling the original code and the optimized code with -O0 (how it is set
 in the current trunk) yields identical answers. I profiled the code and
 wrote a simple timing routine to find where the code spent its time. There
 was some very expensive duplicate code:

      angle = spherical_angle(pnt0, pnt2, pnt1)
      anglesum = anglesum + spherical_angle(pnt0, pnt2, pnt1)

 The function spherical_angle is very expensive, so I replaced the above
 with:

      angle = spherical_angle(pnt0, pnt2, pnt1)
      anglesum = anglesum + angle
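
 As a rough illustration of why this change halves the cost without
 changing the answers, here is a minimal Python sketch. It is not the
 orography code: the body of spherical_angle below is an assumed stand-in
 (the angle at a vertex between two great-circle arcs on the unit sphere),
 and counted, angle_sum_original and angle_sum_optimized are hypothetical
 helper names introduced only for this example.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def spherical_angle(v, a, b):
    """Assumed stand-in: angle at vertex v between the arcs v->a and v->b."""
    n1, n2 = cross(v, a), cross(v, b)
    dot = sum(x * y for x, y in zip(n1, n2))
    norm = math.sqrt(sum(x * x for x in n1) * sum(x * x for x in n2))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def counted(fn):
    """Wrap fn so that the number of calls is recorded in wrapper.calls."""
    def wrapper(*args):
        wrapper.calls += 1
        return fn(*args)
    wrapper.calls = 0
    return wrapper


def angle_sum_original(triples, angle_fn):
    """Original pattern: the expensive function is evaluated twice per point."""
    anglesum = 0.0
    for pnt0, pnt1, pnt2 in triples:
        angle = angle_fn(pnt0, pnt2, pnt1)
        anglesum = anglesum + angle_fn(pnt0, pnt2, pnt1)
    return anglesum


def angle_sum_optimized(triples, angle_fn):
    """Optimized pattern: evaluate once and reuse the stored result."""
    anglesum = 0.0
    for pnt0, pnt1, pnt2 in triples:
        angle = angle_fn(pnt0, pnt2, pnt1)
        anglesum = anglesum + angle
    return anglesum


if __name__ == "__main__":
    triples = [((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))] * 3
    f, g = counted(spherical_angle), counted(spherical_angle)
    print(angle_sum_original(triples, f), f.calls)
    print(angle_sum_optimized(triples, g), g.calls)
```

 Both variants return the same sum, but the second makes half as many
 calls to the expensive function.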

 Then, I threaded loops in 3 routines: MAKEMT2, MAKEPC2, MAKEOA2.

 Finally, I changed -O0 to -O3. This was the only time the answers changed,
 but the change is very small. The speed-up is very nice.

 Here are timings for the orography code for a C96, uniform case:

 tile    orig,-O0  opt,-O0   opt,-O0,6 threads  opt,-O3,6 threads
 1        386       311       93                   43
 2        381       307       97                   48
 3        1160      927       316                  173
 4        390       310       106                  43
 5        391       311       110                  45
 6        1159      917       306                  173

 Here are results from C768, uniform. For the regional work, we will need
 to generate a 7th tile and experiment with nest boundaries to get the nest
 situated where we want it. I wrote a special driver wrapper script that
 runs the orography for tiles 1-4 simultaneously and then tiles 5-7
 simultaneously. This greatly reduces our wall time. The following results
 illustrate this for C768 uniform.

 tile   orig,-O0   opt,-O3,6 threads
 1       874          77
 2       873          73
 3       2146         249
 4       895          76
 5       892          77
 6       1786         236

 So the original code using the default -O0 with no threading ran in 7466
 seconds (with the 7th tile add another 800 seconds or so). The optimized
 compile at -O3 with threading (run with 6 threads) using the wrapper
 finished in 485 seconds. Thus a speed-up of 15x.
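
 The totals quoted above can be reproduced from the C768 table. The sketch
 below is only a check of the arithmetic, written in Python; it assumes the
 wrapper's wall time is the slowest tile of the first batch (tiles 1-4) plus
 the slowest tile of the second batch (tiles 5-6 here, since the table has
 no timing for a 7th tile). The names serial_time and batched_wall_time are
 hypothetical.

```python
# Per-tile timings in seconds, copied from the C768 uniform table.
ORIG_O0 = {1: 874, 2: 873, 3: 2146, 4: 895, 5: 892, 6: 1786}
OPT_O3_6THREADS = {1: 77, 2: 73, 3: 249, 4: 76, 5: 77, 6: 236}

# Assumed wrapper schedule: tiles 1-4 run together, then the remaining tiles.
BATCHES = [[1, 2, 3, 4], [5, 6]]


def serial_time(timings):
    """Wall time when the tiles run one after another."""
    return sum(timings.values())


def batched_wall_time(timings, batches):
    """Wall time when each batch runs concurrently: slowest tile per batch."""
    return sum(max(timings[tile] for tile in batch) for batch in batches)


if __name__ == "__main__":
    before = serial_time(ORIG_O0)                        # 7466
    after = batched_wall_time(OPT_O3_6THREADS, BATCHES)  # 249 + 236 = 485
    print(f"original, serial:   {before} s")
    print(f"optimized, batched: {after} s")
    print(f"speed-up:           {before / after:.1f}x")
```

 Under that assumption the numbers match the 7466 and 485 seconds quoted
 above, giving a ratio of about 15.4.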

