Dear community,
do any of you have experience with the new Intel Xeon Phi architecture and ROMS performance?
Thanks in advance for your time,
cheers
Ivica
Experience with new Intel Xeon Phi
Re: Experience with new Intel Xeon Phi
Hi Ivica,
I have some experience to share for ROMS on the Xeon Phi. I hope you find this useful. Do get back if you want more information.
I have been working, together with Intel, on the standard ROMS "benchmark" application using the large ocean_benchmark3.in case. The workload that we used is at https://www.myroms.org/svn/src/tags/rom ... chmark3.in
The changes were mostly aimed at improving the vectorization efficiency of the code on the Phi. We worked on the MPI build, since we found that the OpenMP build suffers from synchronization overhead that slows the run down considerably (almost 3x) compared to the MPI build.
1. Changes to align and pad the arrays so that the loops vectorize efficiently (see the sketch after this list).
2. The hotspot routine was lmd_skpp.F; we applied some code transformations (mostly loop splitting, loop interchange, and changes to the conditionals) to improve vectorization and cache efficiency.
3. Compiler switches to produce more efficient code on the Phi.
4. Use of 2 MB pages for TLB efficiency.
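For illustration, here is a minimal Fortran sketch of the kind of changes meant in items 1 and 2 (64-byte array alignment and padding, loop interchange, and loop splitting around a conditional). This is not the actual lmd_skpp.F code; the array names, sizes, and the ifort flags mentioned in the comments are assumptions made only to keep the example self-contained.

! Minimal sketch, not the actual ROMS/lmd_skpp.F code.
! Hypothetical build flags: ifort -O3 -align array64byte (plus -mmic for a
! native Knights Corner build); treat these as assumptions, not a recipe.
program vec_sketch
  implicit none
  integer, parameter :: N    = 1027   ! hypothetical grid extent
  integer, parameter :: NPAD = 1032   ! leading dimension padded to a
                                      ! multiple of 8 doubles (64 bytes)
  real(8) :: a(NPAD,N), b(NPAD,N), c(NPAD,N)
  !DIR$ ATTRIBUTES ALIGN : 64 :: a, b, c   ! item 1: 64-byte aligned arrays
  integer :: i, j

  a = 1.0d0; b = 2.0d0; c = 0.0d0

  ! Loop interchange: keep the innermost loop on the first (contiguous)
  ! dimension so loads/stores are unit stride and the loop vectorizes.
  do j = 1, N
     !DIR$ VECTOR ALIGNED
     do i = 1, NPAD
        c(i,j) = a(i,j) + b(i,j)
     end do
  end do

  ! Loop splitting / conditional change: instead of one loop mixing the
  ! update and an IF test, split it into a simple arithmetic loop followed
  ! by a masked clipping loop; both vectorize more cleanly.
  do j = 1, N
     do i = 1, NPAD
        c(i,j) = c(i,j)*a(i,j) + b(i,j)
     end do
     do i = 1, NPAD
        if (c(i,j) > 4.0d0) c(i,j) = 4.0d0
     end do
  end do

  print *, 'checksum:', sum(c)
end program vec_sketch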
With these changes, we were able to bring the native runtime down from 153 s to 62 s (roughly a 2.5x speedup).
Since this workload is too small to scale well, the grid size was doubled in both dimensions; for that larger workload, we improved on the Ivy Bridge performance from 136 s to 109 s in symmetric mode.