No speedup using Intel MKL libraries?
Alfio Lazzaro
alfio.... at gmail.com
Tue Nov 7 15:10:08 UTC 2017
Well, the sequential version of MKL is the safest choice, since CP2K may
call MKL functions from within OpenMP threads. I must say that this was an
old solution adopted for safety; nowadays MKL is able to dynamically change
the number of threads (see
http://scc.ustc.edu.cn/zlsc/tc4600/intel/2016.0.109/mkl/common/mkl_userguide/GUID-0332F698-1696-4E9F-A908-EA4B1484D114.htm
), so it should be fine to use the threaded version.
Alfio
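
To make the threading concern above concrete, here is a minimal sketch in Python. It assumes the standard environment variables OMP_NUM_THREADS, MKL_NUM_THREADS and MKL_DYNAMIC are what control the thread counts in your build; the helper names (thread_env, worst_case_threads) are hypothetical and only serve the illustration.

```python
import os


def thread_env(omp_threads, mkl_threads=None, mkl_dynamic=True, base=None):
    """Return a copy of the environment with thread settings applied.

    Hypothetical helper: it only prepares environment variables and does
    not launch anything.
    """
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = str(omp_threads)
    if mkl_threads is not None:
        env["MKL_NUM_THREADS"] = str(mkl_threads)
    # With MKL_DYNAMIC=TRUE, MKL may use fewer threads than requested,
    # for example when it detects it is called inside a parallel region.
    env["MKL_DYNAMIC"] = "TRUE" if mkl_dynamic else "FALSE"
    return env


def worst_case_threads(omp_threads, mkl_threads):
    """Upper bound on active threads if every OpenMP thread calls a
    threaded MKL routine and MKL does not reduce its thread count."""
    return omp_threads * mkl_threads


# Example: the 28-core machine mentioned in this thread.
env = thread_env(omp_threads=28, mkl_threads=28, mkl_dynamic=True, base={})
oversubscribed = worst_case_threads(28, 28)
```

The resulting dictionary could be passed as the env argument of subprocess.run when launching the benchmark. The worst-case figure is only an upper bound that motivated linking the sequential library; with dynamic adjustment enabled, MKL is expected to avoid it.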
On Tuesday, 7 November 2017 at 15:40:27 UTC+1, Faraz H wrote:
>
> Thanks, Alfio, for looking deeper into it! I ran more tests and you are
> indeed correct, i.e. the MKL functions are running serially (one CPU), but
> the rest of the code is using all CPUs of the machine. In the local.ssmp
> makefile I saw that libmkl_sequential.a is linked, so I changed it to
> libmkl_intel_thread.a and libiomp5.a. Now the H2O-128 benchmark runs in 5
> minutes on a 28-CPU machine, compared to 7 minutes for the non-MKL linked
> executable.
>
> I wonder if it is a bug that the toolchain links the serial MKL when
> creating the local.ssmp makefiles? In what situation would someone want the
> sequential MKL libraries linked instead of the parallel ones for ssmp?
>
>
> On Monday, November 6, 2017 at 7:11:53 AM UTC-5, Alfio Lazzaro wrote:
>>
>> Dear Faraz,
>> OK, this is the comparison of the two runs for functions where I see the
>> highest timing discrepancy (time in seconds, second column w/ MKL, third
>> column w/o MKL)
>>
>> Function                          w/ MKL (s)   w/o MKL (s)
>> dbcsr_make_untransposed_blocks         4.139         1.591
>> cp_fm_gemm                             5.691         1.087
>> setup_rec_index_2d                     6.330         1.741
>> cp_fm_cholesky_decompose              11.539         1.703
>> cp_fm_cholesky_invert                 26.048         3.031
>>
>> Well, personally I don't understand the differences in the 1st and 3rd
>> lines; likely it was a fluctuation.
>> The other lines are MKL related (DGEMM and Cholesky decomposition). My
>> suspicion is that you are using MKL in sequential mode, while OpenBLAS is
>> somehow using threads. A way to test this is to run with a single thread
>> (or fewer threads in general); the difference should become smaller. I
>> would also suggest using the PSMP version.
>>
>> Alfio
>>
>>
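
As a quick check of the comparison quoted above, the following sketch computes how many times slower each routine ran with MKL than without it. The timings are copied from the table in the message; the dictionary layout and the slowdown helper are assumptions made for illustration only.

```python
# Timings in seconds copied from the quoted table: (with MKL, without MKL).
timings = {
    "dbcsr_make_untransposed_blocks": (4.139, 1.591),
    "cp_fm_gemm": (5.691, 1.087),
    "setup_rec_index_2d": (6.330, 1.741),
    "cp_fm_cholesky_decompose": (11.539, 1.703),
    "cp_fm_cholesky_invert": (26.048, 3.031),
}


def slowdown(with_mkl, without_mkl):
    """Ratio of the MKL run time to the non-MKL run time."""
    return with_mkl / without_mkl


ratios = {name: slowdown(*pair) for name, pair in timings.items()}

# Print the routines from largest to smallest slowdown.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:32s} {ratio:5.2f}x")
```

The largest ratios belong to the Cholesky and GEMM routines, which fits the suspicion that a sequential MKL was being compared against a threaded BLAS.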
>> On Thursday, 2 November 2017 at 15:33:13 UTC+1, Faraz H wrote:
>>>
>>> Thanks, I am attaching the output of two runs: one with the gcc4.9
>>> executable and the other with the MKL libraries and gcc4.9. Interestingly,
>>> the results are not always consistent when I run the model multiple times.
>>> Sometimes the MKL one is faster by ~30 seconds overall, sometimes slower.
>>> So perhaps something is going on with my system. Curious what you see.
>>>
>>