No speedup using Intel MKL libraries?

Alfio Lazzaro alfio.... at gmail.com
Tue Nov 7 16:10:08 CET 2017


Well, the sequential version of MKL is arguably the best choice, since CP2K 
may call MKL functions from within OpenMP threads. I must say that this was 
an old solution chosen for safety; nowadays MKL is able to dynamically 
change the number of threads (see 

http://scc.ustc.edu.cn/zlsc/tc4600/intel/2016.0.109/mkl/common/mkl_userguide/GUID-0332F698-1696-4E9F-A908-EA4B1484D114.htm

), so it should be fine to use the threaded version.

Alfio
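
A minimal sketch of how a run with the threaded MKL could be configured from a script. It only builds the environment; OMP_NUM_THREADS, MKL_NUM_THREADS and MKL_DYNAMIC are standard OpenMP/MKL environment variables, while the helper name threaded_mkl_env and the thread count of 28 are assumptions made for illustration.

```python
import os


def threaded_mkl_env(n_threads, base_env=None):
    """Return a copy of the environment with OpenMP/MKL thread settings.

    Hypothetical helper, not part of CP2K or MKL.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Threads used by the OpenMP regions of the application itself.
    env["OMP_NUM_THREADS"] = str(n_threads)
    # Upper bound on the threads MKL may use inside its own routines.
    env["MKL_NUM_THREADS"] = str(n_threads)
    # Let MKL adjust its thread count at run time (this is MKL's default).
    env["MKL_DYNAMIC"] = "TRUE"
    return env


# Start from an empty environment so the printed result is reproducible.
env = threaded_mkl_env(28, base_env={})
for name in sorted(env):
    print(f"{name}={env[name]}")
```

The resulting dictionary could be passed as env= to subprocess.run when launching the executable. Calling the helper with n_threads=1 gives the single-thread comparison suggested further down the thread.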

On Tuesday, 7 November 2017 at 15:40:27 UTC+1, Faraz H wrote:
>
> Thanks Alfio for looking deeper into it!  I ran more tests and you are 
> indeed correct, i.e. the MKL functions are running serially (on one CPU), 
> but the rest of the code is using all CPUs of the machine. In the local.ssmp 
> makefile I saw that libmkl_sequential.a is linked, so I changed it to 
> libmkl_intel_thread.a and libiomp5.a. Now the H2O-128 benchmark runs in 5 
> minutes on a 28-CPU machine, compared to 7 minutes for the non-MKL linked 
> executable. 
>
> I wonder if it is a bug that the toolchain links the serial MKL when 
> creating the local.ssmp makefiles? In what situation would someone want the 
> sequential MKL libraries linked instead of the parallel ones for ssmp?
>
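
The improvement reported above can be put in numbers with a short calculation. This is only arithmetic on the two wall times quoted in the message (7 minutes without MKL, 5 minutes with the threaded MKL); the function name speedup is an illustrative choice.

```python
def speedup(baseline_minutes, new_minutes):
    """Ratio of the baseline wall time to the new wall time."""
    return baseline_minutes / new_minutes


baseline = 7.0  # minutes, executable built without MKL
threaded = 5.0  # minutes, executable linked against the threaded MKL

s = speedup(baseline, threaded)
reduction = 1.0 - threaded / baseline

print(f"speedup: {s:.2f}x, wall time reduced by {reduction:.0%}")
# prints: speedup: 1.40x, wall time reduced by 29%
```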
>
> On Monday, November 6, 2017 at 7:11:53 AM UTC-5, Alfio Lazzaro wrote:
>>
>> Dear Faraz,
>> OK, this is the comparison of the two runs for functions where I see the 
>> highest timing discrepancy (time in seconds, second column w/ MKL, third 
>> column w/o MKL)
>>
>> dbcsr_make_untransposed_blocks     4.139     1.591
>> cp_fm_gemm                         5.691     1.087
>> setup_rec_index_2d                 6.330     1.741
>> cp_fm_cholesky_decompose          11.539     1.703  
>> cp_fm_cholesky_invert             26.048     3.031 
>>
>> Well, personally I don't understand the differences in the 1st and 3rd 
>> lines; likely they were fluctuations.
>> The other lines are MKL related (DGEMM and 
>> Cholesky decomposition). My suspicion is that you are using MKL in 
>> sequential mode, while OpenBLAS is somehow using threads. A way to test 
>> this is to run with a single thread (or fewer threads in general); the 
>> difference should become smaller. I would also suggest using the PSMP version.
>>
>> Alfio
>>
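
To make the discrepancy in the timing table easier to see, the per-routine ratio of the two columns can be computed as sketched below. The timings are copied from the table in the message above; the variable names are illustrative.

```python
# (routine, seconds with MKL, seconds without MKL), as listed in the table
timings = [
    ("dbcsr_make_untransposed_blocks", 4.139, 1.591),
    ("cp_fm_gemm", 5.691, 1.087),
    ("setup_rec_index_2d", 6.330, 1.741),
    ("cp_fm_cholesky_decompose", 11.539, 1.703),
    ("cp_fm_cholesky_invert", 26.048, 3.031),
]

# Ratio > 1 means the MKL-linked run was slower for that routine.
ratios = {name: with_mkl / without_mkl for name, with_mkl, without_mkl in timings}

# Print the routines from the largest to the smallest slowdown.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:32s} {ratio:5.2f}x")

worst = max(ratios, key=ratios.get)
```

The two Cholesky routines and cp_fm_gemm show the largest ratios, which is in line with the suspicion that the MKL calls were running on a single thread.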
>>
>> On Thursday, 2 November 2017 at 15:33:13 UTC+1, Faraz H wrote:
>>>
>>> Thanks, I am attaching the output of two runs: one with the gcc4.9 
>>> executable and the other with the MKL libraries and gcc4.9. Interestingly, 
>>> the results are not always consistent when I run the model multiple times. 
>>> Sometimes the MKL one is faster by ~30 seconds overall, sometimes slower. 
>>> So perhaps something is going on with my system. Curious what you see.
>>>
>>