No speedup using Intel MKL libraries?

Alfio Lazzaro alfio.... at
Tue Nov 7 15:10:08 UTC 2017

Well, the sequential version of MKL is in some sense the safest choice, since 
CP2K may call MKL functions from within OpenMP threads. I must say 
that this was an old solution adopted for safety; nowadays MKL is able to 
dynamically change the number of threads (see

), so it should be fine to use the threaded version.
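
A minimal Python sketch of how one might compare runs at different thread counts with a threaded MKL build. OMP_NUM_THREADS, MKL_NUM_THREADS and MKL_DYNAMIC are standard OpenMP/MKL environment variables; the helper names thread_env and timed_run, and the command line in the comment, are hypothetical and not taken from this thread.

```python
import os
import subprocess
import time


def thread_env(num_threads, mkl_dynamic=True, base_env=None):
    """Return a copy of the environment with OpenMP/MKL thread settings."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)   # threads for OpenMP regions
    env["MKL_NUM_THREADS"] = str(num_threads)   # upper bound for MKL threads
    # With MKL_DYNAMIC enabled, MKL may use fewer threads than the bound,
    # for example when it is called from inside an OpenMP parallel region.
    env["MKL_DYNAMIC"] = "TRUE" if mkl_dynamic else "FALSE"
    return env


def timed_run(command, num_threads):
    """Run `command` (a list of arguments) and return wall time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, env=thread_env(num_threads), check=True)
    return time.perf_counter() - start


# Hypothetical usage (executable and input names are assumptions):
#   for n in (1, 4, 28):
#       print(n, timed_run(["./cp2k.ssmp", "-i", "H2O-128.inp"], n))
```

If the MKL calls are really running on one thread, the gap between the MKL and non-MKL builds should shrink as the thread count goes down.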


On Tuesday, 7 November 2017 at 15:40:27 UTC+1, Faraz H wrote:
> Thanks, Alfio, for looking deeper into it! I ran more tests and you are 
> indeed correct, i.e. the MKL functions are running serially (on one CPU), while 
> the rest of the code is using all CPUs of the machine. In the local.ssmp 
> makefile I saw that libmkl_sequential.a is linked, so I changed it to 
> libmkl_intel_thread.a and libiomp5.a. Now the H2O-128 benchmark runs in 5 
> minutes on a 28-CPU machine, compared to 7 minutes for the non-MKL-linked 
> executable. 
> Is it a bug that the toolchain links the serial MKL when 
> creating the local.ssmp makefiles? In what situation would someone want the 
> sequential MKL libraries linked instead of the parallel ones for ssmp?
> On Monday, November 6, 2017 at 7:11:53 AM UTC-5, Alfio Lazzaro wrote:
>> Dear Faraz,
>> OK, this is the comparison of the two runs for the functions where I see the 
>> highest timing discrepancy (time in seconds, second column w/ MKL, third 
>> column w/o MKL):
>> Function                          w/ MKL   w/o MKL
>> dbcsr_make_untransposed_blocks     4.139     1.591
>> cp_fm_gemm                         5.691     1.087
>> setup_rec_index_2d                 6.330     1.741
>> cp_fm_cholesky_decompose          11.539     1.703
>> cp_fm_cholesky_invert             26.048     3.031
>> Personally, I don't understand the differences in the 1st and 3rd 
>> lines; likely they were fluctuations.
>> The other lines are MKL-related (DGEMM and 
>> Cholesky decomposition). My suspicion is that you are using MKL in 
>> sequential mode, while OpenBLAS is using threads. A way to test this is to 
>> run with a single thread (or fewer threads in general): the difference 
>> should become smaller. I would also suggest using the PSMP version.
>> Alfio
>> On Thursday, 2 November 2017 at 15:33:13 UTC+1, Faraz H wrote:
>>> Thanks, I am attaching the output of two runs: one with the gcc4.9 
>>> executable and the other with the MKL libraries and gcc4.9. Interestingly, the 
>>> results are not always consistent when I run the model multiple times. 
>>> Sometimes the MKL one is faster by ~30 seconds overall, sometimes slower. 
>>> So perhaps something is going on on my system. Curious what you see.
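
To make the timing table quoted earlier in this thread easier to read, the short Python sketch below computes a slowdown factor (time with MKL divided by time without MKL) for each routine. The numbers are copied from that table; the names TIMINGS, slowdown_factors and report are illustrative only.

```python
# Timings in seconds copied from the table in the thread:
# routine -> (with MKL, without MKL)
TIMINGS = {
    "dbcsr_make_untransposed_blocks": (4.139, 1.591),
    "cp_fm_gemm": (5.691, 1.087),
    "setup_rec_index_2d": (6.330, 1.741),
    "cp_fm_cholesky_decompose": (11.539, 1.703),
    "cp_fm_cholesky_invert": (26.048, 3.031),
}


def slowdown_factors(timings):
    """Return {routine: time_with_mkl / time_without_mkl}."""
    return {name: with_mkl / without_mkl
            for name, (with_mkl, without_mkl) in timings.items()}


def report(timings):
    """Print routines sorted by slowdown factor, largest first."""
    factors = slowdown_factors(timings)
    for name, factor in sorted(factors.items(),
                               key=lambda item: item[1], reverse=True):
        print(f"{name:32s} {factor:5.2f}x")
```

Calling report(TIMINGS) lists the two Cholesky routines first, at roughly 8.6x and 6.8x, followed by cp_fm_gemm at about 5.2x. This fits the suspicion that the MKL calls were running on a single thread.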