[CP2K-user] [CP2K:11136] Re: van der Waals regtests fail on Intel KNL, and build glitches

Alfio Lazzaro alfio.... at gmail.com
Tue Jan 8 18:05:27 UTC 2019


OK, let's narrow it down further: could you try with a single MPI rank and a
single thread? Have you tried -O0?
The timings reflect the computation done by CP2K. I would assume that
the number of SCF steps and/or the density matrices differ between the
cases (a consequence of some numerical problem). In particular,
there are many more calls in the PSMP runs, and that is why the execution is
much slower. Could you confirm that?

We really need a common baseline first, and then we can work out what is
wrong with PSMP...
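
As a rough way to put the argon06 numbers quoted below on a common footing, the sketch here computes each run's deviation from the regtest reference. The energies and the reference are the ones reported in the thread; the tolerance is a hypothetical value chosen only for illustration (it is not the regtest suite's actual threshold), and the OMP=1 energy, which appears without a sign in the report, is assumed to be negative.

```python
# Reference value for argon06 as quoted in the thread.
REFERENCE = -85.19494678034609

# Energies reported in the thread (64 MPI ranks unless noted).
# Assumption: the OMP=1 value is printed without a sign in the report;
# it is taken as negative here.
REPORTED = {
    "POPT": -85.1949606240,
    "PSMP OMP=1": -85.1906685165,
    "PSMP OMP=2": -85.1907567561,
    "PSMP OMP=4": -85.1915871949,
    "PSMP MPI=4 OMP=4": -85.18890965,
}

# Hypothetical tolerance, for illustration only.
TOLERANCE = 1.0e-4


def deviations(reported, reference):
    """Absolute deviation of each reported energy from the reference."""
    return {label: abs(energy - reference) for label, energy in reported.items()}


def outside_tolerance(reported, reference, tol):
    """Labels of the runs whose deviation exceeds the tolerance."""
    devs = deviations(reported, reference)
    return {label for label, dev in devs.items() if dev > tol}


if __name__ == "__main__":
    for label, dev in deviations(REPORTED, REFERENCE).items():
        flag = "ok" if dev <= TOLERANCE else "DIFFERS"
        print(f"{label:18s} |dE| = {dev:.3e}  {flag}")
```

With these inputs the POPT run differs from the reference in the fifth decimal place, while every PSMP run differs in the third, which is the gap a common baseline would need to explain.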



On Tuesday, January 8, 2019 at 18:58:19 UTC+1, Ronald Cohen wrote:
>
> OK, the results for argon06 are rather surprising. There is no crash in
> the batch environment, as opposed to the regtest environment. However,
> the psmp answers are wrong, even with OMP=1. The timings are quite
> strange too!
>
> MPI: 64
>
> Intel POPT gives
> Total Energy               =       -85.1949606240 time 30.98
> regtest -85.19494678034609
>
> Intel PSMP OMP=1 gives
> 85.1906685165 time 434.5
> Intel PSMP OMP=2 gives
> -85.1907567561 time 436.9
> Intel PSMP OMP=4 gives
> -85.1915871949  time 8.2
>
> Intel PSMP MPI 4 OMP=4 gives
> -85.18890965 4.5 seconds 18:33:12.002 18:33:04.127
>  
> ---
> Ron Cohen
> rec... at gmail.com
> skypename: ronaldcohen
> twitter: @recohen3
>
>
>
>
> On Jan 8, 2019, at 11:07 AM, Alfio Lazzaro <alfi... at gmail.com> wrote:
>
> OK, let's focus on this test then.
> The message is not really useful. Could you try a single thread? I think 
> 18.0.5 should be fine, but I would suggest starting with an -O0 run. Somehow 
> it should run. Then we can use the output as a reference.
>
> Alfio
>
>
>
>
> On Tuesday, January 8, 2019 at 10:12:58 UTC+1, Ronald Cohen wrote:
>>
>> Thank you so much. I don’t have 18.03 installed. I was also having 
>> problems with earlier versions, but did not document them so carefully
>> and did not run the regtests. When I try my own job (not the regtest) with 
>> non-local vdW, PSMP just never converges and it segfaults in the stress 
>> calculation.
>>
>>
>>
>> Here is the end of the argon07 test:
>>
>>  Leaving inner SCF loop after reaching     2 steps.
>>
>>
>>   Electronic density on regular grids:        -32.0000000000        0.0000000000
>>   Core density on regular grids:               31.9999999977       -0.0000000023
>>   Total charge density on r-space grids:       -0.0000000023
>>   Total charge density g-space grids:          -0.0000000023
>>
>>   Overlap energy of the core charge distribution:               0.00000000000000
>>   Self energy of the core charge distribution:               -180.54066673528200
>>   Core Hamiltonian energy:                                     42.11893140752033
>>   Hartree energy:                                              68.47313966072379
>>   Exchange-correlation energy:                                -15.35702018709375
>>   Dispersion energy:                                            0.25474393253353
>>
>>   Total energy:                                               -85.05087192159809
>>
>>  *** WARNING in qs_scf.F:542 :: SCF run NOT converged ***
>>
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> forrtl: severe (174): SIGSEGV, segmentation fault occurred
>> EXIT CODE:  174  MEANING:  RUNTIME FAIL
>> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>>
>>
>>
>> ---
>> Ron Cohen
>> rec... at gmail.com
>> skypename: ronaldcohen
>> twitter: @recohen3
>>
>>
>>
>>
>> On Jan 8, 2019, at 9:13 AM, Alfio Lazzaro <alfi... at gmail.com> wrote:
>>
>> Hi Ron,
>> Could you share one of the FAILED logs? For instance the log of 
>> argon07.inp. The regtest script prints the last part of the log at the end 
>> of its execution... My suspicion is that these failing tests are dying 
>> because of a numerical assert in CP2K, so they can be included in the WRONG 
>> category. Now, you are saying that the problem comes from the PSMP build, so 
>> my first (very conservative) try would be to use a single thread and see if 
>> it works. Note that CP2K tests with 2 threads (while you are using 4). Another 
>> possibility would be to avoid AVX512 vectorization (CP2K doesn't test it 
>> yet). Also, I have just realized that CP2K doesn't test 18.0.5 for PSMP, or 
>> 19.x at all (see the CP2K tests at https://dashboard.cp2k.org/ ). So, my 
>> suggestion is to reproduce what is already tested by CP2K. A good 
>> starting point is this test:
>>  
>>
>> https://www.cp2k.org/static/regtest/trunk/swan-skl28/CRAY-XC40-intel-mkl.psmp_18.0.3.222.out
>>
>> Alfio
>>
>>
>> On Monday, January 7, 2019 at 20:34:23 UTC+1, Ronald Cohen wrote:
>>>
>>> I also built with -fp-model precise, but it did not help. The values are 
>>> very wrong, not slightly. Ron
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 7, 2019, at 20:13, Anton Kudelin <arch... at gmail.com> wrote:
>>>
>>> Could you add "-fp-model precise" to CFLAGS and FCFLAGS? It won't fix 
>>> 'RUNTIME FAIL', but could help with 'WRONG RESULT'.
>>>
>>> On Monday, January 7, 2019 at 9:06:28 PM UTC+3, Ronald Cohen wrote:
>>>>
>>>> So I tried:
>>>>
>>>> export KMP_STACKSIZE=512M
>>>> rcohen at tomcat3:~/CP2K/cp2k$ ./tools/regtesting/do_regtest -arch 
>>>> Linux-x86-64-intel -version psmp -restrictdir QS/regtest-dft-vdw-corr-1/ 
>>>> -restrictdir QS/regtest-dft-vdw-corr-2/ -restrictdir 
>>>> QS/regtest-dft-vdw-corr-3/ -restrictdir QS/regtest-dft-vdw-corr-3/ -nobuild 
>>>> -mpiranks 4 -ompthreads 4 -maxtasks 16 |& tee testwith512MKMP_STACKSIZE.out 
>>>> &
>>>> and I still get:
>>>>
>>>> < 
>>>> /home/rcohen/CP2K/cp2k/TEST-Linux-x86-64-intel-psmp-2019-01-07_18-24-16/tests/QS/regtest-dft-vdw-corr-3 
>>>> (1 of 3) done in 775.00 sec
>>>> >>> 
>>>> /home/rcohen/CP2K/cp2k/TEST-Linux-x86-64-intel-psmp-2019-01-07_18-24-16/tests/QS/regtest-dft-vdw-corr-3
>>>>     argon05.inp       -85.02462435591488  WRONG RESULT TEST 1
>>>>     argon06.inp       -85.18989253445228  WRONG RESULT TEST 1
>>>>     argon07.inp       -85.05087192159809         RUNTIME FAIL
>>>>     argon08.inp       -85.05201740647929         RUNTIME FAIL
>>>>     argon09.inp       -85.05086520280044         RUNTIME FAIL
>>>>     argon10.inp       -85.05070440200512         RUNTIME FAIL
>>>>     argon11.inp       -84.69892988333885         RUNTIME FAIL
>>>>     argon12.inp       -84.69900817368848         RUNTIME FAIL
>>>>     argon13.inp       -84.81306482759408  WRONG RESULT TEST 1
>>>>     argon14.inp       -84.69889654472566  WRONG RESULT TEST 1
>>>>     argon-beef.inp    -42.46311172518392  WRONG RESULT TEST 1
>>>>     dftd3bj_t1.inp     -0.00355123783846     OK (   1.19 sec)
>>>>     dftd3bj_t2.inp     -0.05897356220363     OK (   2.20 sec)
>>>>     dftd3bj_t3.inp     -0.00112424003807     OK (   3.75 sec)
>>>>     dftd3bj_t4.inp    -84.2983390350         OK (   3.86 sec)
>>>> <<< 
>>>> /home/rcohen/CP2K/cp2k/TEST-Linux-x86-64-intel-psmp-2019-01-07_18-24-16/tests/QS/regtest-dft-vdw-corr-3 
>>>> (1 of 3) done in 775.00 sec
>>>> Starting regression tests in 
>>>> /home/rcohen/CP2K/cp2k/TEST-Linux-x86-64-intel-psmp-2019-01-07_18-24-16/tests/QS/regtest-dft-vdw-corr-2 
>>>> (2 of 3)
>>>>
>>>>
>>>> Almost all of the non-vdW tests pass.
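
To see at a glance how the failures split between the WRONG RESULT and RUNTIME FAIL categories, the regtest summary lines can be tallied by status. This is a minimal sketch that assumes each test is reported on a single line of the form "name.inp  value  status" (the wrapping in the quoted output comes from the mail client); the regular expression and the function name `tally` are illustrative and not part of the do_regtest script.

```python
import re
from collections import Counter

# Assumed single-line layout: "<name>.inp  <value>  <status> ...".
SUMMARY_LINE = re.compile(
    r"^\s*(?P<name>\S+\.inp)\s+(?P<value>-?\d+\.\d+)\s+"
    r"(?P<status>OK|WRONG RESULT|RUNTIME FAIL)"
)


def tally(lines):
    """Count regtest summary lines per status; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = SUMMARY_LINE.match(line)
        if match:
            counts[match.group("status")] += 1
    return counts


# Three lines taken from the output quoted in this thread, joined onto one line each.
SAMPLE = """\
    argon05.inp      -85.02462435591488  WRONG RESULT TEST 1
    argon07.inp      -85.05087192159809         RUNTIME FAIL
    dftd3bj_t1.inp   -0.00355123783846     OK (   1.19 sec)
Starting regression tests in some/other/directory
""".splitlines()

if __name__ == "__main__":
    for status, count in sorted(tally(SAMPLE).items()):
        print(f"{status:14s} {count}")
```

Running the same tally over the whole regtest-dft-vdw-corr-3 block makes it easy to compare the category counts between builds or thread settings.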
>>>>
>>>> Sincerely,
>>>>
>>>> Ron
>>>>
>>>> ---
>>>> Ron Cohen
>>>> rec... at gmail.com
>>>> skypename: ronaldcohen
>>>> twitter: @recohen3
>>>>
>>>>
>>>>
>>>>
>>>> On Jan 7, 2019, at 6:12 PM, Robert Schade <robe... at uni-paderborn.de> 
>>>> wrote:
>>>>
>>>> Could you try setting KMP_STACKSIZE to something large in the terminal
>>>> session with "export KMP_STACKSIZE=512m" before you rerun the regtests
>>>> with your intel-psmp-binary that failed before?
>>>> Please also make sure that the general stack size is not the problem
>>>> by running "ulimit -s unlimited" in the same terminal where you want to
>>>> execute the regtests.
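
Both settings suggested above can be checked from the same session before launching the regtests. The following is a minimal sketch, assuming a Unix-like system where Python's `resource` module is available; the helper name `preflight_warnings` is hypothetical, and the check only reports whether KMP_STACKSIZE is set and whether the soft stack limit is unlimited, without judging whether a given size is sufficient.

```python
import os
import resource


def preflight_warnings(env, stack_soft_limit):
    """Return warnings about the two stack settings discussed in the thread."""
    warnings = []
    if "KMP_STACKSIZE" not in env:
        warnings.append("KMP_STACKSIZE is not set")
    if stack_soft_limit != resource.RLIM_INFINITY:
        warnings.append(
            f"stack soft limit is {stack_soft_limit} bytes, not unlimited"
        )
    return warnings


if __name__ == "__main__":
    # RLIMIT_STACK is the limit that "ulimit -s" reports and changes.
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    for warning in preflight_warnings(os.environ, soft):
        print("warning:", warning)
```

The limits are inherited by child processes, so the check is only meaningful when run in the terminal session that will start the regtests.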
>>>> Best Wishes
>>>> Robert
>>>>
>>>> On 07.01.19 18:00, Ronald Cohen wrote:
>>>> > BTW, in case it was not clear. My Intel builds of POPT and PSMP
>>>> > versions were error free. The problems were all run time.
>>>> >
>>>> > Ron
>>>> >
>>>> > ---
>>>> > Ron Cohen
>>>> > rec... at gmail.com
>>>> > skypename: ronaldcohen
>>>> > twitter: @recohen3
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >> On Jan 7, 2019, at 5:39 PM, Robert Schade
>>>> >> <robe... at uni-paderborn.de> wrote:
>>>> >>
>>>> >> r is automatically private because it is the first iteration
>>>> >> variable. Every drho(s, i) is only read and written in exactly
>>>> >> one loop iteration. The statement "COLLAPSE(3)" collapses the
>>>> >> three perfectly nested loops into one loop. So, IMHO, this code
>>>> >> looks ok.
>>>> >> Best Wishes
>>>> >> Robert
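
The claim that every drho(s, i) is touched by exactly one iteration can be checked independently of any compiler: the flattened index s = r*n(2)*n(1) + q*n(1) + p + 1 from the quoted loop should hit each value from 1 to n(1)*n(2)*n(3) exactly once. The sketch below verifies that for a small grid; the function names and grid sizes are illustrative assumptions. It only shows that no two iterations write the same element (so no reduction is needed); it says nothing about whether a particular compiler generates correct code for the loop.

```python
from itertools import product


def flat_index(p, q, r, n):
    """1-based flattened index from the quoted loop: s = r*n(2)*n(1) + q*n(1) + p + 1."""
    n1, n2, _n3 = n
    return r * n2 * n1 + q * n1 + p + 1


def indices_are_unique(n):
    """True if the (p, q, r) -> s mapping covers 1..n1*n2*n3 with no repeats."""
    n1, n2, n3 = n
    seen = [
        flat_index(p, q, r, n)
        for r, q, p in product(range(n3), range(n2), range(n1))
    ]
    return sorted(seen) == list(range(1, n1 * n2 * n3 + 1))


if __name__ == "__main__":
    # Hypothetical small grid; real grids are far larger.
    print(indices_are_unique((3, 4, 5)))
```

Because the mapping is one-to-one, each iteration of the collapsed loop updates its own element of drho, which is the property the explanation above relies on.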
>>>> >>
>>>> >>
>>>> >> On 07.01.19 14:52, Ronald Cohen wrote:
>>>> >>> Yes, I agree. I have tried the 2018.05 and the 2019.1 Intel
>>>> >>> compilers. The POPT version runs fine, but the PSMP version
>>>> >>> fails in the vdW routines. I find things like this in
>>>> >>> qs_dispersion_nonloc.F:
>>>> >>>
>>>> >>> !$OMP PARALLEL DO DEFAULT(NONE)                      &
>>>> >>> !$OMP             SHARED(ispin,i,n,lo,drho,drho_r)   &
>>>> >>> !$OMP             PRIVATE(s)                         &
>>>> >>> !$OMP             COLLAPSE(3)
>>>> >>>       DO r = 0, n(3)-1
>>>> >>>          DO q = 0, n(2)-1
>>>> >>>             DO p = 0, n(1)-1
>>>> >>>                s = r*n(2)*n(1)+q*n(1)+p+1
>>>> >>>                drho(s, i) = drho(s, i)+drho_r(i, ispin)%pw%cr3d(p+lo(1), q+lo(2), r+lo(3))
>>>> >>>             END DO
>>>> >>>          END DO
>>>> >>>       END DO
>>>> >>> !$OMP END PARALLEL DO
>>>> >>>    END DO
>>>> >>> END DO
>>>> >>>
>>>> >>> Doesn’t this have to be marked as a reduction? And shouldn’t r,
>>>> >>> q, p be labeled private? Perhaps this is automatic, but I do
>>>> >>> not see that said anywhere. Does gnu treat such differently
>>>> >>> than intel? Just ideas.
>>>> >>>
>>>> >>> I am currently trying the toolchain, but it is building
>>>> >>> everything from scratch, including blas, lapack, scalapack etc
>>>> >>> etc, so will take days.
>>>> >>>
>>>> >>> Thank you for your help,
>>>> >>>
>>>> >>> Sincerely,
>>>> >>>
>>>> >>> Ron
>>>> >>>
>>>> >>> ---
>>>> >>> Ron Cohen
>>>> >>> rec... at gmail.com
>>>> >>> skypename: ronaldcohen
>>>> >>> twitter: @recohen3
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>> On Jan 7, 2019, at 2:16 PM, Robert Schade
>>>> >>>> <robe... at uni-paderborn.de> wrote:
>>>> >>>>
>>>> >>>> Building cp2k on Intel Xeon Phi Knights Landing (KNL, not to
>>>> >>>> be confused with KNC!) is not different from building it on
>>>> >>>> any other Intel CPU. Hence, I think that the failing regtests
>>>> >>>> point to an underlying issue. Which exact version of the
>>>> >>>> Intel Compiler and MKL have you tried?
>>>> >>>> Best Wishes
>>>> >>>> Robert
>>>> >>>>
>>>> >>>> On 06.01.19 01:59, Ronald Cohen wrote:
>>>> >>>>> OK, sorry for all the noise. I am trying:
>>>> >>>>> ./install_cp2k_toolchain.sh --with-elpa=install --with-libint=install --with-gcc=install
>>>> >>>>> I hate not being able to use my intel tools which work for me
>>>> >>>>> for everything else just fine.
>>>> >>>>>
>>>> >>>>> Ron
>>>> >>>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> Robert Schade
>>>> >>>> Paderborn Center for Parallel Computing (PC2)
>>>> >>>> University of Paderborn
>>>> >>>> Warburger Str. 100
>>>> >>>> D-33098 Paderborn
>>>> >>>> Germany
>>>> >>>> robe... at uni-paderborn.de
>>>> >>>> +49/(0)5251/60-5393
>>>> >>>>
>>>> >>>> --
>>>> >>>> You received this message because you are subscribed to a topic in the
>>>> >>>> Google Groups "cp2k" group.
>>>> >>>> To unsubscribe from this topic, visit
>>>> >>>> https://groups.google.com/d/topic/cp2k/gzmRqKNt62U/unsubscribe.
>>>> >>>> To unsubscribe from this group and all its topics, send an email to
>>>> >>>> cp2k+... at googlegroups.com.
>>>> >>>> To post to this group, send email to cp... at googlegroups.com.
>>>> >>>> Visit this group at https://groups.google.com/group/cp2k.
>>>> >>>> For more options, visit https://groups.google.com/d/optout.
>>>> >>>
>>>> >>
>>>> >> --
>>>> >> Robert Schade
>>>> >> Paderborn Center for Parallel Computing (PC2)
>>>> >> University of Paderborn
>>>> >> Warburger Str. 100
>>>> >> D-33098 Paderborn
>>>> >> Germany
>>>> >> robe... at uni-paderborn.de
>>>> >> +49/(0)5251/60-5393
>>>> >>
>>>> >
>>>>
>>>> --
>>>> Robert Schade
>>>> Paderborn Center for Parallel Computing (PC2)
>>>> University of Paderborn
>>>> Warburger Str. 100
>>>> D-33098 Paderborn
>>>> Germany
>>>> robe... at uni-paderborn.de
>>>> +49/(0)5251/60-5393
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
>