[CP2K-user] [CP2K:14273] Re: Hybrid functional calculation problem

Matt W mattwa... at gmail.com
Tue Nov 24 11:28:32 UTC 2020


I think there is an option to run mixed MPI/OpenMP. If you run the 
cp2k.psmp executable with 2 or 4 OpenMP threads per MPI process, each 
process gets more memory for the integrals. If recalculating integrals 
on the fly is dominating the run time, that might be a good option.
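
A rough sketch of such a launch on your 220 cores (assuming mpirun and the 
usual cp2k.psmp command line; the file names are placeholders, and the 
rank/thread split should be adapted to your scheduler):

  # hypothetical launch: 110 MPI ranks x 2 OpenMP threads = 220 cores,
  # so each rank sees roughly twice the per-process memory of a pure MPI run
  export OMP_NUM_THREADS=2
  mpirun -np 110 cp2k.psmp -i input.inp -o output.out

With half as many ranks per node you can raise MAX_MEMORY per rank without 
exceeding the node RAM.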

Matt

On Tuesday, November 24, 2020 at 5:53:51 AM UTC Lucas Lodeiro wrote:

> Thanks for your advice!
>
> Now I can at least run it; it is very slow, but it runs. The difference 
> between the small and the big cluster is that on the small one the total RAM 
> consumption is practically MPI_PROCESSES*(baseline + MAX_MEMORY + 2 full 
> matrices), as Prof. Hutter explains, but on the big one there are cluster 
> processes that consume 5-10% of each node's memory... so I had to tune 
> MAX_MEMORY with some tests...
>
> About the ERIs, it is very hard to get 7 TB for them... I can take 4 TB 
> without problem, but reserving the whole cluster partition is difficult. I am 
> trying the SCREENING option to speed it up, computing some ERIs on the fly.
>
> Regards - Lucas Lodeiro
>
>
> On Mon, Nov 23, 2020 at 14:18, fa... at gmail.com (<fa... at gmail.com>) 
> wrote:
>
>> Your graph nicely shows that cp2k runs out of memory. As Matt wrote, you 
>> have to decrease MAX_MEMORY to leave enough memory for the rest of the 
>> program. Here are some details on memory consumption with HF: 
>> https://groups.google.com/g/cp2k/c/DZDVTIORyVY/m/OGjJDJuqBwAJ
>>
>> Of course you can recalculate some of the ERIs in each SCF cycle, but that 
>> slows down the minimization by a lot, and I'd advise against it. Try to 
>> use screening, set a proper value for MAX_MEMORY, and use all the resources 
>> you have to store the ERIs.
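>>
>> A minimal sketch of the relevant HF input block (the numbers are placeholders, 
>> not recommendations; MAX_MEMORY times the number of ranks per node must stay 
>> well below the node RAM):
>>
>>   &XC
>>     &HF
>>       &MEMORY
>>         ! per-MPI-process budget for stored ERIs, in MiB
>>         MAX_MEMORY  3000
>>       &END MEMORY
>>       &SCREENING
>>         ! looser Schwarz screening discards more small ERIs, at some cost in accuracy
>>         EPS_SCHWARZ  1.0E-6
>>       &END SCREENING
>>     &END HF
>>   &END XC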
>>
>> Fabian
>> On Sunday, 22 November 2020 at 23:08:17 UTC+1 Lucas Lodeiro wrote:
>>
>>> Hi Fabian and Matt,
>>>
>>> About memory access: I have run calculations for months without problems, 
>>> using 90% of the node RAM. To check, I set ulimit -s unlimited. There is a 
>>> change: before using ulimit, the calculation crashed while RAM usage stayed 
>>> very low (15%); after using ulimit, the calculation still crashes, but RAM 
>>> usage rises steadily up to the limit and then the calculation crashes. I 
>>> attach an image.
>>>
>>> About SCREEN_ON_INITIAL_P, I will use it on the small cluster. I like the 
>>> idea of running two calculations as climbing steps.
>>>
>>> I know that the number of ERIs calculated on the fly should be 0, and if 
>>> it is non-zero I need more RAM to store them so they are not recalculated at 
>>> every SCF step. But on the small cluster I am already using all processor and 
>>> RAM resources. By the way, the calculation runs without problems when the 
>>> ERIs are calculated on the fly at each SCF step; it is just very slow.
>>>
>>> About what Matt comments: on the small cluster I have a single node with 
>>> 250 GB RAM, so I use MAX_MEMORY = 2600, which is a total of 166.4 GB for the 
>>> ERIs (the output reports 143 GB), with the rest left for the whole program.
>>> On the big cluster we have access to many nodes with 44 processors and 
>>> 192 GB RAM, and to 9 nodes with 44 processors and 768 GB RAM. In the first 
>>> case I use 5 nodes (220 processors) with all the memory (960 GB), setting 
>>> MAX_MEMORY = 4000 (4.0 GB * 220 processors = 880 GB RAM for ERIs). In the 
>>> second case I use 5 nodes (220 processors) with all the memory (3840 GB), 
>>> setting MAX_MEMORY = 15000 (15.0 GB * 220 processors = 3300 GB RAM for ERIs).
>>> In both cases the calculation crashes... Maybe I am too credulous, but 
>>> 3.3 TB of RAM seems at least enough to store a large fraction of the ERIs...
>>>
>>> Using the data reported in the output of the small cluster:
>>>   HFX_MEM_INFO| Number of sph. ERI's calculated:              4879985997918
>>>   HFX_MEM_INFO| Number of sph. ERI's stored in-core:           116452577779
>>>   HFX_MEM_INFO| Number of sph. ERI's stored on disk:                      0
>>>   HFX_MEM_INFO| Number of sph. ERI's calculated on the fly:   4763533420139
>>>
>>> The stored ERIs are about 1/42 of the total ERIs and use 166.4 GB (143 GB 
>>> reported)... So if I want to store all of them I need 166.4 GB * 42 = 
>>> ~7.0 TB... Is that correct?
>>> I could get 7.0 TB of RAM using 9 nodes with 768 GB RAM each. But I am not 
>>> convinced that the amount of RAM is the problem, because on the small 
>>> cluster the job runs while calculating almost all ERIs at each SCF step...
>>>
>>> I am a little surprised that the calculation runs on the small cluster but 
>>> not on the big one.
>>> Can you think of some other related problem?
>>>
>>> Regards - Lucas
>>>
>>>
>>>
>>> On Sun, Nov 22, 2020 at 13:55, Matt W (<mat... at gmail.com>) 
>>> wrote:
>>>
>>>> Your input has
>>>>
>>>>         &MEMORY
>>>>           MAX_MEMORY           4000
>>>>           EPS_STORAGE_SCALING  0.1
>>>>         &END MEMORY
>>>>
>>>> This means that each MPI task (which can span multiple cores) should be 
>>>> able to allocate 4 GiB of memory _exclusively_ for the two-electron 
>>>> integrals. If less than that is available, the run crashes because the 
>>>> allocation cannot be made. I guess your main cluster has less memory per 
>>>> task than the smaller one. You need to leave space for the operating system 
>>>> and the rest of the cp2k run besides the two-electron integrals.
>>>>
>>>> There is another thread from earlier this year where Juerg answers HFX 
>>>> memory questions in more detail.
>>>>
>>>> Matt
>>>>
>>>> On Sunday, November 22, 2020 at 4:42:47 PM UTC fa... at gmail.com 
>>>> wrote:
>>>>
>>>>> Can cp2k access all the memory on the cluster? On Linux you can use 
>>>>> ulimit -s unlimited
>>>>> to remove the stack size limit for a process.
>>>>>
>>>>> I usually use SCREEN_ON_INITIAL_P. I found that for large systems it is 
>>>>> faster to run two energy minimizations with the keyword enabled (such that 
>>>>> the second restarts from a converged PBE0 wfn) than to run a single 
>>>>> minimization without SCREEN_ON_INITIAL_P. But that probably depends on the 
>>>>> system you simulate.
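>>>>>
>>>>> A rough sketch of the second run's input (file name and section placement 
>>>>> are placeholders; the wfn file is whatever the first run actually wrote):
>>>>>
>>>>>   &SCF
>>>>>     SCF_GUESS  RESTART                            ! start from the first run's wfn
>>>>>     WFN_RESTART_FILE_NAME  first_run-RESTART.wfn  ! placeholder name
>>>>>   &END SCF
>>>>>
>>>>>   ! ... and inside &XC / &HF:
>>>>>   &SCREENING
>>>>>     SCREEN_ON_INITIAL_P  TRUE   ! screen ERIs using that initial density matrix
>>>>>   &END SCREENING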
>>>>>
>>>>> You should converge the cutoff with respect to the properties that you 
>>>>> are interested in. Run a test system with increasing cutoff and look at, 
>>>>> e.g. the energy, pdos, etc.
>>>>>
>>>>> Number of sph. ERI's calculated on the fly:        4763533420139 
>>>>> This number should always be 0. If it is larger, increase the memory 
>>>>> cp2k has available.
>>>>>
>>>>> Fabian
>>>>> On Sunday, 22 November 2020 at 17:24:13 UTC+1 Lucas Lodeiro wrote:
>>>>>
>>>>>> Dear Fabian,
>>>>>>
>>>>>> Thanks for your advice. I forgot to clarify the execution time... my 
>>>>>> mistake.
>>>>>> The calculation runs for 5 to 7 minutes and then stops... The walltime 
>>>>>> was set to 72 hours, so I do not believe that is the problem. I am now 
>>>>>> running the same input on a smaller cluster (different from the one with 
>>>>>> the problematic crash) with 64 processors and 250 GB RAM, and the 
>>>>>> calculation works fine (very slow, about 9 hours per SCF step, but it 
>>>>>> runs... the total RAM assigned for the ERIs is not sufficient, but the 
>>>>>> problem does not appear)...
>>>>>> It is not practical to use this small cluster, so I need to fix the 
>>>>>> problem on the big one in order to use more RAM and more processors (more 
>>>>>> than 220 is possible), but as the program does not show what is happening, 
>>>>>> I cannot tell the cluster admin anything to recompile or fix. :(
>>>>>>
>>>>>> This is the output on the small cluster:
>>>>>>
>>>>>>   Step     Update method      Time    Convergence         Total energy    Change
>>>>>>   ------------------------------------------------------------------------------
>>>>>>
>>>>>>   HFX_MEM_INFO| Est. max. program size before HFX [MiB]:                  1371
>>>>>>
>>>>>>  *** WARNING in hfx_energy_potential.F:605 :: The Kohn Sham matrix is not  ***
>>>>>>  *** 100% occupied. This may result in incorrect Hartree-Fock results. Try ***
>>>>>>  *** to decrease EPS_PGF_ORB and EPS_FILTER_MATRIX in the QS section. For  ***
>>>>>>  *** more information see FAQ: https://www.cp2k.org/faq:hfx_eps_warning    ***
>>>>>>
>>>>>>   HFX_MEM_INFO| Number of cart. primitive ERI's calculated:      27043173676632
>>>>>>   HFX_MEM_INFO| Number of sph. ERI's calculated:                  4879985997918
>>>>>>   HFX_MEM_INFO| Number of sph. ERI's stored in-core:               116452577779
>>>>>>   HFX_MEM_INFO| Number of sph. ERI's stored on disk:                          0
>>>>>>   HFX_MEM_INFO| Number of sph. ERI's calculated on the fly:       4763533420139
>>>>>>   HFX_MEM_INFO| Total memory consumption ERI's RAM [MiB]:                143042
>>>>>>   HFX_MEM_INFO| Whereof max-vals [MiB]:                                    1380
>>>>>>   HFX_MEM_INFO| Total compression factor ERI's RAM:                        6.21
>>>>>>   HFX_MEM_INFO| Total memory consumption ERI's disk [MiB]:                    0
>>>>>>   HFX_MEM_INFO| Total compression factor ERI's disk:                       0.00
>>>>>>   HFX_MEM_INFO| Size of density/Fock matrix [MiB]:                          266
>>>>>>   HFX_MEM_INFO| Size of buffers [MiB]:                                        98
>>>>>>   HFX_MEM_INFO| Number of periodic image cells considered:                     5
>>>>>>   HFX_MEM_INFO| Est. max. program size after HFX  [MiB]:                    3834
>>>>>>
>>>>>>      1 NoMix/Diag. 0.40E+00 ******     5.46488333    -20625.2826573514 -2.06E+04
>>>>>>
>>>>>> About SCREEN_ON_INITIAL_P, I read that to use it you need a very good 
>>>>>> guess (better than the converged GGA one), for example the last step or 
>>>>>> frame from a GEO_OPT or MD... Is it really useful when the guess is the 
>>>>>> GGA wavefunction?
>>>>>> About CUTOFF_RADIUS, I read that 6 or 7 is a good compromise, and since 
>>>>>> my cell is approximately twice that, I used the minimum image convention 
>>>>>> to arrive at 8.62, which is close to the recommended value (6 or 7). If I 
>>>>>> reduce it, does the computational cost decrease considerably?
>>>>>>
>>>>>> Regards - Lucas
>>>>>>
>>>>>> On Sun, Nov 22, 2020 at 12:53, fa... at gmail.com (<
>>>>>> fa... at gmail.com>) wrote:
>>>>>>
>>>>>>> Dear Lucas,
>>>>>>>
>>>>>>> cp2k computes the four-center integrals during (or prior to) the first 
>>>>>>> SCF cycle. I assume the job ran out of time during this task. For a 
>>>>>>> system with more than 1000 atoms this takes a lot of time; with only 220 
>>>>>>> CPUs it could in fact take several hours.
>>>>>>>
>>>>>>> To speed up the calculation you should use SCREEN_ON_INITIAL_P T and 
>>>>>>> restart from a well-converged PBE wfn. Other than that, there is little 
>>>>>>> you can do besides giving the job more time and/or CPUs. (Of course, 
>>>>>>> reducing CUTOFF_RADIUS 8.62 would help too, but it could negatively 
>>>>>>> affect the result.)
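>>>>>>>
>>>>>>> For reference, a sketch of where that cutoff lives in the input (8.62 is 
>>>>>>> your current value; for a truncated potential it should stay below half 
>>>>>>> the shortest cell vector):
>>>>>>>
>>>>>>>   &INTERACTION_POTENTIAL
>>>>>>>     POTENTIAL_TYPE  TRUNCATED
>>>>>>>     CUTOFF_RADIUS   8.62       ! Angstrom; smaller radius = fewer ERIs, lower accuracy
>>>>>>>     T_C_G_DATA      t_c_g.dat  ! truncated-Coulomb data file, path as needed
>>>>>>>   &END INTERACTION_POTENTIAL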
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Fabian
>>>>>>>
>>>>>>> On Sunday, 22 November 2020 at 01:21:05 UTC+1 Lucas Lodeiro wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> I need to perform a hybrid-functional calculation with CP2K 7.1 on a 
>>>>>>>> big system (1000+ atoms). I studied the manual, the tutorials, and some 
>>>>>>>> videos by CP2K developers to improve my input, but the program exits the 
>>>>>>>> calculation while the HF part is running... I watched the memory usage 
>>>>>>>> live, and there is no peak that explains the failure (I used 4000 MB 
>>>>>>>> with 220 processors).
>>>>>>>> The output gives no explanation... Suspecting memory, I tried a 
>>>>>>>> large-memory node of our cluster, using 15000 MB with 220 processors, 
>>>>>>>> but the program exits at the same point without any message, just 
>>>>>>>> killing the process.
>>>>>>>> The output shows a warning:
>>>>>>>>
>>>>>>>>  *** WARNING in hfx_energy_potential.F:591 :: The Kohn Sham matrix is not  ***
>>>>>>>>  *** 100% occupied. This may result in incorrect Hartree-Fock results. Try ***
>>>>>>>>  *** to decrease EPS_PGF_ORB and EPS_FILTER_MATRIX in the QS section. For  ***
>>>>>>>>  *** more information see FAQ: https://www.cp2k.org/faq:hfx_eps_warning    ***
>>>>>>>>
>>>>>>>> but I read that this is not a serious issue and that the calculation 
>>>>>>>> should continue rather than crash.
>>>>>>>> I also decreased EPS_PGF_ORB, but the warning and the problem persist.
>>>>>>>>
>>>>>>>> I do not know whether the problem could lie in other parts of my 
>>>>>>>> input... For example, I use PBE0-T_C-LR (with PBC in XY) and ADMM. In 
>>>>>>>> the ADMM options, I use ADMM_PURIFICATION_METHOD = NONE, because I read 
>>>>>>>> that ADMM1 is the only variant usable for smearing calculations.
>>>>>>>>
>>>>>>>> I ran this system with PBE (for the first guess of PBE0), and there 
>>>>>>>> is no problem in that case.
>>>>>>>> Moreover, I tried other CP2K versions (7.0, 6.1, and 5.1) compiled on 
>>>>>>>> the cluster with libint_max_am=6, and the calculation crashes, but it 
>>>>>>>> shows this message:
>>>>>>>>
>>>>>>>>
>>>>>>>>  *******************************************************************************
>>>>>>>>  *   ___                                                                       *
>>>>>>>>  *  /   \                                                                      *
>>>>>>>>  * [ABORT]                                                                     *
>>>>>>>>  *  \___/       CP2K and libint were compiled with different LIBINT_MAX_AM.    *
>>>>>>>>  *    |                                                                        *
>>>>>>>>  *  O/|                                                                        *
>>>>>>>>  * /| |                                                                        *
>>>>>>>>  * / \                                                hfx_libint_wrapper.F:134 *
>>>>>>>>  *******************************************************************************
>>>>>>>>
>>>>>>>>
>>>>>>>>  ===== Routine Calling Stack ===== 
>>>>>>>>
>>>>>>>>             2 hfx_create
>>>>>>>>             1 CP2K
>>>>>>>>
>>>>>>>> It seems this problem is not present in version 7.1, since the program 
>>>>>>>> does not report it, and the compilation information does not show the 
>>>>>>>> LIBINT_MAX_AM value...
>>>>>>>>
>>>>>>>> If somebody could give me some advice, I would appreciate it. :)
>>>>>>>> I attach the input file and the output file for version 7.1.
>>>>>>>>
>>>>>>>> Regards - Lucas Lodeiro
>>>>>>>>