[CP2K-user] [CP2K:14279] Re: Hybrid functional calculation problem

Lucas Lodeiro eluni... at gmail.com
Tue Nov 24 15:30:26 UTC 2020


Thanks Matt!

Now I am using the popt flavor of 7.1; I will ask for the psmp flavor of CP2K
to be compiled. Today I optimized MAX_MEMORY, and with 400 processes each SCF
step takes 30 minutes (calculating most of the ERIs on the fly), which is an
affordable time for the whole calculation.

Regards - Lucas Lodeiro

On Tue, Nov 24, 2020 at 8:28 AM, Matt W (<mattwa... at gmail.com>) wrote:

> I think there is an option to run mixed MPI/OpenMP. If you run the
> cp2k.psmp executable and give 2 or 4 threads per MPI process, you can give
> more memory per process for the integrals. If having to calculate ERIs on
> the fly is dominating the run time, that might be a good option.
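>
> A minimal launch sketch (a hypothetical job; the exact flags depend on
> your MPI and scheduler):
>
>   # e.g. 220 cores used as 55 MPI ranks x 4 OpenMP threads each,
>   # so each rank sees 4x the per-rank memory for the integrals
>   export OMP_NUM_THREADS=4
>   mpirun -np 55 cp2k.psmp -i system.inp -o system.out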
>
> Matt
>
> On Tuesday, November 24, 2020 at 5:53:51 AM UTC Lucas Lodeiro wrote:
>
>> Thanks for your advice!
>>
>> Now I can at least run it; it is very slow, but it runs. The difference
>> between the small and the big cluster was that on the small one the total
>> RAM consumption is practically MPI_PROCESSES * (baseline + MAX_MEMORY +
>> 2 full matrices), as Prof. Hutter explains, but on the big one there are
>> some system processes that consume 5-10% of each node's memory... so I had
>> to tune MAX_MEMORY with some tests...
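>>
>> As a rough sketch of that budget (my own illustrative numbers): with 64
>> ranks on a 250 GB node,
>>
>>   total RAM ~ 64 * (baseline + MAX_MEMORY + 2 * matrix_size)
>>
>> so MAX_MEMORY = 2600 MiB alone accounts for ~166 GB, and even a ~0.5 GB
>> baseline per rank brings the node near its limit once system processes
>> take their 5-10%.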
>>
>> About the ERIs, it is very hard to get 7 TB for them... I can get 4 TB
>> without problem, but reserving that whole cluster partition is difficult.
>> I am trying the SCREENING option to speed it up, computing some ERIs on
>> the fly.
>>
>> Regards - Lucas Lodeiro
>>
>>
>> On Mon, Nov 23, 2020 at 2:18 PM, fa... at gmail.com (<fa... at gmail.com>)
>> wrote:
>>
>>> Your graph nicely shows that CP2K runs out of memory. As Matt wrote, you
>>> have to decrease MAX_MEMORY to leave enough memory for the rest of the
>>> program. Here are some details on memory consumption with HF:
>>> https://groups.google.com/g/cp2k/c/DZDVTIORyVY/m/OGjJDJuqBwAJ
>>>
>>> Of course you can recalculate some of the ERIs in each SCF cycle, but
>>> that slows down the minimization by a lot; I'd advise against it. Try to
>>> use screening, set a proper value for MAX_MEMORY, and use all the
>>> resources you have to store the ERIs.
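>>>
>>> A minimal sketch of the relevant input (threshold and memory values are
>>> placeholders to adapt to your system):
>>>
>>>   &HF
>>>     &SCREENING
>>>       EPS_SCHWARZ 1.0E-6     ! Schwarz screening threshold for the ERIs
>>>     &END SCREENING
>>>     &MEMORY
>>>       MAX_MEMORY 3000        ! MiB per MPI process, exclusively for ERIs
>>>     &END MEMORY
>>>   &END HF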
>>>
>>> Fabian
>>> On Sunday, 22 November 2020 at 23:08:17 UTC+1 Lucas Lodeiro wrote:
>>>
>>>> Hi Fabian and Matt,
>>>>
>>>> About access to memory: I have run calculations for months using 90% of
>>>> the node RAM without problems. But to check, I set ulimit -s unlimited.
>>>> There is a change: before using ulimit, the calculation crashed while
>>>> RAM usage was quite low (15%); after using ulimit, the calculation still
>>>> crashes, but RAM usage shows a sustained rise up to the limit before the
>>>> crash. I attach an image.
>>>>
>>>> About SCREEN_ON_INITIAL_P, I will use it on the small cluster. I like
>>>> the idea of running two calculations as stepping stones.
>>>>
>>>> I know that the number of ERIs calculated on the fly should be 0, and
>>>> that if it is nonzero I need more RAM to store them so they are not
>>>> recalculated at each SCF step. But on the small cluster I am already
>>>> using all processor and RAM resources. By the way, the calculation runs
>>>> without problems when the ERIs are calculated on the fly at each SCF
>>>> step; it is just very slow.
>>>>
>>>> About what Matt commented: on the small cluster I have a single node
>>>> with 250 GB RAM, and I use MAX_MEMORY = 2600, which is a total of
>>>> 166.4 GB for the ERIs (the output reports 143 GB), with the rest left
>>>> for the whole program.
>>>> On the big cluster we have access to many nodes with 44 processors and
>>>> 192 GB RAM, and to 9 nodes with 44 processors and 768 GB RAM. In the
>>>> first case I use 5 nodes (220 processors) with all their memory
>>>> (960 GB), setting MAX_MEMORY = 4000 (4.0 GB * 220 processors = 880 GB
>>>> RAM for ERIs). In the second case I use 5 nodes (220 processors) with
>>>> all their memory (3840 GB), setting MAX_MEMORY = 15000 (15.0 GB * 220
>>>> processors = 3300 GB RAM for ERIs).
>>>> In both cases the calculation crashes... Maybe I am naive, but 3.3 TB
>>>> of RAM seems, at least, enough to store a large share of the ERIs...
>>>>
>>>> Using the data reported in the output of the small cluster:
>>>>
>>>>   HFX_MEM_INFO| Number of sph. ERI's calculated:            4879985997918
>>>>   HFX_MEM_INFO| Number of sph. ERI's stored in-core:          116452577779
>>>>   HFX_MEM_INFO| Number of sph. ERI's stored on disk:                     0
>>>>   HFX_MEM_INFO| Number of sph. ERI's calculated on the fly:  4763533420139
>>>>
>>>> The stored ERIs are 1/42 of the total, and they use 166.4 GB (143 GB
>>>> reported)... So if I want to store all of them, I need
>>>> 166.4 GB * 42 = ~7.0 TB... Is that correct?
>>>> I could get 7.0 TB of RAM using 9 nodes with 768 GB RAM each. But I am
>>>> not convinced that the amount of RAM is the problem, because on the
>>>> small cluster the calculation runs while computing almost all ERIs at
>>>> each SCF step...
>>>>
>>>> I am a little surprised that the calculation runs on the small cluster
>>>> but not on the big one.
>>>> Can you think of some other related problem?
>>>>
>>>> Regards - Lucas
>>>>
>>>>
>>>>
>>>> On Sun, Nov 22, 2020 at 1:55 PM, Matt W (<mat... at gmail.com>)
>>>> wrote:
>>>>
>>>>> Your input has
>>>>>
>>>>>         &MEMORY
>>>>>           MAX_MEMORY           4000
>>>>>           EPS_STORAGE_SCALING  0.1
>>>>>         &END MEMORY
>>>>>
>>>>> This means that each MPI task (which can span multiple cores) must be
>>>>> able to allocate 4000 MiB of memory _exclusively_ for the two-electron
>>>>> integrals. If less than that is available, the run will crash because
>>>>> the allocation cannot occur. I guess your main cluster has less memory
>>>>> per task than the smaller one. You need to leave space for the
>>>>> operating system and the rest of the CP2K run besides the two-electron
>>>>> integrals.
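>>>>>
>>>>> A rough budgeting sketch (my own illustrative numbers): per node,
>>>>>
>>>>>   MAX_MEMORY ~ (node_RAM - OS_and_rest_of_CP2K) / tasks_per_node
>>>>>
>>>>> e.g. on a 192 GB node with 44 tasks, reserving ~30 GB for the operating
>>>>> system and the rest of CP2K leaves (192 - 30) / 44 ~ 3.7 GB per task,
>>>>> i.e. MAX_MEMORY around 3700 rather than 4000.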
>>>>>
>>>>> There is another thread from earlier this year where Juerg answers HFX
>>>>> memory questions in more detail.
>>>>>
>>>>> Matt
>>>>>
>>>>> On Sunday, November 22, 2020 at 4:42:47 PM UTC fa... at gmail.com
>>>>> wrote:
>>>>>
>>>>>> Can CP2K access all the memory on the cluster? On Linux you can use
>>>>>> ulimit -s unlimited
>>>>>> to remove the limit on the stack size, a common cause of crashes
>>>>>> (strictly, -s affects only the stack; ulimit -v controls the total
>>>>>> virtual memory a process may use).
>>>>>>
>>>>>> I usually use SCREEN_ON_INITIAL_P. I found that for large systems it
>>>>>> is faster to run two energy minimizations with this keyword enabled
>>>>>> (such that the second restarts from a converged PBE0 wfn) than to run
>>>>>> a single minimization without SCREEN_ON_INITIAL_P. But that probably
>>>>>> depends on the system you simulate.
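>>>>>>
>>>>>> A minimal sketch of the second (restarted) run, assuming the wfn file
>>>>>> of the first run is named stage1-RESTART.wfn (a placeholder):
>>>>>>
>>>>>>   &DFT
>>>>>>     WFN_RESTART_FILE_NAME stage1-RESTART.wfn
>>>>>>     &SCF
>>>>>>       SCF_GUESS RESTART
>>>>>>     &END SCF
>>>>>>     &XC
>>>>>>       &HF
>>>>>>         &SCREENING
>>>>>>           SCREEN_ON_INITIAL_P TRUE   ! screen on the restarted density
>>>>>>         &END SCREENING
>>>>>>       &END HF
>>>>>>     &END XC
>>>>>>   &END DFT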
>>>>>>
>>>>>> You should converge the cutoff with respect to the properties you are
>>>>>> interested in. Run a test system with increasing cutoff and look at,
>>>>>> e.g., the energy, the PDOS, etc.
>>>>>>
>>>>>> Number of sph. ERI's calculated on the fly:        4763533420139
>>>>>> This number should always be 0. If it is larger, increase the memory
>>>>>> CP2K has available.
>>>>>>
>>>>>> Fabian
>>>>>> On Sunday, 22 November 2020 at 17:24:13 UTC+1 Lucas Lodeiro wrote:
>>>>>>
>>>>>>> Dear Fabian,
>>>>>>>
>>>>>>> Thanks for your advice. I forgot to clarify the execution time... my
>>>>>>> mistake.
>>>>>>> The calculation runs for 5 to 7 minutes and then stops... the
>>>>>>> walltime was set to 72 hours, so I do not believe this is the
>>>>>>> problem. Now I am running the same input on a smaller cluster (a
>>>>>>> different machine from the problematic one) with 64 processors and
>>>>>>> 250 GB RAM, and the calculation works fine (very slow, 9 hours per
>>>>>>> SCF step, but it runs... the total RAM assigned for the ERIs is not
>>>>>>> sufficient, yet the problem does not appear)...
>>>>>>> It is not practical to use this small cluster, so I need to fix the
>>>>>>> problem on the big one, to use more RAM and more processors (more
>>>>>>> than 220 is possible). But since the program does not show what is
>>>>>>> happening, I cannot tell the cluster admin anything that would help
>>>>>>> to recompile or fix the problem. :(
>>>>>>>
>>>>>>> This is the output on the small cluster:
>>>>>>>
>>>>>>>   Step  Update method  Time  Convergence  Total energy  Change
>>>>>>>   ----------------------------------------------------------------
>>>>>>>
>>>>>>>   HFX_MEM_INFO| Est. max. program size before HFX [MiB]:            1371
>>>>>>>
>>>>>>>  *** WARNING in hfx_energy_potential.F:605 :: The Kohn Sham matrix is not  ***
>>>>>>>  *** 100% occupied. This may result in incorrect Hartree-Fock results. Try ***
>>>>>>>  *** to decrease EPS_PGF_ORB and EPS_FILTER_MATRIX in the QS section. For  ***
>>>>>>>  *** more information see FAQ: https://www.cp2k.org/faq:hfx_eps_warning    ***
>>>>>>>
>>>>>>>   HFX_MEM_INFO| Number of cart. primitive ERI's calculated:  27043173676632
>>>>>>>   HFX_MEM_INFO| Number of sph. ERI's calculated:              4879985997918
>>>>>>>   HFX_MEM_INFO| Number of sph. ERI's stored in-core:           116452577779
>>>>>>>   HFX_MEM_INFO| Number of sph. ERI's stored on disk:                      0
>>>>>>>   HFX_MEM_INFO| Number of sph. ERI's calculated on the fly:   4763533420139
>>>>>>>   HFX_MEM_INFO| Total memory consumption ERI's RAM [MiB]:            143042
>>>>>>>   HFX_MEM_INFO| Whereof max-vals [MiB]:                                 1380
>>>>>>>   HFX_MEM_INFO| Total compression factor ERI's RAM:                     6.21
>>>>>>>   HFX_MEM_INFO| Total memory consumption ERI's disk [MiB]:                 0
>>>>>>>   HFX_MEM_INFO| Total compression factor ERI's disk:                    0.00
>>>>>>>   HFX_MEM_INFO| Size of density/Fock matrix [MiB]:                       266
>>>>>>>   HFX_MEM_INFO| Size of buffers [MiB]:                                    98
>>>>>>>   HFX_MEM_INFO| Number of periodic image cells considered:                 5
>>>>>>>   HFX_MEM_INFO| Est. max. program size after HFX  [MiB]:                3834
>>>>>>>
>>>>>>>      1 NoMix/Diag. 0.40E+00 ******   5.46488333  -20625.2826573514 -2.06E+04
>>>>>>>
>>>>>>> About SCREEN_ON_INITIAL_P, I read that to use it you need a very
>>>>>>> good guess (better than the converged GGA one), for example the last
>>>>>>> step or frame of a GEO_OPT or MD... Is it really useful when the
>>>>>>> guess is the GGA wavefunction?
>>>>>>> About CUTOFF_RADIUS, I read that 6 or 7 is a good compromise, and as
>>>>>>> my cell is approximately twice that, I used the minimal image
>>>>>>> convention to arrive at 8.62, which is near the recommended value
>>>>>>> (6 or 7). If I reduce it, does the computational cost decrease
>>>>>>> considerably?
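>>>>>>>
>>>>>>> (For reference, my reading of the minimal image constraint: the
>>>>>>> truncated potential needs CUTOFF_RADIUS < L_min / 2, where L_min is
>>>>>>> the shortest periodic cell vector, so CUTOFF_RADIUS = 8.62 A requires
>>>>>>> a cell of at least 2 * 8.62 = 17.24 A in each periodic direction.)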
>>>>>>>
>>>>>>> Regards - Lucas
>>>>>>>
>>>>>>> On Sun, Nov 22, 2020 at 12:53 PM, fa... at gmail.com (<
>>>>>>> fa... at gmail.com>) wrote:
>>>>>>>
>>>>>>>> Dear Lucas,
>>>>>>>>
>>>>>>>> CP2K computes the four-center integrals during (or prior to) the
>>>>>>>> first SCF cycle. I assume the job ran out of time during this task.
>>>>>>>> For a system with more than 1000 atoms this takes a lot of time;
>>>>>>>> with only 220 CPUs it could in fact take several hours.
>>>>>>>>
>>>>>>>> To speed up the calculation you should use SCREEN_ON_INITIAL_P T and
>>>>>>>> restart from a well-converged PBE wfn. Other than that, there is
>>>>>>>> little you can do except assign the job more time and/or CPUs. (Of
>>>>>>>> course, reducing CUTOFF_RADIUS 8.62 would help too, but could
>>>>>>>> negatively affect the result.)
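>>>>>>>>
>>>>>>>> For concreteness, the truncation is set here (a sketch; the radius
>>>>>>>> is the one from your input, the data file path is a placeholder):
>>>>>>>>
>>>>>>>>   &HF
>>>>>>>>     &INTERACTION_POTENTIAL
>>>>>>>>       POTENTIAL_TYPE TRUNCATED
>>>>>>>>       CUTOFF_RADIUS 8.62      ! Angstrom; smaller = cheaper, less accurate
>>>>>>>>       T_C_G_DATA ./t_c_g.dat  ! truncated Coulomb data file
>>>>>>>>     &END INTERACTION_POTENTIAL
>>>>>>>>   &END HF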
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Fabian
>>>>>>>>
>>>>>>>> On Sunday, 22 November 2020 at 01:21:05 UTC+1 Lucas Lodeiro wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>> I need to perform a hybrid functional calculation with CP2K 7.1 on
>>>>>>>>> a big system (1000+ atoms). I studied the manual, the tutorials,
>>>>>>>>> and some videos by CP2K developers to improve my input. But the
>>>>>>>>> program exits the calculation while the HF part is running... I
>>>>>>>>> watched the memory usage live, and there is no peak that explains
>>>>>>>>> the failure (I used 4000 MB with 220 processors).
>>>>>>>>> The output gives no explanation... Suspecting memory, I tried a
>>>>>>>>> large-memory node on our cluster, using 15000 MB with 220
>>>>>>>>> processors, but the program exits at the same point without any
>>>>>>>>> message, just killing the process.
>>>>>>>>> The output shows a warning:
>>>>>>>>>
>>>>>>>>>  *** WARNING in hfx_energy_potential.F:591 :: The Kohn Sham matrix is not  ***
>>>>>>>>>  *** 100% occupied. This may result in incorrect Hartree-Fock results. Try ***
>>>>>>>>>  *** to decrease EPS_PGF_ORB and EPS_FILTER_MATRIX in the QS section. For  ***
>>>>>>>>>  *** more information see FAQ: https://www.cp2k.org/faq:hfx_eps_warning    ***
>>>>>>>>>
>>>>>>>>> but I read that this is not a very serious issue, and that the
>>>>>>>>> calculation should continue and not crash.
>>>>>>>>> I also decreased EPS_PGF_ORB, but the warning and the problem
>>>>>>>>> persist.
>>>>>>>>>
>>>>>>>>> I do not know whether the problem could be located in other parts
>>>>>>>>> of my input... for example, I use PBE0-TC-LR (with PBC in XY) and
>>>>>>>>> ADMM. In the ADMM options I use ADMM_PURIFICATION_METHOD = NONE,
>>>>>>>>> because I read that ADMM1 is the only variant useful for smearing
>>>>>>>>> calculations.
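>>>>>>>>>
>>>>>>>>> For reference, the ADMM block I mean looks like this (a sketch; the
>>>>>>>>> exchange correction functional is a placeholder):
>>>>>>>>>
>>>>>>>>>   &AUXILIARY_DENSITY_MATRIX_METHOD
>>>>>>>>>     METHOD BASIS_PROJECTION
>>>>>>>>>     ADMM_PURIFICATION_METHOD NONE   ! chosen here because of smearing
>>>>>>>>>     EXCH_CORRECTION_FUNC PBEX       ! placeholder GGA exchange correction
>>>>>>>>>   &END AUXILIARY_DENSITY_MATRIX_METHOD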
>>>>>>>>>
>>>>>>>>> I ran this system with PBE (for the first guess of PBE0), and there
>>>>>>>>> is no problem in that case.
>>>>>>>>> Moreover, I tried other CP2K versions (7.0, 6.1, and 5.1) compiled
>>>>>>>>> on the cluster with libint_max_am=6; the calculation crashes, but
>>>>>>>>> shows this problem:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  *******************************************************************************
>>>>>>>>>  *   ___                                                                       *
>>>>>>>>>  *  /   \                                                                      *
>>>>>>>>>  * [ABORT]    CP2K and libint were compiled with different LIBINT_MAX_AM.      *
>>>>>>>>>  *  \___/                                                                      *
>>>>>>>>>  *    |                                                                        *
>>>>>>>>>  *  O/|                                                                        *
>>>>>>>>>  * /| |                                                                        *
>>>>>>>>>  * / \                                               hfx_libint_wrapper.F:134 *
>>>>>>>>>  *******************************************************************************
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  ===== Routine Calling Stack =====
>>>>>>>>>
>>>>>>>>>             2 hfx_create
>>>>>>>>>             1 CP2K
>>>>>>>>>
>>>>>>>>> It seems this problem is not present in version 7.1, as the program
>>>>>>>>> does not show it, and the compilation information does not list a
>>>>>>>>> LIBINT_MAX_AM value...
>>>>>>>>>
>>>>>>>>> If somebody could give me some advice, I would appreciate it. :)
>>>>>>>>> I attach the input file and the output file for version 7.1.
>>>>>>>>>
>>>>>>>>> Regards - Lucas Lodeiro
>>>>>>>>>