[CP2K-user] [CP2K:14269] Re: Hybrid functional calculation problem

Lucas Lodeiro eluni... at gmail.com
Sun Nov 22 22:07:48 UTC 2020
Previous message (by thread): [CP2K-user] [CP2K:14265] Re: Hybrid functional calculation problem
Next message (by thread): [CP2K-user] [CP2K:14269] Re: Hybrid functional calculation problem
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Fabian and Matt,

Regarding memory access: I have run calculations for months using 90% of
the node RAM without problems. To check, though, I set
ulimit -s unlimited. This changed the behaviour. Before setting ulimit, the
calculation crashed while RAM usage was still low (15%). After setting it,
the calculation still crashes, but RAM usage now rises steadily to the
limit before the crash. I attach an image.
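
To see which limits a process actually inherits, something like the following could be run in the job environment. It is only a sketch: it uses Python's standard `resource` module (Unix only), and the helper names `describe` and `current_limits` are made up for illustration. Note that `ulimit -s` controls the stack size (RLIMIT_STACK), while `ulimit -v` controls the address space (RLIMIT_AS); the values returned here are in bytes.

```python
import resource


def describe(limit):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def current_limits():
    """Return (soft, hard) limits for stack (ulimit -s) and address space (ulimit -v)."""
    names = {
        "stack": resource.RLIMIT_STACK,        # what `ulimit -s` changes
        "address_space": resource.RLIMIT_AS,   # what `ulimit -v` changes
    }
    return {
        key: tuple(describe(value) for value in resource.getrlimit(rlimit))
        for key, rlimit in names.items()
    }


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

Running this inside the batch script, just before the CP2K command, would show whether the limits set in the login shell also apply on the compute nodes.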

Regarding SCREEN_ON_INITIAL_P: I will use it on the small cluster. I like
the idea of running two calculations as successive steps.

I know that the number of ERIs calculated on the fly should be 0, and that
if it is different from zero I need more RAM to store them, so that they
are not recalculated at each SCF step. But on the small cluster I am
already using all processors and all the RAM. By the way, the calculation
runs without problems when the ERIs are calculated on the fly at each SCF
step, it is just very slow.

Regarding Matt's comments: on the small cluster I have a single node with
250 GB RAM. There I use MAX_MEMORY = 2600, which gives a total of 166.4 GB
for the ERIs (2.6 GB * 64 proc; the output reports 143 GB), leaving the
rest for the remainder of the program.
On the big cluster we have access to many nodes with 44 proc and 192 GB
RAM, and to 9 nodes with 44 proc and 768 GB RAM. In the first case I use
5 nodes (220 proc) and all their memory (960 GB), setting MAX_MEMORY = 4000
(4.0 GB * 220 proc = 880 GB RAM for ERIs). In the second case I use 5
nodes (220 proc) and all their memory (3840 GB), setting MAX_MEMORY = 15000
(15.0 GB * 220 proc = 3300 GB RAM for ERIs).
In both cases the calculation crashes... Perhaps I am being naive, but
3.3 TB of RAM seems, at least, enough to store a large share of the ERIs...
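
The arithmetic for the three setups can be checked with a short sketch. It assumes one MPI rank per processor, MAX_MEMORY given in MB per rank, and 1 GB = 1000 MB, as in the figures quoted here; the function name `eri_budget` is hypothetical and not part of CP2K.

```python
def eri_budget(max_memory_mb, ranks_per_node, nodes, node_ram_gb):
    """Compare the ERI memory requested via MAX_MEMORY with the installed RAM.

    Assumes one MPI rank per processor and 1 GB = 1000 MB.
    """
    eri_per_node_gb = max_memory_mb * ranks_per_node / 1000.0
    return {
        "eri_total_gb": eri_per_node_gb * nodes,
        "eri_per_node_gb": eri_per_node_gb,
        # What remains on each node for the OS and the rest of the program.
        "left_per_node_gb": node_ram_gb - eri_per_node_gb,
    }


small = eri_budget(2600, ranks_per_node=64, nodes=1, node_ram_gb=250)
big = eri_budget(4000, ranks_per_node=44, nodes=5, node_ram_gb=192)
large = eri_budget(15000, ranks_per_node=44, nodes=5, node_ram_gb=768)

for label, budget in (("small", small), ("big", big), ("largemem", large)):
    print(label, budget)
```

Under these assumptions, the 192 GB nodes have the smallest remainder per node once the ERI allocation is subtracted, so that is the first case to compare against Matt's point about leaving room for the operating system and the rest of the run.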

Using the data reported in the output from the small cluster:
  HFX_MEM_INFO| Number of sph. ERI's calculated:                   4879985997918
  HFX_MEM_INFO| Number of sph. ERI's stored in-core:                116452577779
  HFX_MEM_INFO| Number of sph. ERI's stored on disk:                           0
  HFX_MEM_INFO| Number of sph. ERI's calculated on the fly:        4763533420139

The stored ERIs are about 1/42 of the total and use 166.4 GB (143 GB
reported)... So to store all of them I would need 166.4 GB * 42 =
~7.0 TB... Is that correct?
I can get roughly that much RAM (9 * 768 GB = 6.9 TB) using the 9 nodes
with 768 GB RAM each. But I am not convinced that the amount of RAM is the
problem, because on the small cluster the job runs while recalculating
almost all ERIs at each SCF step...
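
The estimate can be reproduced from the HFX_MEM_INFO counters. This is a sketch that assumes the integrals not yet stored would compress about as well as those already in core, which the output does not guarantee; the function name `extrapolate_full_storage` is made up for illustration.

```python
# Counters copied from the HFX_MEM_INFO output of the small cluster.
calculated = 4_879_985_997_918   # sph. ERIs calculated
in_core = 116_452_577_779        # sph. ERIs stored in-core
on_the_fly = 4_763_533_420_139   # sph. ERIs calculated on the fly
reported_mib = 143_042           # Total memory consumption ERI's RAM [MiB]
allocated_gb = 166.4             # MAX_MEMORY 2600 MB * 64 ranks


def extrapolate_full_storage(calculated, in_core, used):
    """Scale the memory used for the in-core ERIs up to all calculated ERIs.

    Linear extrapolation: assumes the same compression for every integral.
    Returns the ratio total/stored and the scaled memory in the unit of `used`.
    """
    ratio = calculated / in_core
    return ratio, used * ratio


ratio, full_from_allocated_gb = extrapolate_full_storage(calculated, in_core, allocated_gb)
_, full_from_reported_mib = extrapolate_full_storage(calculated, in_core, reported_mib)

print(f"total / stored ratio: {ratio:.2f}")
print(f"from allocated memory: {full_from_allocated_gb / 1000:.2f} TB")
print(f"from reported memory:  {full_from_reported_mib / 1024**2:.2f} TiB")
```

The two estimates differ because one starts from the memory allocated through MAX_MEMORY and the other from the memory the output reports as actually used for the ERIs.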

I am a little surprised that the calculation runs on the small cluster but
not on the big one.
Can you think of any other related problem?

Regards - Lucas



On Sun, 22 Nov 2020 at 13:55, Matt W (<mattwa... at gmail.com>)
wrote:

> Your input has
>
>         &MEMORY
>           MAX_MEMORY           4000
>           EPS_STORAGE_SCALING  0.1
>         &END MEMORY
>
> This means that each MPI task (which can be multiple cores) should be able
> to allocate about 4 GiB of memory _exclusively_ for the 2 electron integrals.  If
> there is less than that available it will crash as the memory allocation
> can't occur. I guess your main cluster has less memory than the smaller
> one. You need to leave space for the operating system and the rest of the
> cp2k run besides the 2 electron integrals.
>
> There is another thread from earlier this year where Juerg answers HFX
> memory questions in more detail.
>
> Matt
>
> On Sunday, November 22, 2020 at 4:42:47 PM UTC fa... at gmail.com wrote:
>
>> Can cp2k access all the memory on the cluster? On linux you can use
>> ulimit -s unlimited
>> to remove the limit on the stack size of a process.
>>
>> I usually use SCREEN_ON_INITIAL_P. I found that for large systems it is
>> faster to run two energy minimizations with the key word enabled (such that
>> the second restarts from a converged PBE0 wfn) than running a single
>> minimization without SCREEN_ON_INITIAL_P. But that probably depends on the
>> system you simulate.
>>
>> You should converge the cutoff with respect to the properties that you
>> are interested in. Run a test system with increasing cutoff and look at,
>> e.g. the energy, pdos, etc.
>>
>> Number of sph. ERI's calculated on the fly:        4763533420139
>> This number should always be 0. If it is larger, increase the memory cp2k
>> has available.
>>
>> Fabian
>> On Sunday, 22 November 2020 at 17:24:13 UTC+1 Lucas Lodeiro wrote:
>>
>>> Dear Fabian,
>>>
>>> Thanks for your advice. I forgot to mention the execution time... my
>>> mistake.
>>> The calculation runs for 5 or 7 minutes and stops... the walltime for
>>> the calculation was set to 72 hrs, so I do not believe this is the
>>> problem. Now I am running the same input on a smaller cluster (different
>>> from the one where it crashes) with 64 proc and 250 GB RAM, and the
>>> calculation works fine (very slowly, 9 hr per SCF step, but it runs...
>>> the total RAM assigned to the ERIs is not sufficient, but the crash does
>>> not appear)...
>>> It is not practical to use this small cluster, so I need to fix the
>>> problem on the big one, to use more RAM and more processors (more than
>>> 220 is possible), but as the program does not show what is happening, I
>>> cannot tell the cluster admin what to recompile or fix.
>>> :(
>>>
>>> This is the output in the little cluster:
>>>
>>>   Step     Update method      Time    Convergence         Total energy    Change
>>>   ------------------------------------------------------------------------------
>>>
>>>   HFX_MEM_INFO| Est. max. program size before HFX [MiB]:                    1371
>>>
>>>  *** WARNING in hfx_energy_potential.F:605 :: The Kohn Sham matrix is not  ***
>>>  *** 100% occupied. This may result in incorrect Hartree-Fock results. Try ***
>>>  *** to decrease EPS_PGF_ORB and EPS_FILTER_MATRIX in the QS section. For  ***
>>>  *** more information see FAQ: https://www.cp2k.org/faq:hfx_eps_warning    ***
>>>
>>>   HFX_MEM_INFO| Number of cart. primitive ERI's calculated:       27043173676632
>>>   HFX_MEM_INFO| Number of sph. ERI's calculated:                   4879985997918
>>>   HFX_MEM_INFO| Number of sph. ERI's stored in-core:                116452577779
>>>   HFX_MEM_INFO| Number of sph. ERI's stored on disk:                           0
>>>   HFX_MEM_INFO| Number of sph. ERI's calculated on the fly:        4763533420139
>>>   HFX_MEM_INFO| Total memory consumption ERI's RAM [MiB]:                 143042
>>>   HFX_MEM_INFO| Whereof max-vals [MiB]:                                     1380
>>>   HFX_MEM_INFO| Total compression factor ERI's RAM:                         6.21
>>>   HFX_MEM_INFO| Total memory consumption ERI's disk [MiB]:                     0
>>>   HFX_MEM_INFO| Total compression factor ERI's disk:                        0.00
>>>   HFX_MEM_INFO| Size of density/Fock matrix [MiB]:                           266
>>>   HFX_MEM_INFO| Size of buffers [MiB]:                                        98
>>>   HFX_MEM_INFO| Number of periodic image cells considered:                     5
>>>   HFX_MEM_INFO| Est. max. program size after HFX  [MiB]:                    3834
>>>
>>>      1 NoMix/Diag. 0.40E+00 ******     5.46488333    -20625.2826573514 -2.06E+04
>>>
>>> Regarding SCREEN_ON_INITIAL_P: I read that to use it you need a very
>>> good guess (better than the converged GGA one), for example the last step
>>> or frame from a GEO_OPT or MD... Is it really useful when the guess is the
>>> GGA wavefunction?
>>> Regarding CUTOFF_RADIUS: I read that 6 or 7 is a good compromise, and
>>> as my cell is approximately twice that, I used the minimum image
>>> convention to choose 8.62, which is near the recommended value (6 or 7).
>>> If I reduce it, does the computational cost decrease considerably?
>>>
>>> Regards - Lucas
>>>
>>> On Sun, 22 Nov 2020 at 12:53, fa... at gmail.com (<fa... at gmail.com>)
>>> wrote:
>>>
>>>> Dear Lucas,
>>>>
>>>> cp2k computes the four-center integrals during (or prior to) the
>>>> first SCF cycle. I assume the job ran out of time during this task. For a
>>>> system with more than 1000 atoms this takes a lot of time. With only 220
>>>> CPUs this could in fact take several hours.
>>>>
>>>> To speed up the calculation you should use SCREEN_ON_INITIAL_P T and
>>>> restart using a well converged PBE wfn. Other than that, there is little
>>>> you can do other than assign the job more time and/or CPUs. (Of course,
>>>> reducing CUTOFF_RADIUS 8.62 would help too but could negatively
>>>> affect the result).
>>>>
>>>> Cheers,
>>>> Fabian
>>>>
>>>> On Sunday, 22 November 2020 at 01:21:05 UTC+1 Lucas Lodeiro wrote:
>>>>
>>>>> Hi all,
>>>>> I need to perform a hybrid calculation with CP2K 7.1 on a big system
>>>>> (+1000 atoms). I studied the manual, the tutorials and some videos by CP2K
>>>>> developers to improve my input. But the program exits the calculation when
>>>>> the HF part is running... I watched the memory usage on the fly, and there
>>>>> is no peak which explains the failure (I used 4000 MB with 220 processors).
>>>>> The output does not give any explanation... Suspecting the memory, I
>>>>> tried a largemem node at our cluster, using 15000 MB with 220 processors,
>>>>> but the program exits at the same point without a message, just killing
>>>>> the process.
>>>>> The output shows a warning:
>>>>>
>>>>>  *** WARNING in hfx_energy_potential.F:591 :: The Kohn Sham matrix is not  ***
>>>>>  *** 100% occupied. This may result in incorrect Hartree-Fock results. Try ***
>>>>>  *** to decrease EPS_PGF_ORB and EPS_FILTER_MATRIX in the QS section. For  ***
>>>>>  *** more information see FAQ: https://www.cp2k.org/faq:hfx_eps_warning    ***
>>>>>
>>>>> but I read that this is not a serious issue, and that the calculation
>>>>> should continue and not crash.
>>>>> I also decreased EPS_PGF_ORB, but the warning and the problem
>>>>> persist.
>>>>>
>>>>> I do not know if the problem could be located in other parts of my
>>>>> input... for example I use PBE0-T_C-LR (I use PBC for XY) and ADMM. In
>>>>> the ADMM options I use ADMM_PURIFICATION_METHOD = NONE, because I read
>>>>> that ADMM1 is the only one useful for smearing calculations.
>>>>>
>>>>> I run this system with PBE (for the first guess of PBE0), and there is
>>>>> no problem in that case.
>>>>> Moreover, I tried other CP2K versions (7.0, 6.1 and 5.1) compiled
>>>>> on the cluster with (libint_max_am=6), and the calculation crashes, but
>>>>> shows this problem:
>>>>>
>>>>>
>>>>>  *******************************************************************************
>>>>>  *   ___                                                                       *
>>>>>  *  /   \                                                                      *
>>>>>  * [ABORT]                                                                     *
>>>>>  *  \___/       CP2K and libint were compiled with different LIBINT_MAX_AM.    *
>>>>>  *    |                                                                        *
>>>>>  *  O/|                                                                        *
>>>>>  * /| |                                                                        *
>>>>>  * / \                                                hfx_libint_wrapper.F:134 *
>>>>>  *******************************************************************************
>>>>>
>>>>>
>>>>>  ===== Routine Calling Stack =====
>>>>>
>>>>>             2 hfx_create
>>>>>             1 CP2K
>>>>>
>>>>> It seems like this problem is not present in the 7.1 version, as the
>>>>> program does not show it, and the compilation information does not
>>>>> show LIBINT_MAX_AM value...
>>>>>
>>>>> If somebody could give me some advice, I will appreciate it. :)
>>>>> I attach the input file, and the output file for 7.1 version.
>>>>>
>>>>> Regards - Lucas Lodeiro
>>>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graph.png
Type: image/png
Size: 29564 bytes
Desc: not available
URL: <https://lists.cp2k.org/archives/cp2k-user/attachments/20201122/78d2ab5f/attachment.png>