[CP2K-user] [CP2K:19917] Re: FFTW_PLAN_TYPE PATIENT and EXHAUSTIVE not working

Léon Luntadila Lufungula Leon.luntadilalufungula at uantwerpen.be
Fri Feb 16 10:57:40 UTC 2024


Dear Frederick,

1) Thanks for the detailed explanation regarding the CP2K timings! I have 
optimized the parameters you mention. For my static DFT calculations I 
use CUTOFF=800, REL_CUTOFF=40, EPS_DEFAULT=1.0E-12 and EPS_SCF=1.0E-7, 
which gives a difference of 1E-5 a.u. w.r.t. a reference calculation with 
CUTOFF=4000 and REL_CUTOFF=80. For my AIMD calculations I also tested the 
conservation of energy during an NVE run and got excellent 
conservation for CUTOFF=400, REL_CUTOFF=40, EPS_DEFAULT=1.0E-10 and 
EPS_SCF=1.0E-7. Tightening EPS_DEFAULT to 1.0E-12 didn't seem to change 
much except for the time required per MD step (7.5 s vs. 8.2 s). 
[image: def10.png][image: def12.png]
Fortunately, I have already tried FFTW_PLAN_TYPE MEASURE a couple of days 
ago when posting my initial question (see timings below), but this didn't 
give much of an improvement (21820 steps vs. 21590 steps). 

 -------------------------------------------------------------------------------
 -                                                                             -
 -                                T I M I N G                                  -
 -                                                                             -
 -------------------------------------------------------------------------------
 SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0      0.0      0.0 259200.9 259202.7
 qs_mol_dyn_low                       1  2.0      0.5      0.6 259200.6 259202.4
 velocity_verlet                  21829  3.0      4.7      5.3 259116.7 259118.8
 qs_forces                        21830  4.0      6.7      7.9 259052.2 259054.9
 qs_energies                      21830  5.0      1.3      1.8 217472.4 217487.7
 scf_env_do_scf                   21830  6.0      0.7      1.0 198082.4 198085.3
 scf_env_do_scf_inner_loop       174656  7.0      3.0      3.7 147728.9 147733.7
 rebuild_ks_matrix               196486  8.7      0.6      0.7 112527.8 112729.8
 qs_ks_build_kohn_sham_matrix    196486  9.7     21.8     25.5 112527.2 112729.1
 qs_ks_update_qs_env             196486  8.0      1.4      1.7  85161.6  85342.5
 sum_up_and_integrate            196486 10.7    160.0    209.2  54201.8  54293.4
*pw_transfer                    4148036 12.9    263.2    304.5  52714.9  54167.5*
 integrate_v_rspace              196486 11.7      7.0     10.2  54041.3  54133.1
*fft_wrap_pw1pw2                3755064 14.0     42.4     47.1  52224.1  53695.2*
*fft_wrap_pw1pw2_200            2183176 15.4   5174.7   5626.2  49517.7  51266.7*
 init_scf_loop                    21830  7.0      0.6      0.7  50261.2  50262.9
 qs_vxc_create                   196486 10.7      4.7      5.6  48643.3  49120.6
 xc_vxc_pw_create                196486 11.7   1000.6   1625.3  48638.7  49116.1
 qs_rho_update_rho_low           196486  8.1      0.8      0.9  44051.5  44126.2
 calculate_rho_elec              196486  9.1    110.3    185.9  44050.7  44125.4
 dbcsr_multiply_generic         3012817 12.6    132.8    140.6  42612.9  43191.8
 fft3d_ps                       3755064 16.0   9552.3  14496.2  41234.7  42520.7
 prepare_preconditioner           21830  8.0      0.1      0.2  38987.3  39009.8
 grid_integrate_task_list        196486 12.7  34732.0  37662.0  34732.0  37662.0
 make_preconditioner              21830  9.0      0.4      0.4  36853.2  36863.4
 xc_pw_derive                   1178916 13.7      9.7     11.2  34115.8  36015.3
 qs_scf_new_mos                  174656  8.0      0.9      1.3  34716.3  34849.9
 qs_scf_loop_do_ot               174656  9.0      0.8      1.0  34715.4  34849.1
 make_full_all                    21830 10.0      4.9      6.4  34635.3  34645.4
 ot_scf_mini                     174656 10.0      4.0      4.6  31757.0  31879.2
 multiply_cannon                3012817 13.6    193.3    227.5  29505.4  31072.7
 mp_alltoall_z22v               3755064 18.0  27816.9  30306.8  27816.9  30306.8
 multiply_cannon_loop           3012817 14.6    217.0    235.4  27788.4  29096.5
 xc_rho_set_and_dset_create      196486 12.7    241.7    271.6  20750.7  27765.4
 qs_ks_update_qs_env_forces       21830  5.0      0.1      0.2  27457.9  27479.8
 xc_pw_divergence                196486 12.7      4.0      5.2  25840.6  27415.2
 ot_mini                         174656 11.0      1.0      1.3  23884.9  24039.5
 grid_collocate_task_list        196486 10.1  22773.0  23579.2  22773.0  23579.2
 rs_pw_transfer                 2008520 12.3     42.1     49.3  20604.4  21564.4
 yz_to_x                        1004260 17.7    746.8   1231.3  19488.5  21530.0
 density_rs2pw                   196486 10.1      8.2      9.2  19637.2  20954.1
 mp_waitall_1                   ******* -4.1  17488.2  19251.7  17488.2  19251.7
 multiply_cannon_multrec        ******* 15.6  17302.7  18334.2  17316.4  18348.7
 cp_fm_diag_elpa                  65490 11.3      0.4      0.5  18289.1  18296.1
 cp_fm_diag_elpa_base             65490 12.3  18094.9  18211.7  18274.5  18274.9
 xc_functional_eval              196486 13.7      2.0      2.9   8365.0  14324.5
 pbe_lda_eval                    196486 14.7   8363.0  14323.4   8363.0  14323.4
 mp_waitany                     ******* 14.1  12907.9  14030.4  12907.9  14030.4
 build_core_hamiltonian_matrix_   21830  5.0      0.7      0.9  10163.4  12916.0
 potential_pw2rs                 196486 12.7     11.0     12.1  12811.3  12862.1
 apply_preconditioner_dbcsr      196486 13.0      0.4      0.6  12274.1  12495.7
 apply_all                       196486 14.0     13.8     14.5  12273.7  12495.3
 qs_ot_get_derivative            174656 12.0      1.0      1.1  12107.0  12231.2
 ot_diis_step                    174656 12.0      6.1      7.4  11670.0  11670.4
 multiply_cannon_metrocomm3     ******* 15.6     90.4    100.6   7449.1  11410.0
 init_scf_run                     21830  6.0      0.3      0.4  10764.8  10766.3
 scf_env_initial_rho_setup        21830  7.0      0.2      0.2  10764.4  10765.8
 cp_fm_cholesky_reduce            21830 11.0  10738.1  10744.9  10738.1  10744.9
 wfi_extrapolate                  21830  8.0      1.4      1.6  10707.6  10707.8
 mp_alltoall_d11v               4671757 13.6   8744.4  10633.3   8744.4  10633.3
 x_to_yz                        1178916 17.2   1200.0   1689.5   9900.2  10462.4
 make_m2s                       6025634 13.6    115.7    153.5   9932.7  10333.9
 rs_pw_transfer_RS2PW_200        218316 11.9    900.4   1148.9   8610.3   9435.7
 make_images                    6025634 14.6    244.1    262.0   8865.5   9141.7
 mp_sum_d                       3821027 12.6   5682.4   7874.9   5682.4   7874.9
 qs_energies_init_hamiltonians    21830  6.0      0.4      0.5   7800.9   7801.2
 rs_gather_matrices              196486 12.7    112.7    159.0   6365.5   7362.5
 make_images_data               6025634 15.6     59.0     83.4   6337.1   7169.9
 hybrid_alltoall_any            6396752 16.4     83.7    763.9   5601.4   7058.6
 qs_ot_get_derivative_taylor     174656 13.0      1.9      2.1   6380.8   6497.4
 build_core_ppnl_forces           21830  6.0   5341.1   6203.0   5341.1   6203.0
 multiply_cannon_metrocomm1     ******* 15.6     48.1     52.4   2392.5   5507.1
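
As a rough cross-check of the self-time argument, the share of the total runtime spent inside the FFT routines themselves can be computed from a few rows of the two timing reports in this thread. This is only a minimal Python sketch, not a CP2K tool: the helper names parse_row and fft_self_share are hypothetical, the rows are copied by hand from the MEASURE and ESTIMATE reports, and it assumes each row has been unwrapped onto a single line with the columns name, calls, ASD, average/maximum self time, average/maximum total time.

```python
# Minimal sketch (not a CP2K tool): FFT self-time share of the total runtime.
# Rows are copied from the two timing reports in this thread, one row per line.

def parse_row(line):
    """Return (name, average self time, average total time) for one row."""
    fields = line.split()
    # Columns: name, calls, ASD, self avg, self max, total avg, total max.
    # The calls field may be '*******' (overflow), so it is never converted.
    return fields[0], float(fields[3]), float(fields[5])


def fft_self_share(rows):
    """Percentage of the CP2K total spent inside routines with 'fft' in the name."""
    parsed = [parse_row(row) for row in rows]
    total = next(tot for name, _, tot in parsed if name == "CP2K")
    fft_self = sum(self_avg for name, self_avg, _ in parsed if "fft" in name)
    return 100.0 * fft_self / total


# FFTW_PLAN_TYPE MEASURE run
MEASURE_ROWS = [
    "CP2K                      1  1.0    0.0     0.0 259200.9 259202.7",
    "fft_wrap_pw1pw2     3755064 14.0   42.4    47.1  52224.1  53695.2",
    "fft_wrap_pw1pw2_200 2183176 15.4 5174.7  5626.2  49517.7  51266.7",
    "fft3d_ps            3755064 16.0 9552.3 14496.2  41234.7  42520.7",
]

# FFTW_PLAN_TYPE ESTIMATE run (default)
ESTIMATE_ROWS = [
    "CP2K                      1  1.0     0.0     0.0 259198.3 259201.0",
    "fft_wrap_pw1pw2     3715581 14.0    41.8    46.1  52788.9  54160.2",
    "fft_wrap_pw1pw2_200 2160221 15.4  5123.8  5554.3  50067.4  51727.8",
    "fft3d_ps            3715581 16.0 10122.9 14717.9  41927.7  43177.5",
]

print(f"MEASURE : {fft_self_share(MEASURE_ROWS):.1f} % of total runtime")
print(f"ESTIMATE: {fft_self_share(ESTIMATE_ROWS):.1f} % of total runtime")
```

Both shares come out at roughly 6 % of the total runtime, which is the upper bound on what a better FFTW plan could save for this job; the remaining time attributed to fft_wrap_pw1pw2 sits in communication routines such as mp_alltoall_z22v.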

2) I just checked which version of FFTW is in the foss-2023a toolchain 
used by EasyBuild to compile CP2K 2023.2, and it is apparently version 
3.3.10 (see https://docs.easybuild.io/common-toolchains/). The 
other build on my HPC center is CP2K 7.1 compiled with 
the intel-2020a toolchain, where FFTW_PLAN_TYPE PATIENT does work, but I 
can't figure out which version of FFTW it uses (see attached output 
of the calculation). So I'm not sure whether what makes the other 
build work is the version of FFTW or the version of CP2K itself. I 
don't have experience with compiling CP2K either manually or through 
EasyBuild, so I wouldn't know how to test either hypothesis...
[image: Screenshot 2024-02-16 115704.png]

Kind regards,
Léon
On Friday 16 February 2024 at 10:49:52 UTC+1 Frederick Stein wrote:

> Dear Léon,
> Regarding 1:
> Did you optimize the accuracy-relevant parameters (CUTOFF, EPS_DEFAULT) to 
> meet your demands regarding energies and gradients? If not, there is a 
> how-to on this topic (https://www.cp2k.org/howto:converging_cutoff). 
> Choose the cutoff as high as necessary and as low as possible.
> TLDR: From your timing report, I recommend that you use the MEASURE mode; 
> see below for a lengthier explanation.
> If you have optimized these parameters, check the timing report. The total 
> CP2K runtime is given by the line with "CP2K" (ignoring warm-up times of 
> the cluster and of CP2K). Each line represents a routine within CP2K; from 
> the name you may guess its purpose/task. Especially relevant to you are the 
> last four columns. The last two columns indicate the average and maximum 
> time spent in a given routine and the routines it calls. They give you an 
> indication of which functionality takes the most time. The fourth- and 
> third-to-last columns indicate the average and maximum time spent in the 
> routine itself, i.e. without the routines called by the given routine. These 
> two columns indicate in which compute kernels CP2K spends most of its 
> time. If you check the routines with the largest self-time, you find that 
> the highest self-times of FFT-related routines, i.e. those containing 'fft' 
> in their names, belong to fft3d_ps (10122.9 s, 3.9 % of CP2K) and 
> fft_wrap_pw1pw2_200 (5123.8 s, 2.0 % of CP2K). Even if you were able to 
> reduce the timing of these routines to zero, you would gain ca. 6 % of 
> computation time; in practice it will be much less. So, I think that 
> further tuning of the FFT plans would not reduce the computational time 
> significantly.
> For completeness, the difference to the total runtime of the 
> fft_wrap_pw1pw2 routine of ~50000 s is related to the respective 
> communication routines, which do not have 'fft' in their names. These are 
> of course not affected by the FFTW plan type.
>
> Regarding 2:
> Checking the source code of the routine fft_get_scratch in 
> src/pw/fft_tools.F indicates that the issue might originate in a 
> lower-level routine within CP2K (probably src/fft/fftw3_lib.F) or in 
> FFTW3 directly. I did some local tests with one of the single-node tests 
> and did not observe any issues in these modes. So, the issue will need a 
> more thorough investigation.
>
> HTH
> Frederick
> On Friday, 16 February 2024 at 10:01:04 UTC+1, Léon Luntadila Lufungula 
> wrote:
>
>> Dear Frederick,
>>
>> Thanks for guiding me in solving this issue!
>>
>> 1) I have not yet checked if this is indeed the case but I'm quite new to 
>> AIMD calculations and the reference manual stated that this was recommended 
>> for long AIMD trajectories, so I assumed this would be a good option to 
>> enable. If you could show me how to check this, I would greatly appreciate 
>> the help! For the same calculation with the default FFTW_PLAN_TYPE ESTIMATE 
>> I get the following timings:
>>
>>
>>  -------------------------------------------------------------------------------
>>  -                                                                             -
>>  -                                T I M I N G                                  -
>>  -                                                                             -
>>  -------------------------------------------------------------------------------
>>  SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
>>                                 MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
>>  CP2K                                 1  1.0      0.0      0.0 259198.3 259201.0
>>  qs_mol_dyn_low                       1  2.0      0.5      0.6 259198.0 259200.7
>>  velocity_verlet                  21600  3.0      4.7      5.3 259117.8 259120.7
>>  qs_forces                        21601  4.0      6.7      8.3 259054.1 259057.9
>>  qs_energies                      21601  5.0      1.3      2.0 216944.6 216962.9
>>  scf_env_do_scf                   21601  6.0      0.7      1.0 197423.7 197426.9
>>  scf_env_do_scf_inner_loop       172819  7.0      2.9      3.7 147536.9 147542.8
>>  rebuild_ks_matrix               194420  8.7      0.7      0.7 113213.6 113417.8
>>  qs_ks_build_kohn_sham_matrix    194420  9.7     21.9     25.3 113212.9 113417.2
>>  qs_ks_update_qs_env             194420  8.0      1.5      1.8  85131.3  85315.6
>> *pw_transfer                    4104421 12.9    259.1    293.8  53272.8  54657.8*
>>  sum_up_and_integrate            194420 10.7    158.9    210.4  54433.6  54518.3
>>  integrate_v_rspace              194420 11.7      6.8      8.9  54274.2  54375.6
>> *fft_wrap_pw1pw2                3715581 14.0     41.8     46.1  52788.9  54160.2*
>> *fft_wrap_pw1pw2_200            2160221 15.4   5123.8   5554.3  50067.4  51727.8*
>>  init_scf_loop                    21601  7.0      0.6      0.8  49795.6  49797.8
>>  qs_vxc_create                   194420 10.7      4.5      5.3  49235.6  49729.8
>>  xc_vxc_pw_create                194420 11.7    990.7   1608.7  49231.1  49725.4
>>  qs_rho_update_rho_low           194420  8.1      0.8      0.9  44519.7  44594.7
>>  calculate_rho_elec              194420  9.1    109.2    184.2  44518.9  44593.9
>>  fft3d_ps                       3715581 16.0  10122.9  14717.9  41927.7  43177.5
>>  dbcsr_multiply_generic         2981151 12.6    130.7    138.3  42145.8  42708.1
>>  prepare_preconditioner           21601  8.0      0.1      0.2  38552.7  38578.0
>>  grid_integrate_task_list        194420 12.7  34380.1  37401.8  34380.1  37401.8
>>  xc_pw_derive                   1166520 13.7      9.8     11.0  34752.0  36587.5
>>  make_preconditioner              21601  9.0      0.4      0.6  36451.6  36464.5
>>  qs_scf_new_mos                  172819  8.0      0.9      1.4  34340.9  34469.5
>>  qs_scf_loop_do_ot               172819  9.0      0.8      1.0  34340.0  34468.7
>>  make_full_all                    21601 10.0      4.6      5.3  34250.9  34260.8
>>  ot_scf_mini                     172819 10.0      4.0      4.7  31468.4  31598.1
>>  multiply_cannon                2981151 13.6    192.3    222.5  29228.9  30932.4
>>  mp_alltoall_z22v               3715581 18.0  27949.9  30346.5  27949.9  30346.5
>>  multiply_cannon_loop           2981151 14.6    215.7    231.9  27541.2  28832.2
>>  xc_pw_divergence                194420 12.7      4.0      4.7  26628.5  28220.2
>>  qs_ks_update_qs_env_forces       21601  5.0      0.1      0.2  28174.2  28195.0
>>  xc_rho_set_and_dset_create      194420 12.7    239.1    269.1  20595.1  28022.0
>>  ot_mini                         172819 11.0      1.0      1.4  23618.8  23759.5
>>  grid_collocate_task_list        194420 10.1  22590.9  23442.1  22590.9  23442.1
>>  rs_pw_transfer                 1987402 12.3     41.9     48.7  21225.8  22117.5
>>  yz_to_x                         993701 17.7    742.7   1149.4  19795.7  21777.5
>>  density_rs2pw                   194420 10.1      8.1      8.9  20317.1  21233.4
>>  mp_waitall_1                   ******* -4.3  17314.7  19114.1  17314.7  19114.1
>>  multiply_cannon_multrec        ******* 15.6  17135.6  18128.8  17149.2  18142.9
>>  cp_fm_diag_elpa                  64803 11.3      0.4      0.5  18118.8  18125.4
>>  cp_fm_diag_elpa_base             64803 12.3  17919.4  18033.2  18104.5  18104.8
>>  mp_waitany                     ******* 14.1  13555.6  14684.3  13555.6  14684.3
>>  xc_functional_eval              194420 13.7      2.0      2.6   8289.8  14473.2
>>  pbe_lda_eval                    194420 14.7   8287.8  14472.3   8287.8  14472.3
>>
>>
>> 2) Unfortunately, no pdbg-version was compiled on my HPC center, but I 
>> did run a calculation with TRACE and TRACE_MASTER enabled. See attached 
>> files for the output.
>>
>> Kind regards,
>> Léon
>> On Friday 16 February 2024 at 09:36:56 UTC+1 Frederick Stein wrote:
>>
>>> Dear Leon,
>>>
>>> I do not know the root of the error and I cannot suggest a solution. The 
>>> options themselves are tested within our regtest suite and we do not find 
>>> any issues there. So, it seems to be a more complicated problem either on 
>>> the FFTW side or on the CP2K side.
>>>
>>> I have two questions:
>>>
>>> 1. Did you check whether the FFT kernel actually needs an improvement? 
>>> Check the runtime of the routines pw_transfer and those starting with 
>>> fft_wrap_pw1pw2 (or similar).
>>>
>>> 2. Do you have a pdbg-version of CP2K available? If yes, can you run one 
>>> of the failing tests with that one? It might also help to turn on the 
>>> keywords TRACE and TRACE_MASTER in the GLOBAL section of your input files 
>>> to identify the actual routine on the CP2K side where the error occurs.
>>>
>>> Regards,
>>>
>>> Frederick
>>>
>>> On Monday, 12 February 2024 at 10:42:16 UTC+1, Léon Luntadila Lufungula 
>>> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I've been running some AIMD calculations and am trying to speed up the 
>>>> calculations a bit by playing with the FFTW_PLAN_TYPE option. 
>>>> Unfortunately, only MEASURE and the default ESTIMATE are working. If I try 
>>>> to set it to PATIENT (as recommended for long AIMD runs) or EXHAUSTIVE, the 
>>>> calculation crashes almost immediately with the following error messages 
>>>> (see also attached files):
>>>>
>>>> [PATIENT]
>>>> ...
>>>> corrupted double-linked list
>>>> corrupted double-linked list (not small)
>>>> cp2k.popt: malloc.c:4106: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
>>>> Program received signal SIGABRT: Process abort signal.
>>>> ...
>>>>
>>>> [EXHAUSTIVE]
>>>> ...
>>>> malloc_consolidate(): unaligned fastbin chunk detected
>>>> malloc_consolidate(): unaligned fastbin chunk detected
>>>> Program received signal SIGABRT: Process abort signal.
>>>> Backtrace for this error:
>>>> corrupted double-linked list
>>>> Program received signal SIGABRT: Process abort signal.
>>>> Backtrace for this error:
>>>> malloc_consolidate(): unaligned fastbin chunk detected
>>>> Program received signal SIGABRT: Process abort signal.
>>>> Backtrace for this error:
>>>> Program received signal SIGABRT: Process abort signal.
>>>> Backtrace for this error:
>>>> corrupted double-linked list
>>>> cp2k.popt: malloc.c:4106: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
>>>> ...
>>>>
>>>> I'm running CP2K/2023.2-foss-2022a as compiled with Easybuild by our 
>>>> HPC centre, but the same problems appear when I try the 
>>>> CP2K/2022.1-foss-2022a version. However, when I run it with the 
>>>> CP2K/7.1-intel-2020a version which is also available, both EXHAUSTIVE and 
>>>> PATIENT seem to be working properly... Is this something that can be solved 
>>>> in some way or will this require a different compilation of CP2K, possibly 
>>>> with the intel toolchain instead of the foss toolchain? 
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>> Léon
>>>>
>>>>
>>>>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: def12.png
Type: image/png
Size: 48811 bytes
Desc: not available
URL: <https://lists.cp2k.org/archives/cp2k-user/attachments/20240216/eae1658b/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: a101-solv40.md-NPT_F-eq-patient-7.1.inp
Type: chemical/x-gamess-input
Size: 6168 bytes
Desc: not available
URL: <https://lists.cp2k.org/archives/cp2k-user/attachments/20240216/eae1658b/attachment-0001.inp>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: a101-solv40.md-NPT_F-eq-patient-7.1.out
Type: application/octet-stream
Size: 31950 bytes
Desc: not available
URL: <https://lists.cp2k.org/archives/cp2k-user/attachments/20240216/eae1658b/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: a101-solv40.md-NPT_F-eq-patient-7.1.1356778.slurm
Type: application/x-shellscript
Size: 8401 bytes
Desc: not available
URL: <https://lists.cp2k.org/archives/cp2k-user/attachments/20240216/eae1658b/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: def10.png
Type: image/png
Size: 49012 bytes
Desc: not available
URL: <https://lists.cp2k.org/archives/cp2k-user/attachments/20240216/eae1658b/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot 2024-02-16 115704.png
Type: image/png
Size: 76911 bytes
Desc: not available
URL: <https://lists.cp2k.org/archives/cp2k-user/attachments/20240216/eae1658b/attachment-0005.png>

