[CP2K-user] Creating/Finding a small but realistic MD model for my research

Tue Boesen aly... at gmail.com
Tue May 18 15:38:46 UTC 2021

I’m completely new to cp2k and have only just installed it today, because I 
learned that it was used to generate the MD17 dataset, which I am 
interested in.

I’m currently starting up a neural network approach to molecular dynamics 
and for that I need a dataset. The ideal dataset for my research is 
essentially the MD17 dataset found here: 
<http://www.quantum-machine.org/datasets/#md-datasets>. However, there is a 
problem with this dataset for my use case. As quoted from the originating 
article, the MD17 dataset was created as follows:

"The data used for training the DFT models were created running ab initio MD 
in the NVT ensemble using the Nosé-Hoover thermostat at 500 K during a 
200 ps simulation with a resolution of 0.5 fs. We computed forces and 
energies using all-electrons at the generalized gradient approximation 
level of theory with the Perdew-Burke-Ernzerhof (PBE) [65] 
exchange-correlation functional, treating van der Waals interactions with 
the Tkatchenko-Scheffler (TS) method [66]. All calculations were performed 
with FHI-aims [67]. The final training data was generated by subsampling the 
full trajectory under preservation of the Maxwell-Boltzmann distribution 
for the energies.
To create the coupled cluster datasets, we reused the same geometries as 
for the DFT models and recomputed energies and forces using all-electron 
coupled cluster with single, double, and perturbative triple excitations 
(CCSD(T)). The correlation-consistent basis set cc-pVTZ was used for 
ethanol, cc-pVDZ for and malonaldehyde and CCSD/cc-pVDZ for aspirin. All 
calculations were performed with the Psi4 [68] software suite."

So the data has been subsampled, meaning that consecutive data points in 
the MD17 dataset are not separated by a constant time step, which is what 
I need for my work.
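
To make the requirement concrete, here is a minimal sketch of the check I 
have in mind. It assumes each frame comes with a time stamp (which may not 
be the case for the published MD17 files); the function name 
has_uniform_spacing and the tolerance are my own, purely for illustration.

```python
def has_uniform_spacing(times, rel_tol=1e-6):
    """Return True if consecutive time stamps are equally spaced."""
    if len(times) < 3:
        return True  # zero or one interval is trivially uniform
    steps = [b - a for a, b in zip(times, times[1:])]
    first = steps[0]
    return all(abs(s - first) <= rel_tol * abs(first) for s in steps)


# Full trajectory: one frame every 0.5 fs (made-up example values).
full = [0.5 * i for i in range(10)]
# Subsampled trajectory: frames picked at irregular indices.
subsampled = [full[i] for i in (0, 1, 4, 5, 9)]

print(has_uniform_spacing(full))
print(has_uniform_spacing(subsampled))
```

A full trajectory passes this check, while a subsampled one generally does 
not.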

So my questions are:
Is there any way of generating this dataset again given the above 
information? I have tried contacting the author, but haven’t heard anything 
back yet.

Alternatively, are there any other simple systems like this available 
online, or does anyone have any scripts or tutorials for generating a 
realistic molecular system dataset?
What I need are the atomic positions at each step, and ideally I would like 
the atomic velocities and force vectors as well if possible. I would like 
to generate at least 100k-500k time steps since I need quite a lot of data 
for the neural network training.
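
As an illustration of the kind of output I could work with, here is a 
minimal sketch of a reader for a multi-frame XYZ file (atom count, comment 
line, then one line per atom). That positions, velocities and forces could 
each be written in this layout is an assumption on my part, and the name 
read_xyz_frames and the sample values are made up.

```python
import io


def read_xyz_frames(handle):
    """Yield (comment, symbols, vectors) for each frame of a multi-frame XYZ file."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file or trailing blank line
        n_atoms = int(header)
        comment = handle.readline().strip()
        symbols, vectors = [], []
        for _ in range(n_atoms):
            fields = handle.readline().split()
            symbols.append(fields[0])
            vectors.append(tuple(float(x) for x in fields[1:4]))
        yield comment, symbols, vectors


# Two made-up frames of a two-atom system, standing in for a real file.
sample = io.StringIO(
    "2\n"
    "i = 0, time = 0.000\n"
    "H 0.0 0.0 0.0\n"
    "H 0.0 0.0 0.74\n"
    "2\n"
    "i = 1, time = 0.500\n"
    "H 0.0 0.0 0.01\n"
    "H 0.0 0.0 0.75\n"
)
frames = list(read_xyz_frames(sample))
print(len(frames))
```

The same reader would apply to a velocity or force file if it uses the same 
layout, with the three columns holding the vector components instead of 
coordinates.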

Any insight from experienced cp2k users or people in the field of molecular 
dynamics would be greatly appreciated.
