Preparing a Neural Network Potential Dataset

To train a neural network potential (NNP), you need a dataset of atomic systems and their labels. Enerzyme supports pickle, npz, and hdf5 formats.

Format	When to use
`pickle`	Quick start; list of Python dicts (most intuitive)
`npz`	NumPy-native storage; good for large numeric arrays
`hdf5`	Efficient random access; used internally after preprocessing

This page focuses on pickle as the entry point. The same field naming and Datahub mappings apply to all formats.

Datapoint schema

A pickle dataset is a list of datapoints. Each datapoint is a dictionary of attribute name–value pairs. A typical QM-labeled entry includes:

number_of_atoms — integer \(N\)
coordinates — float array of shape (N, 3) in Å (or your chosen length unit)
atomic_numbers — integer array of shape (N,)
energy — scalar total energy
forces — float array of shape (N, 3) (or raw QC gradients; see below)
atomic_charges — float array of shape (N,)
total_charge — integer (defaults to 0 if omitted)
dipole — float array of shape (3,)

Note

coordinates and atomic_numbers define the system. number_of_atoms can be inferred from atomic_numbers. For PES learning, energy and/or forces are required targets. Additional fields depend on your task and model architecture.
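The schema above can be checked before writing a dataset. The following is a minimal sketch, not part of Enerzyme: `check_datapoint` is a hypothetical helper, and the water geometry and energy are illustrative placeholders rather than computed results.

```python
import numpy as np


def check_datapoint(dp):
    """Return a list of schema problems for one datapoint (empty if none)."""
    problems = [
        f"missing required field: {key}"
        for key in ("coordinates", "atomic_numbers")
        if key not in dp
    ]
    if problems:
        return problems

    z = np.asarray(dp["atomic_numbers"])
    if z.ndim != 1:
        return [f"atomic_numbers: expected shape (N,), got {z.shape}"]
    n = z.shape[0]

    # number_of_atoms is optional because it can be inferred from atomic_numbers.
    if dp.get("number_of_atoms", n) != n:
        problems.append("number_of_atoms disagrees with atomic_numbers")

    # Shapes taken from the field list above; optional fields are checked only if present.
    expected = {
        "coordinates": (n, 3),
        "forces": (n, 3),
        "atomic_charges": (n,),
        "dipole": (3,),
    }
    for key, shape in expected.items():
        if key in dp and np.shape(dp[key]) != shape:
            problems.append(f"{key}: expected shape {shape}, got {np.shape(dp[key])}")

    if "energy" not in dp and "forces" not in dp:
        problems.append("PES learning needs energy and/or forces")
    return problems


# Illustrative water molecule; the numbers are placeholders, not QM output.
water = {
    "atomic_numbers": np.array([8, 1, 1]),
    "coordinates": np.array([
        [0.000, 0.000, 0.117],
        [0.000, 0.757, -0.469],
        [0.000, -0.757, -0.469],
    ]),
    "energy": -76.0,
    "forces": np.zeros((3, 3)),
}
print(check_datapoint(water))
```

Running a check like this over every datapoint before pickling catches shape mismatches early, while the offending structure is still easy to identify.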

Standard field mapping

Enerzyme maps your attribute names to internal standard names in the training YAML (see Training a Neural Network Potential). Common mappings:

Your attribute	Standard name
coordinates	`Ra`
atomic_numbers	`Za`
number_of_atoms	`N`
energy	`E`
forces	`Fa`
atomic_charges	`Qa`
total_charge	`Q`
dipole	`M2`
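To make the table concrete, the renaming can be pictured as a dictionary lookup. This is only a conceptual sketch: Enerzyme reads the mapping from the training YAML, and `STANDARD_NAMES` and `to_standard` are hypothetical names introduced here for illustration.

```python
# Attribute name -> standard name, copied from the table above.
STANDARD_NAMES = {
    "coordinates": "Ra",
    "atomic_numbers": "Za",
    "number_of_atoms": "N",
    "energy": "E",
    "forces": "Fa",
    "atomic_charges": "Qa",
    "total_charge": "Q",
    "dipole": "M2",
}


def to_standard(datapoint, mapping=STANDARD_NAMES):
    """Rename mapped fields; fields without a mapping are dropped."""
    return {mapping[key]: value for key, value in datapoint.items() if key in mapping}


example = {"coordinates": [[0.0, 0.0, 0.0]], "energy": -1.0, "comment": "unmapped"}
print(to_standard(example))
```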

Units and gradients

Caution

Forces vs. gradients. Quantum chemistry packages often output energy gradients \(\nabla E\), not forces. Forces are \(F = -\nabla E\). Set negative_gradient: true in Datahub transforms when your Fa targets are raw gradients.
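The sign convention can be verified on a toy potential. The sketch below uses a hypothetical harmonic energy with its minimum at the origin (not a quantum chemistry calculation), differentiates it numerically, and negates the gradient to obtain the force.

```python
import numpy as np


def energy(r, k=2.0):
    """Toy harmonic energy 0.5 * k * |r|^2 with its minimum at the origin."""
    return 0.5 * k * float(np.dot(r, r))


def numerical_gradient(f, r, h=1e-6):
    """Central finite-difference gradient of a scalar function f at point r."""
    grad = np.zeros_like(r)
    for i in range(r.size):
        step = np.zeros_like(r)
        step[i] = h
        grad[i] = (f(r + step) - f(r - step)) / (2 * h)
    return grad


r = np.array([0.5, -0.2, 0.1])
gradient = numerical_gradient(energy, r)  # points uphill, away from the minimum
forces = -gradient                        # points downhill, back toward the minimum
print(gradient, forces)
```

Storing `gradient` under a forces key without flipping the sign would train a model that pushes atoms uphill, which is the situation the `negative_gradient` transform is meant to correct.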

Note

TeraChem gradients. The helper script scripts/picklizer.py converts TeraChem gradient files from Ha/Bohr to Ha/Å by dividing by 0.5291772108. Keep units consistent with Hartree_in_E and Bohr_in_R in your model config.
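The unit conversion described in this note is a single division. The sketch below mirrors that arithmetic only; it is not the code in scripts/picklizer.py, and `gradient_bohr_to_angstrom` is a hypothetical name.

```python
import numpy as np

# Length of one Bohr in Å, using the value quoted in the note above.
BOHR_IN_ANGSTROM = 0.5291772108


def gradient_bohr_to_angstrom(grad_ha_per_bohr):
    """Convert a gradient from Ha/Bohr to Ha/Å.

    One Bohr spans about 0.529 Å, so the same energy change per Å is
    larger than per Bohr: divide by the Bohr length in Å.
    """
    return np.asarray(grad_ha_per_bohr, dtype=float) / BOHR_IN_ANGSTROM


grad_bohr = np.array([[0.01, 0.0, -0.02]])
grad_angstrom = gradient_bohr_to_angstrom(grad_bohr)
print(grad_angstrom)
```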

Building a dataset from TeraChem outputs

The repository includes scripts/picklizer.py for grouping TeraChem output files into a pickle. Each entry in file_lists is a dict pointing to per-structure files:

from scripts.picklizer import picklizer

file_lists = [
    {
        "coord": "run001/structure.xyz",
        "grad": "run001/grad.xyz",
        "chrg": "run001/mulliken.chrg",
        "dipole": "run001/dipole.txt",
    },
    # ...
]
picklizer(file_lists, output="dataset.pkl", flavor="terachem", provide_Q=-1)

The resulting datapoints use keys such as coord, grad, chrg, dipole, and total_chrg; the example mapping below also assumes atom_type and energy keys. Map them in your YAML, for example:

features:
    Ra: coord
    Za: atom_type
    Q: total_chrg
targets:
    E: energy
    Fa: grad
    Qa: chrg
    M2: dipole
transforms:
    negative_gradient: true

Generic pickle builder

If you already have a parser for your QM package:

import pickle
from my_script import parse_qm_output, find_qm_outputs

datapoints = []
for qm_output in find_qm_outputs():
    parsed_data = parse_qm_output(qm_output)
    datapoints.append({
        'number_of_atoms': parsed_data['number_of_atoms'],
        'coordinates': parsed_data['coordinates'],
        'atomic_numbers': parsed_data['atomic_numbers'],
        'energy': parsed_data['energy'],
        'forces': parsed_data['forces'],
        'atomic_charges': parsed_data['atomic_charges'],
        'total_charge': parsed_data['total_charge'],
        'dipole': parsed_data['dipole'],
    })

with open('dataset.pkl', 'wb') as f:
    pickle.dump(datapoints, f)
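After writing the file, it can help to read it back and confirm that every datapoint carries the fields you expect. The following is a minimal sketch using only the standard library; `summarize_pickle` is a hypothetical helper, and the demo dataset holds placeholder values stored as plain lists (NumPy arrays would be pickled the same way).

```python
import os
import pickle
import tempfile
from collections import Counter


def summarize_pickle(path):
    """Load a pickle dataset and count how often each field name occurs."""
    with open(path, "rb") as f:
        datapoints = pickle.load(f)  # only load files you created or trust
    field_counts = Counter(key for dp in datapoints for key in dp)
    return len(datapoints), dict(field_counts)


# Tiny stand-in dataset with placeholder values (not real QM results).
demo = [
    {"atomic_numbers": [1, 1],
     "coordinates": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]],
     "energy": -1.0},
    {"atomic_numbers": [2],
     "coordinates": [[0.0, 0.0, 0.0]],
     "energy": -2.0,
     "total_charge": 0},
]
demo_path = os.path.join(tempfile.mkdtemp(), "dataset.pkl")
with open(demo_path, "wb") as f:
    pickle.dump(demo, f)

n_points, field_counts = summarize_pickle(demo_path)
print(n_points, field_counts)
```

A field whose count is lower than the number of datapoints (here `total_charge`) is present in only some entries, which is worth resolving before mapping it in the YAML.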

Preprocess without training

To only preprocess and split a dataset (write HDF5 cache and partition indices) without starting training:

enerzyme collect -c train.yaml -o .

Use the same Datahub and Trainer.Splitter sections as in a training config. This is useful to validate mappings and inspect processed_dataset_<hash>/ before a long run.

Security and compatibility

Danger

Pickle files are not secure: unpickling can execute arbitrary code. Do not load pickles from untrusted sources.

Caution

Pickle compatibility depends on Python and library versions. Loading a file created with NumPy 2.x under NumPy 1.x may raise ModuleNotFoundError: No module named 'numpy._core'.
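One way to make this failure easier to diagnose is to wrap the load and attach a hint to the error. The sketch below is an assumption about how you might do this in your own scripts, not Enerzyme behavior; `load_pickle_dataset` is a hypothetical helper.

```python
import os
import pickle
import tempfile


def load_pickle_dataset(path):
    """Load a trusted pickle dataset, adding a hint for the NumPy version mismatch."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except ModuleNotFoundError as exc:
        if "numpy._core" in str(exc):
            raise RuntimeError(
                f"{path} appears to have been written with NumPy 2.x; "
                "load it in an environment with NumPy 2.x or regenerate it."
            ) from exc
        raise  # unrelated missing module: re-raise unchanged


# Round trip with a placeholder dataset to show normal use.
roundtrip_path = os.path.join(tempfile.mkdtemp(), "dataset.pkl")
with open(roundtrip_path, "wb") as f:
    pickle.dump([{"energy": -1.0}], f)
print(load_pickle_dataset(roundtrip_path))
```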