Training a Neural Network Potential

To train a neural network potential (NNP) with dataset.pkl, write a configuration file train.yaml that specifies data preprocessing, model architecture, training protocol, and metrics. This page walks through a minimal SchNet setup. The full multi-architecture reference is enerzyme/config/train.yaml.

Datahub

Datahub:

The Datahub section connects your dataset to the model. At minimum, specify:

data_path — path to the dataset (relative paths are allowed)
data_format — pickle, npz, or hdf5

Then define features (model inputs) and targets (quantities to fit). Each key is a standard Enerzyme field name; the value is the attribute name in your dataset (leave empty if they match).

Standard Data Field Conventions

Physical quantities use capitalized standard names. Per-atom quantities end with a lowercase a:

N — number of atoms
Ra — coordinates
Za — atomic numbers
E — total energy
Fa — forces
Qa — atomic charges
Q — total charge
M2 — dipole moment

You may rename dataset attributes freely as long as the mapping is consistent, e.g. Ra: xyz instead of Ra: coordinates.
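The mapping rule described above can be sketched in a few lines of Python. This only illustrates the convention and is not Enerzyme's loader: the function name resolve_field_map, the sample record, and its placeholder values are all hypothetical.

```python
def resolve_field_map(mapping):
    """Return {standard_name: dataset_attribute}.

    An empty value (None or "") means the dataset attribute already
    uses the standard name, so the key maps to itself.
    """
    return {std: (attr if attr else std) for std, attr in mapping.items()}


# Hypothetical Datahub-style mappings: Ra is stored as "xyz",
# while Q and E already use their standard names.
features = {"Ra": "xyz", "Za": "atomic_numbers", "Q": None}
targets = {"E": "", "Fa": "forces"}

resolved = resolve_field_map({**features, **targets})

# Hypothetical single-frame record keyed by dataset attribute names.
# All numbers are placeholders, not reference data.
record = {
    "xyz": [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
    "atomic_numbers": [8, 1, 1],
    "Q": 0,
    "E": -1.0,
    "forces": [[0.0, 0.0, 0.0]] * 3,
}

# Re-key the record by standard field names.
standard = {std: record[attr] for std, attr in resolved.items()}
print(resolved)
print(sorted(standard))
```

Keeping the mapping explicit in one place means the dataset file itself never has to be rewritten when attribute names differ from the standard ones.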

Preprocessing the Dataset

NNPs in Enerzyme are graph neural networks. Optional preprocessing improves efficiency and accuracy:

neighbor_list: full — precompute all-pairs neighbor lists before training
compressed: true — share Za, N, Q, and neighbor lists across frames of the same stoichiometry (same atom order)
preload: true — reuse cached HDF5 under processed_dataset_<hash>/ when config hash matches
transforms.negative_gradient: true — flip gradient sign when Fa stores QC gradients, not forces
transforms.atomic_energy — subtract per-atom reference energies from a CSV (atom_type, atomic_energy)
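The arithmetic behind three of these options can be sketched as follows. This is a minimal illustration under stated assumptions, not Enerzyme's implementation: the helper names are hypothetical, the full neighbor list is assumed to contain ordered pairs, the CSV is assumed to have a header row, and atom_type is assumed to hold atomic numbers.

```python
import csv
import io


def full_neighbor_list(num_atoms):
    """All ordered pairs (i, j) with i != j, with no distance cutoff."""
    return [(i, j) for i in range(num_atoms) for j in range(num_atoms) if i != j]


def negate_gradient(gradient):
    """Convert an energy gradient dE/dR into forces F = -dE/dR."""
    return [[-component for component in atom] for atom in gradient]


def load_atomic_energies(csv_text):
    """Parse (atom_type, atomic_energy) rows into {atom_type: energy}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["atom_type"]): float(row["atomic_energy"]) for row in reader}


def subtract_atomic_energies(total_energy, atomic_numbers, atomic_energies):
    """Remove the sum of per-atom reference energies from a total energy."""
    return total_energy - sum(atomic_energies[z] for z in atomic_numbers)


# Placeholder numbers for a three-atom frame; not reference data.
CSV_TEXT = "atom_type,atomic_energy\n1,-0.5\n8,-75.0\n"
atomic_numbers = [8, 1, 1]
gradient = [[0.1, -0.2, 0.0], [0.0, 0.3, 0.0], [-0.1, -0.1, 0.0]]

pairs = full_neighbor_list(len(atomic_numbers))
forces = negate_gradient(gradient)
shifted = subtract_atomic_energies(
    -76.4, atomic_numbers, load_atomic_energies(CSV_TEXT)
)

print(len(pairs))
print(forces[0])
print(round(shifted, 6))
```

Subtracting per-atom references leaves a much smaller residual energy for the network to fit, which is the usual motivation for this kind of transform.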

Note

The canonical key is compressed, not compression. Enerzyme still accepts the older single-dataset layout that has no datasets: key, but new projects should follow the reference train.yaml.

Final Datahub Configuration

Datahub:
    data_path: "dataset.pkl"
    data_format: "pickle"
    features:
        N: "number_of_atoms"
        Ra: "coordinates"
        Za: "atomic_numbers"
        Q: "total_charge"
    targets:
        E: "energy"
        Fa: "forces"
        Qa: "atomic_charges"
        M2: "dipole"
    neighbor_list: full
    compressed: true
    preload: true
    transforms:
        atomic_energy: "atomic_energy.csv"
        negative_gradient: true

Modelhub

Models live under internal_FFs (built into Enerzyme) or external_FFs (NequIP, XPaiNN, etc.). Each model has a unique ID (e.g. FF01).

Choosing an architecture

Architecture	Good for	Notes
SchNet	First tutorial / baseline	Charge + dipole capable
PhysNet	Production charge-aware PES	Electrostatics, D3 layers
SpookyNet	Large organic / mixed systems	Similar feature set
MACE	Equivariant accuracy	Higher compute cost
NequIP	External equivariant model	Requires `nequip`
XPaiNN	External XPaiNN via XequiNet	Extra pip packages

Enable exactly one model (active: true) when starting out.
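A quick pre-flight check for this rule might look like the sketch below. It works on an already parsed Modelhub dictionary; the function name active_models and the second model FF02 are hypothetical, and Enerzyme may perform its own validation.

```python
def active_models(modelhub):
    """Return the IDs of all models flagged active across both FF groups."""
    ids = []
    for group in ("internal_FFs", "external_FFs"):
        # An empty YAML key typically parses to None, so fall back to {}.
        for model_id, cfg in (modelhub.get(group) or {}).items():
            if cfg.get("active"):
                ids.append(model_id)
    return ids


# Hypothetical parsed Modelhub section with one active model.
modelhub = {
    "internal_FFs": {
        "FF01": {"architecture": "SchNet", "active": True},
        "FF02": {"architecture": "PhysNet", "active": False},
    },
    "external_FFs": None,
}

active = active_models(modelhub)
if len(active) != 1:
    raise ValueError(f"expected exactly one active model, found {active}")
print(active)
```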

SchNet configuration

SchNet is fully internal and needs no extra pip packages. Key entries:

architecture: SchNet
build_params — dim_embedding, num_rbf, max_Za, cutoff_sr, Hartree_in_E, Bohr_in_R
layers — modular stack ending with Force for analytic forces
loss — weighted sum over targets (convert force weights to your energy/length units)
Metric — per-model validation metric for early stopping (same weights as loss is common)

Note

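Hartree_in_E and Bohr_in_R convert internal atomic units to your dataset units: they are the value of one Hartree in your energy unit and of one Bohr in your length unit. For energies in Hartree and lengths in Å, use 1 and 0.5291772108. For eV and Å, use approximately 27.2114 and 0.529177.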
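The conversion factors, and one way to rescale loss weights to match them, can be sketched as below. The helper names are hypothetical. The base weights of 100 for forces and 1 for dipoles in atomic units are an inference: they happen to reproduce the 52.917721 and 1.8897261 used in this page's SchNet configuration, but the source does not state how those numbers were derived.

```python
HARTREE_IN_EV = 27.211386        # 1 Hartree in eV (rounded)
BOHR_IN_ANGSTROM = 0.5291772108  # 1 Bohr in Å, the value used on this page


def build_unit_params(energy_unit):
    """Return Hartree_in_E and Bohr_in_R for Å lengths and 'Ha' or 'eV' energies."""
    hartree_in_e = {"Ha": 1.0, "eV": HARTREE_IN_EV}[energy_unit]
    return {"Hartree_in_E": hartree_in_e, "Bohr_in_R": BOHR_IN_ANGSTROM}


def rescale_weights(force_weight_au, dipole_weight_au, bohr_in_r):
    """Rescale weights defined per Bohr so they act on per-Å quantities.

    A force in energy/Å is numerically larger than the same force in
    energy/Bohr by 1 / bohr_in_r, so its weight shrinks by bohr_in_r.
    A dipole in charge*Å is smaller by bohr_in_r, so its weight grows.
    """
    return {
        "Fa": force_weight_au * bohr_in_r,
        "M2": dipole_weight_au / bohr_in_r,
    }


params_ha = build_unit_params("Ha")
params_ev = build_unit_params("eV")
# Assumed base weights in atomic units (not stated in the source).
weights = rescale_weights(100.0, 1.0, BOHR_IN_ANGSTROM)

print(params_ha)
print(params_ev)
print({name: round(value, 7) for name, value in weights.items()})
```

If your dataset uses eV instead of Hartree, the energy-dependent weights would need the same kind of rescaling so that each loss term keeps its intended relative importance.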

Final Modelhub configuration

Modelhub:
    internal_FFs:
        FF01:
            suffix:
            architecture: SchNet
            active: true
            build_params:
                dim_embedding: 128
                num_rbf: 128
                max_Za: 94
                cutoff_sr: 5.0
                Hartree_in_E: 1
                Bohr_in_R: 0.5291772108
            layers:
              - name: RangeSeparation
              - name: GaussianSmearing
              - name: RandomAtomEmbedding
              - name: Core
                params:
                    num_interactions: 4
                    hidden_channels: 128
              - name: AtomicAffine
              - name: ChargeConservation
              - name: AtomicCharge2Dipole
              - name: ElectrostaticEnergy
              - name: EnergyReduce
              - name: Force
            loss:
                rmse:
                    Fa: 52.917721
                    Qa: 1
                    E: 1
                    M2: 1.8897261
                    Q: 1
            Metric:
                E:
                    rmse: 1
                Fa:
                    rmse: 52.917721
                Qa:
                    rmse: 1
                M2:
                    rmse: 1.8897261
                Q:
                    rmse: 1

Trainer

Splitter

Random splitting is configured under Trainer.Splitter:

parts — partition names (at least training and validation)
ratios — fractions or absolute counts, in the same order as parts
seed, save, preload — reproducibility and index caching
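The options above can be illustrated with a small seeded splitter. This is a sketch of the general technique, not Enerzyme's splitter: the function name random_split is hypothetical, and treating ratios that sum to at most 1 as fractions (otherwise as absolute counts) is an assumption about how the two forms are distinguished.

```python
import random


def random_split(num_frames, parts, ratios, seed):
    """Shuffle frame indices with a fixed seed and cut them into named parts."""
    if len(parts) != len(ratios):
        raise ValueError("parts and ratios must have the same length")

    if sum(ratios) <= 1.0 + 1e-9:
        # Fractions: round all but the last; the last part takes the remainder.
        counts = [round(r * num_frames) for r in ratios[:-1]]
        counts.append(num_frames - sum(counts))
    else:
        # Absolute counts, used as given.
        counts = [int(r) for r in ratios]
    if min(counts) < 0 or sum(counts) > num_frames:
        raise ValueError("requested split does not fit the dataset")

    indices = list(range(num_frames))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible

    split, start = {}, 0
    for name, count in zip(parts, counts):
        split[name] = indices[start:start + count]
        start += count
    return split


parts = ["training", "validation", "test"]
split = random_split(100, parts, [0.7, 0.1, 0.2], seed=42)
print({name: len(idx) for name, idx in split.items()})
```

Saving the resulting index lists alongside the run is what makes a later evaluation use exactly the same test frames.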

Training loop

Each epoch loads mini-batches of batch_size using num_workers processes. Training stops at max_epochs or when validation judge_score fails to improve for patience epochs.
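The stopping rule can be sketched as a plain loop over per-epoch validation scores. This assumes a lower judge_score is better, which is typical for RMSE-based metrics but is not stated explicitly here; the function name is hypothetical.

```python
def run_early_stopping(validation_scores, max_epochs, patience):
    """Return (last_epoch_run, best_epoch), both 1-based.

    Training ends after max_epochs, or once the score has failed to
    improve on the best value for `patience` consecutive epochs.
    """
    best_score = float("inf")
    best_epoch = 0
    stale_epochs = 0
    last_epoch = 0

    for epoch, score in enumerate(validation_scores[:max_epochs], start=1):
        last_epoch = epoch
        if score < best_score:
            best_score, best_epoch = score, epoch
            stale_epochs = 0  # improvement resets the patience counter
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    return last_epoch, best_epoch


# Placeholder validation scores, one per epoch.
scores = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]

stopped_early = run_early_stopping(scores, max_epochs=10000, patience=3)
ran_to_end = run_early_stopping(scores, max_epochs=10000, patience=4)
print(stopped_early, ran_to_end)
```

With a patience of 3 the run stops before reaching the later improvement at epoch 6, which shows why a patience that is too small can cut training short.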

Optimizer and scheduler

Enerzyme uses Adam with a linear warmup scheduler (schedule: linear, warmup_ratio).
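A common form of linear warmup is sketched below for intuition. It assumes the learning rate rises linearly from zero during the first warmup_ratio of all steps and then decays linearly back to zero, as in many schedules named "linear"; whether Enerzyme decays after warmup is not confirmed by this page, and the function name is hypothetical.

```python
def linear_warmup_lr(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at a given optimizer step.

    Assumed shape: linear ramp 0 -> base_lr over the warmup steps,
    then linear decay base_lr -> 0 over the remaining steps.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)


# Illustrative numbers: 1000 steps with a 10% warmup.
total_steps, base_lr, warmup_ratio = 1000, 0.001, 0.1
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, total_steps, base_lr, warmup_ratio))
```

With the warmup_ratio of 0.001 used on this page, the ramp covers only a tiny fraction of training, so the schedule is dominated by whatever happens after warmup.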

System settings

Set cuda: true for GPU training and dtype: float32 (or float64). Set seed for reproducibility.

Caution

CUDA nondeterminism may still cause run-to-run differences even with a fixed seed.

Final Trainer configuration

Trainer:
    Splitter:
        method: random
        parts:
        - training
        - validation
        - test
        preload: true
        ratios:
        - 0.7
        - 0.1
        - 0.2
        save: true
        seed: 42
    batch_size: 64
    cuda: true
    dtype: float32
    schedule: linear
    learning_rate: 0.001
    max_epochs: 10000
    num_workers: 10
    patience: 50
    seed: 42
    warmup_ratio: 0.001

Running the training job

enerzyme train -c train.yaml -o .

Output artifacts

After training, the output directory typically contains:

config.yaml — resolved configuration (keep this for predict/simulate)
processed_dataset_<hash>/ — preprocessed HDF5 cache
logs/ — training logs, metrics, early-stopping traces
FF01/ (or your model ID) — best/ and last/ checkpoints

Use enerzyme/config/train.yaml when you need multi-dataset Datahub, external models, EMA, Lightning multi-GPU, or pretraining paths.