Model backbone
The designed model backbone predicts the equilibrium conformation and the QC property simultaneously, denoted as \((y,\, {\hat{{\boldsymbol{r}}}})=f({{\boldsymbol{X}}},\, {{\boldsymbol{E}}},\, {{\boldsymbol{r}}};{{\boldsymbol{\theta }}})\). The model takes three inputs: (i) atom features (\({{\boldsymbol{X}}}\in {{\mathbb{R}}}^{n\times {d}_{f}}\), where n is the number of atoms and df is the atom feature dimension), (ii) edge features (\({{\boldsymbol{E}}}\in {{\mathbb{R}}}^{n\times n\times {d}_{e}}\), where de is the edge feature dimension), and (iii) 3D coordinates of atoms (\({{\boldsymbol{r}}}\in {{\mathbb{R}}}^{n\times 3}\)). θ is the set of learnable parameters. The model predicts a quantum property y and updated 3D coordinates \({\hat{{\boldsymbol{r}}}}\in {{\mathbb{R}}}^{n\times 3}\).
As illustrated in Fig. 1b, the L-block model maintains two distinct representation tracks: the atom representation and the pair representation. The atom representation is denoted as \({{\boldsymbol{x}}}\in {{\mathbb{R}}}^{n\times {d}_{x}}\), where dx is the dimension of the atom representation. Similarly, the pair representation is denoted as \({{\boldsymbol{p}}}\in {{\mathbb{R}}}^{n\times n\times {d}_{p}}\), where dp is the dimension of the pair representation. The model comprises L blocks, with x(l) and p(l) denoting the output representations of the l-th block. Within each block, the atom representation is first updated through self-attention, incorporating an attention bias derived from the pair representation, and then updated via a feed-forward network (FFN). Concurrently, the pair representation undergoes a series of updates, beginning with an outer product of the atom representation (referred to as OuterProduct), followed by triangular multiplication (referred to as TriangularUpdate) as implemented in AlphaFold218, and finally an update using an FFN. Compared to the backbone used in Uni-Mol17, this backbone enhances the pair representation through two key improvements: (i) employing an outer product for effective atom-to-pair communication, and (ii) utilizing a triangular operator to strengthen the 3D geometric information. Next, we introduce each module in detail.
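To make the two-track design concrete, the following is a minimal PyTorch-style sketch (an illustrative assumption, not the reference implementation) of a single block, composing the atom-track update of Eq. (1) with the pair-track update of Eq. (3). The submodules are passed in as callables and are sketched in the subsections below; layer normalizations are omitted, as in the text.

```python
from torch import nn


class UniMolPlusBlock(nn.Module):
    """One block with two representation tracks: atom (x) and pair (p)."""

    def __init__(self, attn, ffn_x, outer_product, triangular_update, ffn_p):
        super().__init__()
        self.attn = attn                              # SelfAttentionPairBias, Eq. (2)
        self.ffn_x = ffn_x                            # atom-track FFN
        self.outer_product = outer_product            # OuterProduct, Eq. (4)
        self.triangular_update = triangular_update    # TriangularUpdate, Eq. (5)
        self.ffn_p = ffn_p                            # pair-track FFN

    def forward(self, x, p):
        # Atom track, Eq. (1): pair-biased self-attention, then FFN, with residuals.
        x = x + self.attn(x, p)
        x = x + self.ffn_x(x)
        # Pair track, Eq. (3): atom-to-pair communication, triangular update, FFN.
        p = p + self.outer_product(x)
        p = p + self.triangular_update(p)
        p = p + self.ffn_p(p)
        return x, p
```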
Positional encoding
Similar to previous works4,17, we use pair-wise encoding to encode the 3D spatial and 2D graph positional information. Specifically, for 3D spatial information, we utilize the Gaussian kernel for encoding, as done in previous studies5,17. The encoded 3D spatial positional encoding is denoted by ψ3D.
In addition to the 3D positional encodings, we also incorporate graph positional encodings similar to those used in Graphormer. This includes the shortest-path encoding, represented by \({{{\boldsymbol{\psi }}}}_{i,j}^{{{\rm{SP}}}}={\mathtt{Embedding}}\,({{{\rm{sp}}}}_{ij})\), where spij is the shortest path between atoms (i, j) in the molecular graph. Additionally, instead of the time-consuming multi-hop edge encoding method used in Graphormer, we utilize a more efficient one-hop bond encoding, denoted by \({{{\boldsymbol{\psi }}}}^{{{\rm{Bond}}}}={\sum }_{i=1}^{{d}_{e}}{\mathtt{Embedding}}\,({{{\boldsymbol{E}}}}_{i})\), where Ei is the i-th edge feature. Combining the above, the overall positional encoding is ψ = ψ3D + ψSP + ψBond, and the pair representation p is initialized by ψ, i.e., p(0) = ψ.
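A hedged sketch of how the three encodings could be combined into the initial pair representation p(0) is given below. The Gaussian-kernel parameterization (number of kernels, centers, width), the embedding vocabulary sizes, and all class and argument names are illustrative assumptions; the actual parameterization follows the cited prior work.

```python
import torch
from torch import nn


class PairPositionalEncoding(nn.Module):
    """Sketch of psi = psi_3D + psi_SP + psi_Bond (illustrative parameterization)."""

    def __init__(self, d_p, d_e, num_kernels=128, max_sp=32, edge_vocab=16):
        super().__init__()
        # psi_3D: Gaussian basis expansion of pairwise distances, projected to d_p.
        self.register_buffer("centers", torch.linspace(0.0, 10.0, num_kernels))
        self.width = 0.5
        self.gauss_proj = nn.Linear(num_kernels, d_p)
        # psi_SP: embedding of the shortest-path length between atoms i and j.
        self.sp_embed = nn.Embedding(max_sp, d_p)
        # psi_Bond: sum of embeddings of the d_e one-hop edge (bond) features.
        self.bond_embed = nn.ModuleList(nn.Embedding(edge_vocab, d_p) for _ in range(d_e))

    def forward(self, r, sp, E):
        # r: (n, 3) float coordinates; sp: (n, n) integer shortest paths;
        # E: (n, n, d_e) integer edge features.
        dist = torch.cdist(r, r)                                       # (n, n)
        gauss = torch.exp(-((dist[..., None] - self.centers) / self.width) ** 2)
        psi_3d = self.gauss_proj(gauss)                                # (n, n, d_p)
        psi_sp = self.sp_embed(sp.clamp(max=self.sp_embed.num_embeddings - 1))
        psi_bond = sum(emb(E[..., i]) for i, emb in enumerate(self.bond_embed))
        return psi_3d + psi_sp + psi_bond                              # p^(0) = psi
```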
Update of atom representation
The atom representation x(0) is initialized by the embeddings of the atom features, the same as in Graphormer. At the l-th block, x(l) is sequentially updated as follows:
$${{{\boldsymbol{x}}}}^{(l)}=\, {{{\boldsymbol{x}}}}^{(l-1)}+{\mathtt{SelfAttentionPairBias}}\,\left({{{\boldsymbol{x}}}}^{(l-1)},\, {{{\boldsymbol{p}}}}^{(l-1)}\right),\\ {{{\boldsymbol{x}}}}^{(l)}=\, {{{\boldsymbol{x}}}}^{(l)}+{\mathtt{FFN}}\,\left({{{\boldsymbol{x}}}}^{(l)}\right).$$
(1)
The SelfAttentionPairBias function is denoted as:
$${{{\boldsymbol{Q}}}}^{(l,h)}=\, {{{\boldsymbol{x}}}}^{(l-1)}{{{\boldsymbol{W}}}}_{Q}^{(l,h)};\,{{{\boldsymbol{K}}}}^{(l,h)}={{{\boldsymbol{x}}}}^{(l-1)}{{{\boldsymbol{W}}}}_{K}^{(l,h)}; \\ {{{\boldsymbol{B}}}}^{(l,h)}=\, {{{\boldsymbol{p}}}}^{(l-1)}{{{\boldsymbol{W}}}}_{B}^{(l,h)};\,{{{\boldsymbol{V}}}}^{(l,h)}={{{\boldsymbol{x}}}}^{(l-1)}{{{\boldsymbol{W}}}}_{V}^{(l,h)};\\ {\mathtt{output}}=\, {\mathtt{softmax}}\left(\frac{{{{\boldsymbol{Q}}}}^{(l,h)}{({{{\boldsymbol{K}}}}^{(l,h)})}^{T}}{\sqrt{{d}_{h}}}+{{{\boldsymbol{B}}}}^{(l,h)}\right){{{\boldsymbol{V}}}}^{(l,h)};\\ $$
(2)
where dh is the head dimension, \({{{\boldsymbol{W}}}}_{Q}^{(l,h)},{{{\boldsymbol{W}}}}_{K}^{(l,h)},{{{\boldsymbol{W}}}}_{V}^{(l,h)}\in {{\mathbb{R}}}^{{d}_{x}\times {d}_{h}}\), \({{{\boldsymbol{W}}}}_{B}^{(l,h)}\in {{\mathbb{R}}}^{{d}_{p}\times 1}\). FFN is a feed-forward network with one hidden layer. For simplicity, layer normalizations are omitted. Compared to the standard Transformer layer, the only difference here is the attention bias term B(l, h), which incorporates p(l−1) from the pair representation track.
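A minimal PyTorch sketch of Eq. (2), assuming a head count H with dx = H · dh. The per-head bias projections W_B are fused into one linear layer, and a standard multi-head output projection (not written out in Eq. (2)) is included; names are illustrative.

```python
import torch
from torch import nn
import torch.nn.functional as F


class SelfAttentionPairBias(nn.Module):
    """Self-attention over atoms with a bias projected from the pair representation."""

    def __init__(self, d_x, d_p, num_heads):
        super().__init__()
        assert d_x % num_heads == 0
        self.h, self.d_h = num_heads, d_x // num_heads
        self.q = nn.Linear(d_x, d_x, bias=False)            # W_Q, all heads fused
        self.k = nn.Linear(d_x, d_x, bias=False)            # W_K
        self.v = nn.Linear(d_x, d_x, bias=False)            # W_V
        self.bias = nn.Linear(d_p, num_heads, bias=False)   # W_B, one scalar per head
        self.out = nn.Linear(d_x, d_x)                      # multi-head output projection

    def forward(self, x, p):                                # x: (n, d_x), p: (n, n, d_p)
        n = x.shape[0]
        q = self.q(x).view(n, self.h, self.d_h).transpose(0, 1)   # (H, n, d_h)
        k = self.k(x).view(n, self.h, self.d_h).transpose(0, 1)
        v = self.v(x).view(n, self.h, self.d_h).transpose(0, 1)
        b = self.bias(p).permute(2, 0, 1)                          # (H, n, n)
        attn = F.softmax(q @ k.transpose(-1, -2) / self.d_h ** 0.5 + b, dim=-1)
        out = (attn @ v).transpose(0, 1).reshape(n, -1)            # concatenate heads
        return self.out(out)
```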
Update of pair representation
The pair representation p(0) is initialized by the positional encoding ψ. The update process of the pair representation begins with an outer product of x(l), followed by an \({{\mathcal{O}}}({n}^{3})\) triangular multiplication, and is then concluded with an FFN layer. Formally, at the l-th block, p(l) is sequentially updated as follows:
$${{{\boldsymbol{p}}}}^{(l)}=\, {{{\boldsymbol{p}}}}^{(l-1)}+{\mathtt{OuterProduct}}\,({{{\boldsymbol{x}}}}^{(l)}); \\ {{{\boldsymbol{p}}}}^{(l)}=\, {{{\boldsymbol{p}}}}^{(l)}+{\mathtt{TriangularUpdate}}\,({{{\boldsymbol{p}}}}^{(l)});\\ {{{\boldsymbol{p}}}}^{(l)}=\, {{{\boldsymbol{p}}}}^{(l)}+{\mathtt{FFN}}\,({{{\boldsymbol{p}}}}^{(l)}).$$
(3)
The OuterProduct is used for atom-to-pair communication, denoted as:
$$ {{\boldsymbol{a}}}={{{\boldsymbol{x}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{O1}}}}^{(l)},{{\boldsymbol{b}}}={{{\boldsymbol{x}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{O2}}}}^{(l)};\\ {{{\boldsymbol{o}}}}_{i,j}={\mathtt{flatten}}({{{\boldsymbol{a}}}}_{i}\otimes {{{\boldsymbol{b}}}}_{j});\\ {\mathtt{output}}={{\boldsymbol{o}}}{{{\boldsymbol{W}}}}_{{{\rm{O3}}}}^{(l)},$$
(4)
where \({{{\boldsymbol{W}}}}_{{{\rm{O1}}}}^{(l)},{{{\boldsymbol{W}}}}_{{{\rm{O2}}}}^{(l)}\in {{\mathbb{R}}}^{{d}_{x}\times {d}_{o}}\), do is the hidden dimension of OuterProduct, and \({{{\boldsymbol{W}}}}_{{{\rm{O3}}}}^{(l)}\in {{\mathbb{R}}}^{{d}_{o}^{2}\times {d}_{p}}\), o = [oi, j]. Please note that a, b, o are temporary variables in the OuterProduct function. TriangularUpdate is used to enhance pair representation further, denoted as:
$$ {{\boldsymbol{a}}}={\mathtt{sigmoid}}\left({{{\boldsymbol{p}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{T1}}}}^{(l)}\right)\odot \left({{{\boldsymbol{p}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{T2}}}}^{(l)}\right); \\ {{\boldsymbol{b}}}={\mathtt{sigmoid}}\left({{{\boldsymbol{p}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{T3}}}}^{(l)}\right)\odot \left({{{\boldsymbol{p}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{T4}}}}^{(l)}\right);\\ {{{\boldsymbol{o}}}}_{i,j}=\mathop{\sum}_{k}{{{\boldsymbol{a}}}}_{i,k}\odot {{{\boldsymbol{b}}}}_{j,k}+{\sum}_{k}{{{\boldsymbol{a}}}}_{k,i}\odot {{{\boldsymbol{b}}}}_{k,j};\\ {\mathtt{output}}={\mathtt{sigmoid}}\left({{{\boldsymbol{p}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{T5}}}}^{(l)}\right)\odot \left({{\boldsymbol{o}}}{{{\boldsymbol{W}}}}_{{{\rm{T6}}}}^{(l)}\right),$$
(5)
where \({{{\boldsymbol{W}}}}_{{{\rm{T1}}}}^{(l)},{{{\boldsymbol{W}}}}_{{{\rm{T2}}}}^{(l)},{{{\boldsymbol{W}}}}_{{{\rm{T3}}}}^{(l)},{{{\boldsymbol{W}}}}_{{{\rm{T4}}}}^{(l)}\in {{\mathbb{R}}}^{{d}_{p}\times {d}_{t}}\), \({{{\boldsymbol{W}}}}_{{{\rm{T5}}}}^{(l)}\in {{\mathbb{R}}}^{{d}_{p}\times {d}_{p}}\), \({{{\boldsymbol{W}}}}_{{{\rm{T6}}}}^{(l)}\in {{\mathbb{R}}}^{{d}_{t}\times {d}_{p}}\), o = [oi, j], and dt is the hidden dimension of TriangularUpdate. a, b, o are temporary variables. The TriangularUpdate is inspired by the Evoformer in AlphaFold218. The difference is that AlphaFold2 uses two separate modules, “outgoing” (oi,j = ∑k ai,k ⊙ bj,k) and “incoming” (oi,j = ∑k ak,i ⊙ bk,j). In Uni-Mol+, we merge the two into a single module to save computational cost.
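The two pair-track modules of Eqs. (4) and (5) can be sketched as follows; the hidden dimensions do and dt, the bias terms in the linear layers, and the class names are illustrative assumptions. The two einsum terms in TriangularUpdate implement the merged “outgoing” and “incoming” sums described above.

```python
import torch
from torch import nn


class OuterProduct(nn.Module):
    """Atom-to-pair communication, Eq. (4)."""

    def __init__(self, d_x, d_o, d_p):
        super().__init__()
        self.w_o1 = nn.Linear(d_x, d_o, bias=False)    # W_O1
        self.w_o2 = nn.Linear(d_x, d_o, bias=False)    # W_O2
        self.w_o3 = nn.Linear(d_o * d_o, d_p)          # W_O3

    def forward(self, x):                              # x: (n, d_x)
        a, b = self.w_o1(x), self.w_o2(x)              # (n, d_o) each
        # o[i, j] = flatten(a_i outer b_j) for every atom pair (i, j).
        o = torch.einsum("id,je->ijde", a, b)          # (n, n, d_o, d_o)
        return self.w_o3(o.flatten(start_dim=-2))      # (n, n, d_p)


class TriangularUpdate(nn.Module):
    """Merged outgoing/incoming triangular multiplication, Eq. (5)."""

    def __init__(self, d_p, d_t):
        super().__init__()
        self.w_t1 = nn.Linear(d_p, d_t)                # gate for a
        self.w_t2 = nn.Linear(d_p, d_t)
        self.w_t3 = nn.Linear(d_p, d_t)                # gate for b
        self.w_t4 = nn.Linear(d_p, d_t)
        self.w_t5 = nn.Linear(d_p, d_p)                # output gate
        self.w_t6 = nn.Linear(d_t, d_p)                # output projection

    def forward(self, p):                              # p: (n, n, d_p)
        a = torch.sigmoid(self.w_t1(p)) * self.w_t2(p)   # (n, n, d_t)
        b = torch.sigmoid(self.w_t3(p)) * self.w_t4(p)   # (n, n, d_t)
        # outgoing: sum_k a[i,k] * b[j,k]; incoming: sum_k a[k,i] * b[k,j]
        o = torch.einsum("ikd,jkd->ijd", a, b) + torch.einsum("kid,kjd->ijd", a, b)
        return torch.sigmoid(self.w_t5(p)) * self.w_t6(o)
```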
Conformation optimization
The conformation optimization process in many practical applications, such as Molecular Dynamics, is iterative. This approach is also employed in Uni-Mol+. The number of conformation update iterations, denoted as R, is a hyperparameter. We use superscripts on r to distinguish the 3D positions of atoms across iterations; for example, at the i-th iteration, the update can be denoted as (y, r(i)) = f(X, E, r(i−1); θ). It is noteworthy that the parameters θ are shared across all iterations. Moreover, the iterative update in Uni-Mol+ involves only a few rounds, such as 1 or 2, rather than the dozens or hundreds of steps used in Molecular Dynamics.
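A minimal sketch of this iterative refinement, assuming a model with the signature (y, r) = f(X, E, r; θ) described above; the function name is illustrative.

```python
def iterative_refinement(model, X, E, r_init, num_iters=2):
    """Apply the model R = num_iters times with shared parameters;
    only the outputs of the last iteration are kept."""
    y, r = None, r_init
    for _ in range(num_iters):        # R is small, e.g. 1 or 2
        y, r = model(X, E, r)         # (y, r^(i)) = f(X, E, r^(i-1); theta)
    return y, r
```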
3D position prediction head
For the 3D position prediction head in Uni-Mol+, we adopt the prediction head proposed in Graphormer-3D5. The architecture takes the atom representation xL, the pair representation pL, and the initial coordinates c as inputs. An attention mechanism is first applied, and the attention weights are then multiplied element-wise with the pairwise delta coordinates derived from the initial coordinates. Similar to SelfAttentionPairBias, the attention mechanism is denoted as:
$${{{\boldsymbol{Q}}}}^{h}=\, {{{\boldsymbol{x}}}}^{L}{{{\boldsymbol{W}}}}_{Q}^{h};\,{{{\boldsymbol{K}}}}^{h}={{{\boldsymbol{x}}}}^{L}{{{\boldsymbol{W}}}}_{K}^{h}; \\ {{{\boldsymbol{B}}}}^{h}=\, {{{\boldsymbol{p}}}}^{L}{{{\boldsymbol{W}}}}_{B}^{h};\,{{{\boldsymbol{V}}}}^{h}={{{\boldsymbol{x}}}}^{L}{{{\boldsymbol{W}}}}_{V}^{h};\\ {{{\boldsymbol{A}}}}_{i,\, j}^{h}=\, {\mathtt{softmax}}\left(\frac{{{{\boldsymbol{Q}}}}_{i}^{h}{({{{\boldsymbol{K}}}}_{j}^{h})}^{T}}{\sqrt{{d}_{h}}}+{{{\boldsymbol{B}}}}_{i,\, j}^{h}\right);\\ {{\Delta }}{{{\boldsymbol{c}}}}_{i,\, j}=\, {{{\boldsymbol{c}}}}_{i}-{{{\boldsymbol{c}}}}_{j};\,{{{\boldsymbol{A}}}}_{i,\, j}^{(h,0)}={{{\boldsymbol{A}}}}_{i,\, j}^{h}\odot {{\Delta }}{{{\boldsymbol{c}}}}_{i,\, j}^{0};\\ {{{\boldsymbol{A}}}}_{i,\, j}^{(h,\, 1)}=\, {{{\boldsymbol{A}}}}_{i,\, j}^{h}\odot {{\Delta }}{{{\boldsymbol{c}}}}_{i,j}^{1};\,{{{\boldsymbol{A}}}}_{i,\, j}^{(h,2)}={{{\boldsymbol{A}}}}_{i,j}^{h}\odot {{\Delta }}{{{\boldsymbol{c}}}}_{i,\, j}^{2};\\ $$
(6)
where dh is the head dimension, \({{{\boldsymbol{W}}}}_{Q}^{h},{{{\boldsymbol{W}}}}_{K}^{h},{{{\boldsymbol{W}}}}_{V}^{h}\in {{\mathbb{R}}}^{{d}_{x}\times {d}_{h}}\), \({{{\boldsymbol{W}}}}_{B}^{h}\in {{\mathbb{R}}}^{{d}_{p}\times 1}\). Ah denotes the attention weights, and Δci,j is the delta coordinate between ci and cj, where the superscripts 0, 1, and 2 denote the X, Y, and Z axes, respectively. The position prediction head then predicts coordinate updates using three linear projections of the attention head values onto the three axes, denoted as:
$${{{\boldsymbol{o}}}}^{0}=\, {{\mathtt{Concat}}}_{h}({{{\boldsymbol{A}}}}^{(h,0)}{{{\boldsymbol{V}}}}^{h});\,{{{\boldsymbol{o}}}}^{0}={{\mathtt{Linear}}}_{1}({{{\boldsymbol{o}}}}^{0}); \\ {{{\boldsymbol{o}}}}^{1}=\, {{\mathtt{Concat}}}_{h}({{{\boldsymbol{A}}}}^{(h,1)}{{{\boldsymbol{V}}}}^{h});\,{{{\boldsymbol{o}}}}^{1}={{\mathtt{Linear}}}_{2}({{{\boldsymbol{o}}}}^{1});\\ {{{\boldsymbol{o}}}}^{2}=\, {{\mathtt{Concat}}}_{h}({{{\boldsymbol{A}}}}^{(h,2)}{{{\boldsymbol{V}}}}^{h});\,{{{\boldsymbol{o}}}}^{2}={{\mathtt{Linear}}}_{3}({{{\boldsymbol{o}}}}^{2});\\ {{\Delta }}{{\boldsymbol{c}}}^{\prime}=\, {\mathtt{Concat}}([{{{\boldsymbol{o}}}}^{0},{{{\boldsymbol{o}}}}^{1},{{{\boldsymbol{o}}}}^{2}]);\,{{\boldsymbol{c}}}^{\prime}={{\boldsymbol{c}}}+{{\Delta }}{{\boldsymbol{c}}}^{\prime} ;$$
(7)
where \({{\Delta }}{{\boldsymbol{c}}}^{\prime}\) is the predicted coordinate updates and \({{\boldsymbol{c}}}^{\prime}\) is the predicted coordinates.
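Putting Eqs. (6) and (7) together, a hedged PyTorch sketch of the position prediction head is shown below. The three per-axis projections correspond to Linear1, Linear2, and Linear3 in Eq. (7); the head count and all names are illustrative assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F


class PositionPredictionHead(nn.Module):
    """Attention weights modulated by pairwise delta coordinates, Eqs. (6)-(7)."""

    def __init__(self, d_x, d_p, num_heads):
        super().__init__()
        self.h, self.d_h = num_heads, d_x // num_heads
        self.q = nn.Linear(d_x, d_x, bias=False)
        self.k = nn.Linear(d_x, d_x, bias=False)
        self.v = nn.Linear(d_x, d_x, bias=False)
        self.bias = nn.Linear(d_p, num_heads, bias=False)
        self.axis_proj = nn.ModuleList(nn.Linear(d_x, 1) for _ in range(3))  # Linear_1..3

    def forward(self, x, p, c):              # x: (n, d_x), p: (n, n, d_p), c: (n, 3)
        n = x.shape[0]
        q = self.q(x).view(n, self.h, self.d_h).transpose(0, 1)    # (H, n, d_h)
        k = self.k(x).view(n, self.h, self.d_h).transpose(0, 1)
        v = self.v(x).view(n, self.h, self.d_h).transpose(0, 1)
        b = self.bias(p).permute(2, 0, 1)                           # (H, n, n)
        attn = F.softmax(q @ k.transpose(-1, -2) / self.d_h ** 0.5 + b, dim=-1)
        delta = c[:, None, :] - c[None, :, :]                       # (n, n, 3), delta c_ij
        updates = []
        for axis in range(3):                                       # X, Y, Z
            a_axis = attn * delta[..., axis]                        # A^(h, axis), element-wise
            o = (a_axis @ v).transpose(0, 1).reshape(n, -1)         # concat heads -> (n, d_x)
            updates.append(self.axis_proj[axis](o))                 # (n, 1)
        return c + torch.cat(updates, dim=-1)                       # c' = c + delta c'
```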
As described in the above formula, the coordinate prediction head used in our study does not inherently enforce strict equivariance. This challenge can be addressed through one of two strategies: (1) strict equivariance can be achieved by sharing the parameters across the three linear layers in Eq. (7) (Linear1, Linear2, and Linear3) and eliminating the bias terms within these layers; (2) the model’s robustness to spatial transformations can be enhanced by applying random rotations to the input coordinates as a form of data augmentation. During our experimental phase, both techniques were tested. The latter approach, data augmentation via random rotations, yielded better accuracy in quantum chemistry property predictions and was thus selected for our model architecture. In this case, empirical evidence suggests that with a sufficiently large training dataset, such as PCQM4MV2, the model naturally tends towards an equivariant state. Specifically, our observations indicate that the parameters of the three linear layers converge to nearly identical values, and the bias terms asymptotically approach zero, with the discrepancies being marginal (on the order of \({10}^{-4}\)).
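A sketch of the second strategy, random-rotation data augmentation, is shown below. Sampling the rotation via a QR decomposition of a random Gaussian matrix is one common recipe and is an assumption here, not necessarily the exact procedure used in training.

```python
import torch


def random_rotation_augment(*coords):
    """Apply the same random rotation to all given (n, 3) coordinate tensors,
    e.g. the input conformation and the target equilibrium conformation."""
    q, r = torch.linalg.qr(torch.randn(3, 3))
    q = q * torch.sign(torch.diagonal(r))   # fix the sign ambiguity of the QR factors
    if torch.det(q) < 0:                    # ensure a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return tuple(c @ q for c in coords)
```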
Training strategy
In DFT conformation optimization or Molecular Dynamics simulations, a conformation is optimized step-by-step, resulting in a trajectory from a raw conformation to the equilibrium conformation in Euclidean space. However, saving such a trajectory is expensive, and publicly available datasets usually provide only the equilibrium conformations. Providing a trajectory would be beneficial, as intermediate states could be used as data augmentation to guide the model’s training. Inspired by this, we propose a novel training approach, which first generates a pseudo trajectory, samples a conformation from it, and uses the sampled conformation as input to predict the equilibrium conformation. This approach allows us to better exploit the information in the molecular data, which we found greatly improves the model’s performance. Specifically, we assume that the trajectory from a raw conformation \({{{\boldsymbol{r}}}}^{{{\rm{init}}}}\) to a target equilibrium conformation \({{{\boldsymbol{r}}}}^{{{\rm{tgt}}}}\) is a linear process. We generate an intermediate conformation along this trajectory via noisy interpolation, i.e.,
$${{{\boldsymbol{r}}}}^{(0)}=q{{{\boldsymbol{r}}}}^{{{\rm{init}}}}+(1-q)\left({{{\boldsymbol{r}}}}^{{{\rm{tgt}}}}+{{\boldsymbol{c}}}\right),$$
(8)
where the scalar q ranges from 0 to 1 and the Gaussian noise \({{\boldsymbol{c}}}\in {{\mathbb{R}}}^{n\times 3}\) has a mean of 0 and a standard deviation of υ (a hyper-parameter). Taking r(0) as input, Uni-Mol+ learns to update it towards the target equilibrium conformation \({{{\boldsymbol{r}}}}^{{{\rm{tgt}}}}\). During inference, q is set to 1.0 by default. However, during training, simply sampling q from a uniform distribution over [0.0, 1.0] may cause (1) a distributional shift between training and inference, due to the infrequent sampling of q = 1.0 (the RDKit-generated conformation), and (2) an inability to learn an accurate mapping from the equilibrium conformation to the QC properties, as q = 0.0 (the target conformation) is also rarely sampled. Therefore, we employ a mixture of Bernoulli and Uniform distributions to assign higher sampling probabilities to q = 1.0 and q = 0.0 while still sampling interpolated values. The above process is illustrated in Fig. 1c in the Supplementary Information.
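The training-time sampling of the input conformation r(0) can be sketched as follows. The mixture weights assigned to q = 1.0 and q = 0.0 and the noise scale υ are illustrative placeholders, not the values used in the paper.

```python
import torch


def sample_input_conformation(r_init, r_tgt, p_init=0.4, p_tgt=0.2, noise_std=0.2):
    """Sample r^(0) via Eq. (8): a Bernoulli/Uniform mixture over q plus Gaussian noise."""
    u = torch.rand(())
    if u < p_init:
        q = torch.tensor(1.0)     # RDKit-generated raw conformation
    elif u < p_init + p_tgt:
        q = torch.tensor(0.0)     # (noised) equilibrium conformation
    else:
        q = torch.rand(())        # uniform interpolation in between
    noise = noise_std * torch.randn_like(r_tgt)
    return q * r_init + (1 - q) * (r_tgt + noise)
```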
The model takes r(0) as input and generates r(R) after R iterations. The model then takes r(R) as input to predict the QC properties. An L1 loss is applied to both the QC property regression and the 3D coordinate prediction. All losses are computed solely on the final conformer from the last iteration.
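A minimal sketch of the resulting objective, where the relative weighting of the two L1 terms is an assumption:

```python
import torch.nn.functional as F


def training_loss(y_pred, y_true, r_pred, r_tgt, coord_weight=1.0):
    """L1 losses on the QC property and on the final-iteration coordinates."""
    return F.l1_loss(y_pred, y_true) + coord_weight * F.l1_loss(r_pred, r_tgt)
```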