How well can molecules be generated by AI?

Deep learning and artificial intelligence facilitate advances in all areas of science, including cheminformatics. One of the most impressive advances in this area to date has been deep molecular generative models, which Josep Arús-Pous and Ola Engkvist discuss.

Deep molecular generative models… enable large amounts of molecules to be generated from a small set of known ones

One of the main goals of cheminformatics is to design novel molecules by exploring the whole chemical space. This virtual chemical space comprises all possible molecules and is estimated by some to contain around 10^60 molecules.

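Until recently, chemical space exploration was carried out by generating molecules with genetic algorithms or by joining smaller molecules together to form larger ones.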

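The Generated DataBase (GDB) project is a different, still ongoing approach that aims to enumerate all molecules up to a certain number of non-hydrogen atoms. To date, the drug-like chemical space of up to 11, 13 and 17 atoms has been enumerated, yielding databases of 25 million, 1 billion and 165 billion molecules, respectively.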

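These huge databases can then be searched to find interesting molecules. As the chemical space grows exponentially with the number of atoms, exploring larger regions of it becomes increasingly difficult.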

As in many disciplines of science, deep learning and artificial intelligence (AI) have opened new perspectives in cheminformatics. For example, we can train neural networks with data from heterogeneous sources to predict the toxicity of molecules, or we can design algorithms that obtain all the steps required to synthesize any molecule.

But one of the most impressive advances to date has been deep molecular generative models, which enable large amounts of molecules to be generated from a small set of known ones.

In particular, recurrent neural networks (RNNs) trained with molecules represented in a text format called SMILES have proven very successful at exploring the chemical space.

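They work in a similar way to the autocomplete feature of a mobile phone keyboard, building molecules character by character while taking into account the part of the molecule that has already been generated (Figure 1). As the model learns how to create molecules only from the training data, the generated molecules tend to show the same properties as the molecules in the training set.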

Figure 1: Process of generating aspirin using a recurrent neural network (RNN) and the SMILES syntax. The RNN creates the molecule character by character and samples the next character from a probability distribution of all possible characters. Depending on what has been generated before, the model changes the probability of different characters being sampled.
© The Authors

Therefore, the chemical space around the training set can be explored by repeatedly generating (also called sampling) molecules from an RNN.
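The character-by-character sampling described above can be illustrated with a minimal sketch. It does not use an RNN: the table `NEXT_CHAR_PROBS` is a hypothetical, hand-written stand-in for a trained model, and it conditions only on the previous character rather than on everything generated so far. The probabilities, the `^` and `$` start and end markers, and the function name are all assumptions made for illustration, and the resulting strings are not checked for chemical validity.

```python
import random

# Hypothetical next-character probabilities, standing in for a trained RNN.
# A real model conditions on the whole prefix; this toy table only looks at
# the previous character. "^" marks the start and "$" the end of a string.
NEXT_CHAR_PROBS = {
    "^": {"C": 0.8, "O": 0.1, "N": 0.1},
    "C": {"C": 0.4, "O": 0.2, "N": 0.1, "=": 0.1, "$": 0.2},
    "O": {"C": 0.6, "$": 0.4},
    "N": {"C": 0.7, "$": 0.3},
    "=": {"C": 0.5, "O": 0.5},
}


def sample_string(rng, max_len=20):
    """Build one string character by character until the end token is drawn."""
    prev, chars = "^", []
    while len(chars) < max_len:
        options = NEXT_CHAR_PROBS[prev]
        # Draw the next character from the distribution for the current state.
        nxt = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        if nxt == "$":
            break
        chars.append(nxt)
        prev = nxt
    return "".join(chars)


rng = random.Random(0)
samples = [sample_string(rng) for _ in range(1000)]
print(samples[:5])
print(len(set(samples)), "unique strings out of", len(samples))
```

Because every string is drawn independently from the same distribution, short and highly probable strings turn up many times, so the number of unique strings is smaller than the number of samples.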

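One of the drawbacks of molecular generative models is that they are sampled with replacement: when a model is sampled many times, repeated molecules can appear. This can lead to situations in which a model seems to generate a wide variety of molecules but is in fact generating the same set over and over again.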

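Alternatively, a model can generate a large number of molecules that are completely unrelated to the training set, which means that the model is not learning from the training set molecules. In our research, we developed a benchmarking method that is able to detect both situations.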

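To do so, we trained models on a subset of the previously mentioned GDB-13 molecular database, sampled each model 2 billion times, and calculated how many of the unique generated molecules were part of GDB-13 and how many were outside it. Models were then ranked by how much of the whole database they could generate when learning from a small sample.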
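The bookkeeping behind such a benchmark can be sketched in a few lines. This is a simplified illustration rather than the published method: the function name `benchmark`, the report keys and the five-molecule reference set are hypothetical, and the sketch assumes every molecule is identified by a single canonical string, so strings sampled from a randomized SMILES model would need to be converted to a canonical form before being counted.

```python
def benchmark(sampled, reference):
    """Summarise a sample against a reference database of molecule identifiers."""
    unique = set(sampled)          # sampling with replacement yields repeats
    inside = unique & reference    # unique molecules that belong to the database
    outside = unique - reference   # unique molecules that do not
    return {
        "sampled": len(sampled),
        "unique": len(unique),
        "inside": len(inside),
        "outside": len(outside),
        "coverage": len(inside) / len(reference),
    }


# Toy stand-in for GDB-13 and for the output of a generative model.
reference_db = {"CCO", "CCN", "CCC", "COC", "CNC"}
sampled = ["CCO", "CCO", "CCN", "CCO", "CCCl", "CCN", "COC"]

report = benchmark(sampled, reference_db)
print(report)
```

In this toy case, seven samples contain four unique molecules, three of which are in the reference set, giving a coverage of 60%. A model that keeps repeating itself shows up as a low unique count, and a model that ignores its training data shows up as a high outside count.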

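We used this method to benchmark several molecular generative models with two ways of representing molecules: canonical and randomized SMILES. This text-based representation is obtained by numbering the atoms in a molecule and then traversing them, adding a "C" for carbon, an "O" for oxygen, and so on (Figure 2).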

Figure 2: Generation process of the SMILES string of aspirin. Note that different SMILES can be obtained by changing the atom ordering.
© The Authors

By default, most chemistry software calculates a unique atom numbering, called canonical ordering, and the canonical SMILES is generated from it. When this atom numbering is randomized, different SMILES strings that represent the same molecule can be obtained, thus obtaining randomized SMILES.

During training, models using canonical SMILES always use the same representation for each molecule, whereas those using randomized SMILES keep changing it. This allows the models to see the same molecule from different angles, learning different information every time.
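To make the idea of randomized SMILES concrete, here is a minimal sketch that writes SMILES strings for acetic acid from a hand-coded molecular graph. It is a toy writer, not the algorithm used by chemistry software: it only handles acyclic molecules (no ring closures, aromaticity or stereochemistry), the helper names are hypothetical, and the fixed atom ordering merely stands in for a true canonical ordering.

```python
import random

# Acetic acid as a toy molecular graph: atom symbols plus bonds with bond orders.
ATOMS = ["C", "C", "O", "O"]
BONDS = {(0, 1): 1, (1, 2): 2, (1, 3): 1}
BOND_SYMBOL = {1: "", 2: "="}


def neighbours(atom):
    """Return (neighbour, bond order) pairs for one atom."""
    out = []
    for (a, b), order in BONDS.items():
        if a == atom:
            out.append((b, order))
        elif b == atom:
            out.append((a, order))
    return out


def write_smiles(atom, parent, rng):
    """Depth-first traversal; all but the last branch go inside parentheses."""
    branches = [(n, o) for n, o in neighbours(atom) if n != parent]
    if rng is not None:
        rng.shuffle(branches)  # randomize the order in which branches are visited
    text = ATOMS[atom]
    for i, (nxt, order) in enumerate(branches):
        piece = BOND_SYMBOL[order] + write_smiles(nxt, atom, rng)
        text += piece if i == len(branches) - 1 else "(" + piece + ")"
    return text


def randomized_smiles(rng):
    """Start from a random atom and traverse the graph in a random order."""
    start = rng.randrange(len(ATOMS))
    return write_smiles(start, None, rng)


# One fixed atom ordering (a stand-in for the canonical one) gives one string.
fixed = write_smiles(0, None, None)
print(fixed)  # CC(=O)O

# Randomizing the start atom and branch order gives many strings, one molecule.
rng = random.Random(0)
variants = {randomized_smiles(rng) for _ in range(200)}
print(sorted(variants))
```

Every string in `variants` describes the same four-atom molecule, which is why a model trained on randomized SMILES sees several different texts for each training molecule.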

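When models trained with one million GDB-13 molecules (0.1% of the database) were compared using canonical and randomized SMILES, results showed that canonical SMILES models were able to generate up to 70% of GDB-13, whereas randomized SMILES models performed better, obtaining up to 83% of the whole database.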

Smaller training sets were also used, and when a randomized SMILES model was trained with 1000 molecules (0.0001% of GDB-13), 34% of the whole database was obtained compared to only 14% generated with a canonical SMILES model.

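Given the improvements obtained with randomized SMILES, a last experiment was performed with models trained on the ChEMBL database. This database holds a selection of 1.5 million molecules obtained from the literature, which represent a sample of the whole known drug-like chemical space.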

Results showed that models trained with randomized SMILES generated circa 1.3 billion unique molecules from a 2 billion sample, which was nearly double the number obtained with canonical SMILES.

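Moreover, we also showed that any molecule generated by the canonical SMILES model could also be generated by the randomized SMILES model, but not vice versa.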

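In conclusion, generative models trained with small samples of molecules are able to probabilistically hold large amounts of the chemical space. This space can then be explored through extensive sampling and filtering, or by using techniques such as reinforcement learning.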

Having architectures that can convey as much information as possible from small training sets to the generated molecules is extremely important in drug discovery, and randomized SMILES are an improvement that helps in that direction.

These methods are currently used in many drug discovery projects at AstraZeneca, and we hope to report back to the scientific community on the results of applying them to in-house projects in due course.
