Shows some of the ways molecule generation and optimization can go wrong

ml-jku/mgenerators-failure-modes

On failure modes of molecule generators and optimizers

Philipp Renz a, Dries Van Rompaey b, Jörg Kurt Wegner b, Sepp Hochreiter a, Günter Klambauer a,

a LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Altenberger Strasse 69, A-4040 Linz, Austria
b High Dimensional Biology and Discovery Data Sciences, Janssen Research & Development, Janssen Pharmaceutica N.V., Turnhoutseweg 30, Beerse B-2340, Belgium

The paper can be found here: https://www.sciencedirect.com/science/article/pii/S1740674920300159

Feel free to send questions to [email protected].

TL;DR

The main points of the paper are:

  • Making only tiny changes to training-set molecules yields high scores on the GuacaMol distribution-learning benchmark, beating all models except an LSTM. This works because the novelty metric is rather permissive.
  • Molecular optimizers are often used to find samples that score highly under machine-learning models. We show that these high scores can overfit to the scoring model: samples that score well under the model they were optimized against may score much lower under an independently trained control model.
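The second point can be illustrated with a toy example that is not from the paper: two made-up linear "models" trained on the same underlying signal but with independent noise. Greedily optimizing a sample against one of them inflates that model's score relative to the control model.

```python
import random

rng = random.Random(0)
DIM = 500

# Hidden "true activity" weights, and two surrogate models A and B that
# each see the truth only through their own independent noise.
w_true = [rng.gauss(0, 1) for _ in range(DIM)]
w_a = [w + rng.gauss(0, 1) for w in w_true]
w_b = [w + rng.gauss(0, 1) for w in w_true]

def score(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

# "Optimizer": greedily pick the bit pattern that maximizes model A's score.
x_opt = [1 if wi > 0 else 0 for wi in w_a]

# The optimized sample looks great to the model it was optimized against,
# but noticeably worse to the independent control model B.
print(score(x_opt, w_a), score(x_opt, w_b))
```

The gap between the two scores is exactly the part of model A's noise that the optimizer has exploited; the control model does not share that noise, so it reports a lower score.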

Code

Steps to reproduce the paper:

Install dependencies

pip install -r requirements.txt
conda install rdkit -c rdkit

Install cddd by following the instructions at https://github.com/jrwnter/cddd and download the pretrained model:

wget https://raw.githubusercontent.com/jrwnter/cddd/master/download_default_model.sh -O- -q | bash

Download Guacamol data splits

The compounds are used for distribution learning and as starting populations for the graph-based genetic algorithm.

mkdir data
wget -O data/guacamol_v1_all.smiles https://ndownloader.figshare.com/files/13612745
wget -O data/guacamol_v1_test.smiles https://ndownloader.figshare.com/files/13612757
wget -O data/guacamol_v1_valid.smiles https://ndownloader.figshare.com/files/13612766
wget -O data/guacamol_v1_train.smiles https://ndownloader.figshare.com/files/13612760

Bioactivity data

The CSV files downloaded from ChEMBL are located in assays/raw. Running the preprocess.py script transforms the data into binary classification tasks and stores them in assays/processed.
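The binarization step can be sketched as follows. The column names and the activity threshold here are illustrative assumptions, not necessarily the exact ones used by preprocess.py:

```python
import csv
import io

# Hypothetical cutoff: pChEMBL >= 5 counts as active (an assumption for
# illustration, not necessarily the threshold used in preprocess.py).
THRESHOLD = 5.0

def binarize(rows, value_col="pchembl_value"):
    """Turn continuous activity values into 0/1 classification labels."""
    labeled = []
    for row in rows:
        value = row.get(value_col)
        if not value:  # skip compounds without a measured value
            continue
        labeled.append((row["canonical_smiles"], int(float(value) >= THRESHOLD)))
    return labeled

raw = io.StringIO(
    "canonical_smiles,pchembl_value\nCCO,6.2\nCCN,4.1\nCCC,\n"
)
print(binarize(csv.DictReader(raw)))  # [('CCO', 1), ('CCN', 0)]
```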

Experiments

For the distribution-learning experiment (AddCarbon model), it suffices to run addcarbon.py.
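The core idea behind AddCarbon can be sketched in a few lines: take a training-set SMILES string and insert a single carbon atom at a random position, producing a molecule that differs from the training set by one atom yet typically still counts as "novel". This is a minimal sketch of the string manipulation only; a validity check on the result (e.g. with RDKit), as a script like addcarbon.py would need, is omitted here.

```python
import random

def add_carbon(smiles, rng=random.Random(0)):
    """Insert a single 'C' at a random position in a SMILES string.

    Sketch of the AddCarbon idea: the output differs from a training
    molecule by one atom, which can be enough to pass a permissive
    novelty metric. A validity check (e.g. parsing the result with
    RDKit) is omitted in this sketch.
    """
    pos = rng.randrange(len(smiles) + 1)
    return smiles[:pos] + "C" + smiles[pos:]

candidate = add_carbon("CCO")  # start from ethanol
```

In practice not every insertion position yields a valid SMILES, so the full model would retry until the modified string parses.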

For the goal-directed generation benchmarks, several steps are needed:

  1. preprocess.py: Preprocess the data to obtain binary classification tasks.
  2. run_goal_directed.py: This runs all the molecular optimization experiments.
  3. predictions.py: Fits a classifier multiple times with different random seeds, mainly to estimate the optimization/control score combinations of split-1 actives. The results are used to draw the contours in the scatter plots (Fig. 2, S1).
  4. plots.ipynb: Notebook to create most of the plots in the paper.
  5. nearest_neighbours.ipynb: Notebook to calculate nearest-neighbour distances and to create Fig. S4 (histograms over Tanimoto similarities).

Special thanks

Special thanks go to the authors of GuacaMol (Paper / GitHub). Their code was very helpful in implementing our experiments.
