Philipp Renz a, Dries Van Rompaey b, Jörg Kurt Wegner b, Sepp Hochreiter a, Günter Klambauer a
a LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Altenberger Strasse 69, A-4040 Linz, Austria
b High Dimensional Biology and Discovery Data Sciences, Janssen Research & Development, Janssen Pharmaceutica N.V., Turnhoutseweg 30, Beerse B-2340, Belgium
The paper can be found here: https://www.sciencedirect.com/science/article/pii/S1740674920300159
Feel free to send questions to [email protected].
The main points of the paper are:
- By making only tiny changes to training-set molecules we obtain high scores on the GuacaMol benchmarks and beat all models except an LSTM. This relies on the fact that the novelty metric is rather permissive.
- Molecular optimizers are often used to find compounds that are scored highly by machine learning models. We show that such high scores can stem from the optimizer overfitting to the scoring model (see the sketch below).
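The second point is based on comparing scores under the model used during optimization with scores under a control model fitted on a disjoint data split. The snippet below is only a minimal sketch of that comparison, assuming Morgan-fingerprint features and scikit-learn random forests; the toy data and the helper function are illustrative and not taken from the repository.

```python
# Minimal sketch of the optimization-vs-control comparison (toy data;
# not the repository's actual code).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def fingerprints(smiles_list, radius=2, n_bits=2048):
    """ECFP-like bit vectors for a list of SMILES, as a 2D numpy array."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fps.append(np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)))
    return np.array(fps)

# Two disjoint splits of the same (toy) assay.
opt_smiles, opt_labels = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"], [1, 0, 1, 0]
ctrl_smiles, ctrl_labels = ["CCCO", "CCCN", "Oc1ccccc1", "CC(=O)N"], [1, 0, 1, 0]

model_opt = RandomForestClassifier(n_estimators=100, random_state=0).fit(fingerprints(opt_smiles), opt_labels)
model_ctrl = RandomForestClassifier(n_estimators=100, random_state=0).fit(fingerprints(ctrl_smiles), ctrl_labels)

# Molecules proposed by an optimizer that maximized model_opt's output.
generated_smiles = ["CCOC", "CCNC"]
X_gen = fingerprints(generated_smiles)
opt_scores = model_opt.predict_proba(X_gen)[:, 1]
ctrl_scores = model_ctrl.predict_proba(X_gen)[:, 1]

# A large gap between opt_scores and ctrl_scores suggests the optimizer
# exploited peculiarities of the scoring model rather than finding
# genuinely active molecules.
print(opt_scores, ctrl_scores)
```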
Steps to reproduce the paper:
```bash
pip install -r requirements.txt
conda install rdkit -c rdkit
```
Install cddd by following the instructions on https://github.com/jrwnter/cddd and download the pretrained model:

```bash
wget https://raw.githubusercontent.com/jrwnter/cddd/master/download_default_model.sh -O- -q | bash
```
Download the GuacaMol datasets. The compounds are used for distribution learning and as starting populations for the graph-based genetic algorithm.
```bash
mkdir data
wget -O data/guacamol_v1_all.smiles https://ndownloader.figshare.com/files/13612745
wget -O data/guacamol_v1_test.smiles https://ndownloader.figshare.com/files/13612757
wget -O data/guacamol_v1_valid.smiles https://ndownloader.figshare.com/files/13612766
wget -O data/guacamol_v1_train.smiles https://ndownloader.figshare.com/files/13612760
```
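The downloaded `.smiles` files are plain text with one SMILES string per line, so they can be read with a few lines of Python; the path below matches the download command above.

```python
# Read the GuacaMol training set (one SMILES string per line).
with open("data/guacamol_v1_train.smiles") as f:
    train_smiles = [line.strip() for line in f if line.strip()]
print(len(train_smiles), "training molecules")
```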
The csv files downloaded from ChEMBL are located in `assays/raw`. Running the `preprocess.py` script transforms the data into binary classification tasks and stores them in `assays/processed`.
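The exact binarization logic lives in `preprocess.py`; the following is only an illustrative sketch of turning an assay CSV into a binary classification task, assuming hypothetical column names (`Smiles`, `pChEMBL Value`), a `;` separator, and an arbitrary activity threshold.

```python
# Illustrative sketch only -- column names, separator, and threshold are
# assumptions; see preprocess.py for the actual logic used in the paper.
import pandas as pd

def binarize_assay(csv_path, threshold=5.0):
    """Turn an activity table into SMILES/label pairs."""
    df = pd.read_csv(csv_path, sep=";")                    # ChEMBL web exports are often ';'-separated
    df = df.dropna(subset=["Smiles", "pChEMBL Value"])     # keep rows with structure and measurement
    df["label"] = (df["pChEMBL Value"] >= threshold).astype(int)
    return df[["Smiles", "label"]]
```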
For the distribution-learning experiment (AddCarbon model) it suffices to run `addcarbon.py`.
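The AddCarbon model makes exactly the kind of tiny change mentioned above: it inserts a single carbon atom at a random position of a training SMILES and keeps the result if it is still a valid molecule. Below is a minimal sketch of this idea; the actual implementation is in `addcarbon.py`, and the function name here is illustrative.

```python
# Sketch of the AddCarbon idea: insert a "C" at a random position of a
# training SMILES and keep the first valid result. Illustrative only;
# see addcarbon.py for the actual implementation.
import random
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.error")  # silence parse errors for invalid candidates

def add_carbon(smiles, rng, max_tries=20):
    """Return a slightly modified but valid SMILES, or None if all tries fail."""
    for _ in range(max_tries):
        pos = rng.randrange(len(smiles) + 1)
        candidate = smiles[:pos] + "C" + smiles[pos:]
        mol = Chem.MolFromSmiles(candidate)
        if mol is not None:
            return Chem.MolToSmiles(mol)
    return None

rng = random.Random(0)
print(add_carbon("c1ccccc1O", rng))  # a near-duplicate of the training molecule
```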
For the goal-directed generation benchmarks, a few more steps are needed:
- `preprocess.py`: Preprocesses the data to obtain binary classification tasks.
- `run_goal_directed.py`: Runs all the molecular optimization experiments.
- `predictions.py`: Fits a classifier multiple times with different random seeds, mainly to estimate the optimization/control score combinations of split 1 actives. The results are used to obtain the contours in the scatter plots (Fig. 2, S1).
- `plots.ipynb`: Notebook to create most of the plots in the paper.
- `nearest_neighbours.ipynb`: Notebook to calculate nearest-neighbour distances and to create Fig. S4 (histograms over Tanimoto similarities).
Special thanks go to the authors of GuacaMol (Paper / Github). Their code was very helpful in implementing our experiments.