Philipp Renz a, Dries Van Rompaey b, Jörg Kurt Wegner b, Sepp Hochreiter a, Günter Klambauer a
a LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Altenberger Strasse 69, A-4040 Linz, Austria
b High Dimensional Biology and Discovery Data Sciences, Janssen Research & Development, Janssen Pharmaceutica N.V., Turnhoutseweg 30, Beerse B-2340, Belgium
The paper can be found here: https://www.sciencedirect.com/science/article/pii/S1740674920300159
Feel free to send questions to [email protected].
The main points of the paper are:
- By making only tiny changes to training-set molecules we obtain high scores on the GuacaMol benchmarks and beat all models except an LSTM. This relies on the fact that the novelty metric is rather permissive.
- Molecular optimizers are often used to find compounds that are scored highly by machine learning models. We show that such high scores can stem from the optimizer overfitting to the scoring model (see the sketch below).
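The second point is based on comparing scores under the model used during optimization with scores under a control model fitted on a disjoint data split. The snippet below is only a minimal sketch of that comparison, assuming Morgan-fingerprint features and scikit-learn random forests; the toy data and the helper function are illustrative and not taken from the repository.

```python
# Minimal sketch of the optimization-vs-control comparison (toy data;
# not the repository's actual code).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def fingerprints(smiles_list, radius=2, n_bits=2048):
    """ECFP-like bit vectors for a list of SMILES, as a 2D numpy array."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fps.append(np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)))
    return np.array(fps)

# Two disjoint splits of the same (toy) assay.
opt_smiles, opt_labels = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"], [1, 0, 1, 0]
ctrl_smiles, ctrl_labels = ["CCCO", "CCCN", "Oc1ccccc1", "CC(=O)N"], [1, 0, 1, 0]

model_opt = RandomForestClassifier(n_estimators=100, random_state=0).fit(fingerprints(opt_smiles), opt_labels)
model_ctrl = RandomForestClassifier(n_estimators=100, random_state=0).fit(fingerprints(ctrl_smiles), ctrl_labels)

# Molecules proposed by an optimizer that maximized model_opt's output.
generated_smiles = ["CCOC", "CCNC"]
X_gen = fingerprints(generated_smiles)
opt_scores = model_opt.predict_proba(X_gen)[:, 1]
ctrl_scores = model_ctrl.predict_proba(X_gen)[:, 1]

# A large gap between opt_scores and ctrl_scores suggests the optimizer
# exploited peculiarities of the scoring model rather than finding
# genuinely active molecules.
print(opt_scores, ctrl_scores)
```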
Steps to reproduce the paper:
```bash
pip install -r requirements.txt
conda install rdkit -c rdkit
```
Install cddd by following the instructions on https://github.com/jrwnter/cddd and download the pretrained model:

```bash
wget https://raw.githubusercontent.com/jrwnter/cddd/master/download_default_model.sh -O- -q | bash
```
Download the GuacaMol datasets. The compounds are used for distribution learning and as starting populations for the graph-based genetic algorithm.
```bash
mkdir data
wget -O data/guacamol_v1_all.smiles https://ndownloader.figshare.com/files/13612745
wget -O data/guacamol_v1_test.smiles https://ndownloader.figshare.com/files/13612757
wget -O data/guacamol_v1_valid.smiles https://ndownloader.figshare.com/files/13612766
wget -O data/guacamol_v1_train.smiles https://ndownloader.figshare.com/files/13612760
```
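The downloaded `.smiles` files are plain text with one SMILES string per line, so they can be read with a few lines of Python; the path below matches the download command above.

```python
# Read the GuacaMol training set (one SMILES string per line).
with open("data/guacamol_v1_train.smiles") as f:
    train_smiles = [line.strip() for line in f if line.strip()]
print(len(train_smiles), "training molecules")
```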
The csv files downloaded from ChEMBL are located in `assays/raw`. Running the `preprocess.py` script transforms the data into binary classification tasks and stores them in `assays/processed`.
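The exact binarization logic lives in `preprocess.py`; the following is only an illustrative sketch of turning an assay CSV into a binary classification task, assuming hypothetical column names (`Smiles`, `pChEMBL Value`), a `;` separator, and an arbitrary activity threshold.

```python
# Illustrative sketch only -- column names, separator, and threshold are
# assumptions; see preprocess.py for the actual logic used in the paper.
import pandas as pd

def binarize_assay(csv_path, threshold=5.0):
    """Turn an activity table into SMILES/label pairs."""
    df = pd.read_csv(csv_path, sep=";")                    # ChEMBL web exports are often ';'-separated
    df = df.dropna(subset=["Smiles", "pChEMBL Value"])     # keep rows with structure and measurement
    df["label"] = (df["pChEMBL Value"] >= threshold).astype(int)
    return df[["Smiles", "label"]]
```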
For the distribution-learning experiment (AddCarbon model) it suffices to run `addcarbon.py`.
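The AddCarbon model makes exactly the kind of tiny change mentioned above: it inserts a single carbon atom at a random position of a training SMILES and keeps the result if it is still a valid molecule. Below is a minimal sketch of this idea; the actual implementation is in `addcarbon.py`, and the function name here is illustrative.

```python
# Sketch of the AddCarbon idea: insert a "C" at a random position of a
# training SMILES and keep the first valid result. Illustrative only;
# see addcarbon.py for the actual implementation.
import random
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.error")  # silence parse errors for invalid candidates

def add_carbon(smiles, rng, max_tries=20):
    """Return a slightly modified but valid SMILES, or None if all tries fail."""
    for _ in range(max_tries):
        pos = rng.randrange(len(smiles) + 1)
        candidate = smiles[:pos] + "C" + smiles[pos:]
        mol = Chem.MolFromSmiles(candidate)
        if mol is not None:
            return Chem.MolToSmiles(mol)
    return None

rng = random.Random(0)
print(add_carbon("c1ccccc1O", rng))  # a near-duplicate of the training molecule
```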
For the goal-directed generation benchmarks, a few more steps are needed:
- `preprocess.py`: Preprocesses the data to obtain binary classification tasks.
- `run_goal_directed.py`: Runs all the molecular optimization experiments.
- `predictions.py`: Fits a classifier multiple times with different random seeds, mainly to estimate the optimization/control score combinations of split 1 actives. The results are used to obtain the contours in the scatter plots (Fig. 2, S1).
- `plots.ipynb`: Notebook to create most of the plots in the paper.
- `nearest_neighbours.ipynb`: Notebook to calculate nearest-neighbour distances and to create Fig. S4 (histograms over Tanimoto similarities).
Special thanks go to the authors of GuacaMol (Paper / Github). Their code was very helpful in implementing our experiments.