A ChEMBL-og post, "Multitasking Neural Networks on ChEMBL Using PyTorch 1.0 and RDKit," written by Eloy in 2019, showed how to train a multitask neural network for bioactivity prediction using data from ChEMBL. Specifically, the model predicts the targets against which a particular molecule is likely to be bioactive. Eloy's blog post has links to more information, but multitask neural networks are very interesting: because of the way information is shared between tasks during training, the predictions for individual tasks can be more accurate than those from a model built for just that one task.
This is quite different from most humans, whose performance tends to decrease when they start multitasking. In any case, this is an interesting problem, and Eloy provided all the code needed to retrieve the data from ChEMBL and reproduce his work, so I decided to pick this up and build a KNIME workflow that uses the multitask model. Since I didn't have to spend a lot of time preparing the data (thanks, Eloy!), I was able to use Eloy's Jupyter notebook directly to train and validate the model. After letting my workstation run for a while, I had a trained model; all that was left was to build the prediction workflow.
Load the network and generate predictions
Eloy's notebook builds the multitask neural network using PyTorch, which KNIME doesn't support directly. Fortunately, both tools support the ONNX (Open Neural Network Exchange) format for exchanging trained networks between neural network toolkits. I was therefore able to export the trained PyTorch bioactivity model to ONNX, load it into KNIME with the ONNX Network Reader node, convert it to a TensorFlow network with the ONNX to TensorFlow Network Converter node, and generate predictions with the TensorFlow Network Executor node.
Now that we have loaded the trained network into KNIME, we need to create the correct inputs. The model uses RDKit fingerprints, which is very simple with the RDKit KNIME integration.
We know that the model was trained using the RDKit Morgan fingerprint with radius 2 and length 1024 bits, so we generate the same fingerprint with the RDKit Fingerprint node. Since the fingerprint cannot be passed directly to the neural network, we use an Expand Bit Vector node to turn the individual bits of the fingerprint into columns of the input table. The compounds for which fingerprints are generated are read from a text file containing SMILES and a column of compound IDs used as names. The sample dataset used in this blog post (and the sample workflow) consists of a set of molecules exported from ChEMBL plus some invented compounds created by manually editing the ChEMBL molecules.
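For reference, the same fingerprint-plus-expansion step looks like this in plain RDKit. This is a sketch of what the two KNIME nodes do, using aspirin as an arbitrary example molecule.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, just as an example
mol = Chem.MolFromSmiles(smiles)

# Morgan fingerprint with radius 2 and 1024 bits -- the same
# parameters the model was trained with.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)

# The network expects one numeric input column per bit, so expand the
# bit vector into an array (the Expand Bit Vector node's equivalent).
bits = np.array(list(fp), dtype=np.float32)
```

Each element of `bits` becomes one input column of the table fed to the network.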
The output of the TensorFlow Network Executor node is a table containing one row for each molecule for which predictions were generated and one column for each of the 560 targets the model was trained on. Each cell contains the compound's score against the corresponding target (Figure 2).
At this point we have a minimal prediction workflow: the multitask neural network can be used to generate scores for new compounds. The rest of this post shows several ways to view the results and make them easier to interact with.
View predictions in an interactive heatmap
The first interactive view used to display the predictions from the multitask neural network includes a heatmap containing the predictions themselves and a tiled view showing the molecules for which the predictions were generated. The heatmap has compounds in rows and targets in columns, and the calculated score determines the color of each cell. The tile view is configured to show only the selected rows.
The Show Predictions as Heatmap component that exposes this interactive view is configured to pass only the selected rows to its output port, so in the example shown in Figure 3 there are only two rows in the component's output.
The workflow does a fair amount of data processing to build the heatmap. We won't go into the details here, but the main work is done in the "Reformat with Bi-Sort" metanode, which sorts both compounds and targets based on their median scores. This moves the targets with higher-scoring compounds to the left side of the heatmap and the compounds that score highly against more targets to the top. Qualitatively, the heatmap becomes redder as you pan up and to the left, and bluer as you pan down and to the right. There is no single best answer for the sort criterion here; if you want to try something other than the median, feel free to play with the settings of the Sorter nodes in the "Reformat with Bi-Sort" metanode.
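The bi-sort itself is simple to sketch outside KNIME. This minimal example (with made-up random scores) sorts the columns (targets) and rows (compounds) of a score matrix by their median scores, descending, which is the layout the metanode produces:

```python
import numpy as np

rng = np.random.default_rng(42)
scores = rng.normal(size=(6, 4))  # compounds (rows) x targets (cols)

# Targets with higher median scores move to the left...
col_order = np.argsort(-np.median(scores, axis=0))
# ...and compounds with higher median scores move to the top.
row_order = np.argsort(-np.median(scores, axis=1))

sorted_scores = scores[np.ix_(row_order, col_order)]
```

Swapping `np.median` for `np.mean` or `np.max` here corresponds to changing the sort criterion in the metanode.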
Comparing predictions with measured values
A good way to build trust in a model's predictions is to compare them to measured data. Normally this isn't possible, but sometimes measured data is available for the compounds for which predictions were generated. In those cases it's nice to display that measured data alongside the predictions. The rest of the workflow does just that (Figure 5).
It starts by generating InChI keys for the molecules in the prediction set, uses them to search ChEMBL via the ChEMBL REST API, and then uses the API again to find the relevant measured activity data for those compounds. A few years ago, Daria Goldman wrote a blog post, "A RESTful Way to Find and Retrieve Data," showing how to do this. I adapted the components she introduced in that post for this use case and combined everything in the "Get ChEMBL Data If It Exists" metanode.
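To make the lookup concrete, here is a sketch of the two kinds of requests the metanode issues against the public ChEMBL web services. The base URL and filter names reflect the ChEMBL REST API as I know it; treat the exact parameters as an assumption and check them against the API docs.

```python
# Base URL of the public ChEMBL REST API.
BASE = "https://www.ebi.ac.uk/chembl/api/data"

def molecule_lookup_url(inchi_key: str) -> str:
    """URL that searches ChEMBL for a molecule by its InChI key."""
    return (f"{BASE}/molecule.json"
            f"?molecule_structures__standard_inchi_key={inchi_key}")

def activity_lookup_url(chembl_id: str) -> str:
    """URL that retrieves measured activities for a ChEMBL compound."""
    return f"{BASE}/activity.json?molecule_chembl_id={chembl_id}"
```

In a live workflow these URLs would be fetched with HTTP GET (for example via `requests`) and the JSON responses parsed for ChEMBL IDs and pChEMBL values.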
The metanode's output table has one row per compound, the ChEMBL ID of each compound found in ChEMBL, and one column for each target that had an experimental value in ChEMBL for one of the compounds in the prediction set. This data can be visualized along with the predictions using the Show Predictions and Measurements component (Figure 6).
This interactive view is built primarily around the scatter plot at the top. Each point in the plot corresponds to one compound with data measured against one target. The target ChEMBL IDs are on the X-axis and the measured pChEMBL values (provided by the ChEMBL web service) are on the Y-axis. The size of each point is determined by the compound's calculated score for that target. The scatter plot is interactive: selecting points displays the associated compounds in the lower-left table of the view and the corresponding score and measurement data in the lower-right table.
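As a reminder of what the Y-axis shows: a pChEMBL value is the negative base-10 logarithm of a measured half-maximal activity (IC50, Ki, EC50, etc.) expressed in molar units, so higher means more potent. A quick sketch of the conversion from a nanomolar measurement:

```python
import math

def pchembl_from_nM(value_nM: float) -> float:
    """pChEMBL = -log10(activity in molar); input is in nanomolar."""
    return -math.log10(value_nM * 1e-9)

pchembl_from_nM(10.0)  # an IC50 of 10 nM corresponds to a pChEMBL of 8
```

So the points high on the Y-axis are the nanomolar-and-better binders.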
If the model were performing very well, we would expect the scatter plot in Figure 6 to show large scores (large points) for the compounds with high activity (high pChEMBL values), and that is more or less what we observe. There are clearly some outliers, but it's probably fine to pay at least some attention to the model's predictions for other compounds/targets. (Note: most of the data points used in this example are actually in the model's training set, so this is not a truly valid assessment; it's just an example to demonstrate the view and its interactivity.)
Summary
In this blog post, we showed how to import a multitask neural network for bioactivity prediction built in PyTorch into a KNIME workflow and use it to generate predictions for new compounds. We also showed some interactive views for working with the model's predictions and gaining confidence in them. The workflow, pre-trained model, and sample data are available on the Hub to download, study, and use in your own work.
Download the ‘Generate Predictions with ONNX’ workflow from the hub here.