AI-driven integrative biology for accelerating therapeutic drug discoveries to treat SARS-CoV-2

CompBioMed partners and academics from other research facilities are investigating the fundamental biological mechanisms of both the novel coronavirus SARS-CoV-2 and COVID-19, the disease it causes. The ongoing research also attempts to target the entire viral proteome (the full complement of proteins that is or can be expressed by a cell, tissue or organism; here, by the virus) in order to identify potential therapies.

The work described in today’s blogpost is being carried out by Argonne National Laboratory (ANL), an international core partner in the CompBioMed project, and is part of a wider international effort that includes experimentalists in other facilities (who are contributing extensively to knowledge of the 3D structure of the virus using a wide range of characterisation techniques) as well as other computational and theoretical scientists. On ANL’s part, it involves the development of machine learning (ML), deep learning (DL), and artificial intelligence (AI) techniques to: build accurate 3D structural models of the entire complement of structural forms the SARS-CoV-2 virus can take; accelerate the identification of novel binding sites on the viral proteins, which could potentially be targeted by small molecules in therapies; and rapidly filter and rank molecules by how well they are likely to bind to the identified sites (and hence by their likely effectiveness in therapies).

To fully exploit available preventatives and therapies, it is important to understand the full complement of structural forms that the virus can take at different times and in different conditions. Likewise, identifying novel binding sites and rapidly filtering and ranking molecules are important because they allow us to fast-track the billion or so potential compounds in available libraries and pick out the most promising ones, a task which would be impossible without the techniques being developed by ANL.

Fast-tracking of the huge libraries of compounds is being achieved through the use of an active learning strategy to “seed” the molecules to the identified active sites on the protein targets, followed by refinement using AI-driven adaptive sampling strategies. On a more microscopic level, this has required the ANL team to make developments in reinforcement learning (RL) so that small molecules can be driven onto the virus proteins whilst the virus is itself moving in the simulation, and to develop interfaces to new and existing AI tools and physics models to produce a complete picture. The generated results are being verified using uncertainty scoring criteria which the team originally developed for research to predict the response of cancer cells to drug treatments. Initial tests of this technique on a random subset of molecules have reported “scores” for the top 10% of molecules, those most likely to be effective for treating COVID-19, that are 10 times greater than the scores for the bottom 10%. This indicates that we can significantly increase the scope and speed of the search for effective drugs compared to traditional manual methods. The results of the candidate filtering and uncertainty scoring on the full molecular database will not only provide a priority list for experimental investigations, significantly reducing the time taken to find therapies, but will also form part of an active learning effort to determine the region of chemical space on which efforts should be focused.
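To make the top-10% versus bottom-10% comparison described above concrete, here is a minimal Python sketch. It is not ANL's scoring code: the function names `decile_ratio` and `prioritise`, and the idea of a single positive score per molecule, are assumptions made purely for illustration.

```python
from statistics import mean


def decile_ratio(scores):
    """Return mean(top 10%) / mean(bottom 10%) for a list of positive scores.

    Hypothetical helper: assumes higher scores mean "more likely to bind".
    """
    ranked = sorted(scores, reverse=True)
    k = max(1, len(ranked) // 10)  # size of one decile, at least one item
    top, bottom = ranked[:k], ranked[-k:]
    return mean(top) / mean(bottom)


def prioritise(candidates, fraction=0.1):
    """Return the names of the top `fraction` of candidates, best first.

    `candidates` maps a molecule name to its (assumed) score.
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return [name for name, _ in ranked[:k]]
```

For a toy list of scores 1 to 20, `decile_ratio(list(range(1, 21)))` gives 13.0, because the two best scores average 19.5 and the two worst average 1.5. A priority list for experimental follow-up would then correspond to the output of something like `prioritise`.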

Along with developing AI techniques to quickly filter out unsuitable molecules, the group at ANL are also advancing the way in which the more robust calculations needed to compute the free energies of binding of the molecules to the protein active sites are carried out. It is only in recent years that binding free energy predictions from both molecular simulations and ML have become mature enough to be useful in industrial-scale drug discovery. Even now, most studies using the techniques involve tens, or at most hundreds, of systems. The ongoing studies to find therapies for COVID-19, however, involve molecular databases several times bigger than this, so are much more ambitious. The studies also require analysis of hundreds of individual mutations of single molecules, and an understanding of how thousands of combinations of changes impact the full range of molecules.

One main barrier to running molecular simulations at large scale is the vast amount of computational resources required, which is addressed through CompBioMed partners’ access to some of the world’s largest high performance computing facilities. Another barrier relates to the way that molecular simulations work: they are prone to becoming trapped within local energy minima on the potential energy surfaces they use, which makes them much less productive at sampling new molecular states. Typically, researchers manually track the progress of their calculations and, through experience, are able to identify and intervene if a simulation is not progressing correctly. Clearly, with tens or hundreds of thousands of calculations running simultaneously, this is not practical.
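The trapping problem described above can be illustrated with a toy model. The Python sketch below is not a molecular simulation: it runs a simple Metropolis random walk on an assumed one-dimensional double-well potential, and the names `potential`, `metropolis_walk` and `fraction_in_right_well`, along with all parameter values, are hypothetical choices for illustration only.

```python
import math
import random


def potential(x, barrier=5.0):
    """Toy double well (an assumption): minima at x = -1 and x = +1."""
    return barrier * (x * x - 1.0) ** 2


def metropolis_walk(x0=-1.0, kT=0.2, steps=2000, step_size=0.1, seed=1):
    """Metropolis random walk on the toy potential; returns visited positions."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(steps):
        trial = x + rng.uniform(-step_size, step_size)
        d_e = potential(trial) - potential(x)
        # Always accept downhill moves; accept uphill moves with
        # Boltzmann probability exp(-dE / kT).
        if d_e <= 0 or rng.random() < math.exp(-d_e / kT):
            x = trial
        path.append(x)
    return path


def fraction_in_right_well(path):
    """Fraction of visited positions lying in the x > 0 well."""
    return sum(1 for x in path if x > 0) / len(path)
```

With the default settings the barrier is 25 times the thermal energy `kT`, so a walker started in the left well effectively never reaches the right one within 2000 steps. It keeps resampling states it has already seen, which is the behaviour a researcher would normally have to spot by eye.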

To overcome this limitation, the researchers are using codes which they have previously developed – Enhanced Sampling of Molecular dynamics with Approximation of Continuum Solvent (“ESMACS”) and Thermodynamic Integration with Enhanced Sampling (“TIES”) – along with statistical methods to automate the process of checking. The basic approach uses ESMACS to run an ensemble of parallel simulations from different starting configurations, which are then managed iteratively based on statistical criteria. The computer is able to decide which simulations should be continued and which should be terminated, and to identify new configurations from which to begin sampling, all without human intervention. The TIES code is then used to determine the change in binding affinity (how well the sample molecules interact with the virus) when small functional groups on the molecules are mutated into others. The data obtained by combining the ESMACS and TIES codes will be used both to inform experimental studies and to train new ML algorithms.
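The idea of managing an ensemble by statistical criteria, as described above, can be sketched in a few lines of Python. This is not the ESMACS or TIES code and does not use their interfaces: `run_segment` is a hypothetical stand-in that returns random numbers instead of running molecular dynamics, and the stopping rule (retire a replica once the standard error of its mean falls below `tol`) is one assumed criterion among many possible ones. Choosing new starting configurations is not shown.

```python
import random
from statistics import mean, stdev


def run_segment(rng, centre, steps=50):
    """Hypothetical stand-in for one simulation segment: noisy 'energies'."""
    return [rng.gauss(centre, 1.0) for _ in range(steps)]


def sem(values):
    """Standard error of the mean of a list of values."""
    return stdev(values) / len(values) ** 0.5


def manage_ensemble(n_replicas=5, tol=0.08, max_rounds=20, seed=0):
    """Extend replicas round by round; retire each one once it has converged."""
    rng = random.Random(seed)
    # Each replica starts from a different (toy) configuration.
    centres = [rng.uniform(-10.5, -9.5) for _ in range(n_replicas)]
    samples = [[] for _ in range(n_replicas)]
    active = set(range(n_replicas))
    rounds = 0
    while active and rounds < max_rounds:
        rounds += 1
        for i in sorted(active):  # sorted() copies, so discarding is safe
            samples[i].extend(run_segment(rng, centres[i]))
            # Assumed criterion: stop a replica when its mean is well determined.
            if sem(samples[i]) < tol:
                active.discard(i)
    replica_means = [mean(s) for s in samples]
    # Ensemble estimate and its uncertainty across replicas.
    return mean(replica_means), sem(replica_means), rounds
```

Calling `manage_ensemble()` returns an ensemble-averaged value, an uncertainty estimated from the spread between replicas, and the number of rounds used. The point of the design is that the continue-or-terminate decision is made by a numerical test rather than by a person watching each calculation.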