In Support of Data-Centric AI for Drug Discovery

Earlier this year, Andrew Ng called for a shift to data-centric AI. Quoted in a Forbes article, Ng said, “Now that the models have advanced to a certain point, we got to make the data work as well.” Since reading the article, I have cited Ng and his sentiment frequently. Those conversations reminded me of a story my high school biology teacher told. As I recall it, Carl Sagan argued for adding a camera to the Viking I probe in case a Martian elephant walked up to it while it was too busy collecting soil to notice. A web search indicates that the actual anecdotes are about a hedgehog, a rabbit, and a tortoise, but the sentiment still rings true: if we focus too narrowly on algorithms and neglect to develop technology that answers other important data science questions, do we risk missing something game-changing?

In the last several years, and for good reason, AI research has focused heavily on algorithms and on moving the needle on metrics like ROC-AUC, often on curated and well-studied data sets. However, as Ng indicates, there are other equally important questions, such as whether we have the data or the means to collect it, and whether we are developing the cross-disciplinary expertise to ask the right questions in each target application. Without this rebalancing, it is legitimate to wonder whether we would even recognize it if a grad student somewhere already had an algorithm that could revolutionize how we answer a key question in drug discovery.

While researching AI for drug discovery online for one of my clients recently, I was struck that in four years the list of AI-biotech companies has gone from a single slide to a full-on industry report. In my area of early small molecule discovery, a stream of increasingly powerful open-source architectures and code (e.g. DeepChem, ChemProp) is enabling discovery teams across the industry. It is becoming more difficult for the average company to keep up with what is freely available for download now, not to mention what may be published in a month or a year. Focusing on building the best algorithm may in fact distract from the important question of what ground-breaking science AI-enabled domain experts could do by applying the technology to real-world problems.

I am in no way arguing that we drop algorithm development; it remains of central importance. I too want there to be models so powerful that a future Dr. McCoy can analyze patient zero and develop a cure before the end of an episode. My point is to echo Ng’s call for rebalancing. The Forbes article goes on to quote Google researchers as saying, “Paradoxically, data is the most under-valued and de-glamorised aspect of AI”. That is a problem that needs to be solved. The drug discovery community needs to broaden its focus in data science to emphasize data generation, insight development, and increased AI literacy across disciplines. ‘Devaluing’ domain expertise, data collection, and logistics will only prolong our time in the wilderness looking for the Martian elephant that may well be sitting in front of us.