None of this is to say that machine learning is a silver bullet. Even with self-educating computers, the GIGO principle — garbage in, garbage out — still applies. Without accurate scientific data as input, the results of machine learning algorithms will be at best mixed and at worst disastrous. It's no use trying to get a computer to recognize a puppy if the data it is being fed are pictures of kittens.
Successfully applying machine learning to materials innovation requires clearly classified, comprehensive data from varied sources. Without this input, a machine may draw data from a single source and apply it to every calculation ('garbage'), failing to factor in limitations and conditions. In reality, mixing two chemicals can produce radically different reactions if the ambient temperature is altered by 200 degrees, a difference a computer working with insufficient data will not recognize. Further, for the scientists carrying out these experiments in the lab, it is essential that the data be comprehensive and accurate, to ensure safety while pursuing innovation.
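The point about missing conditions can be made concrete with a minimal sketch. Assume, hypothetically, that each lab experiment is logged as a record; the field names (`reagent_a`, `temperature_c`, and so on) are illustrative, not drawn from any real system. A record that omits its temperature is exactly the kind of 'garbage' described above: the same two reagents appear twice with very different outcomes, and only the temperature field explains why.

```python
# Hypothetical sketch: screening experiment records for required context
# before they enter a training set. Field names are illustrative only.

REQUIRED_CONTEXT = {"reagent_a", "reagent_b", "temperature_c", "outcome"}

def has_full_context(record: dict) -> bool:
    """A record is usable only if every required context field is present."""
    return REQUIRED_CONTEXT.issubset(record)

records = [
    {"reagent_a": "X", "reagent_b": "Y", "temperature_c": 25, "outcome": "stable"},
    {"reagent_a": "X", "reagent_b": "Y", "temperature_c": 225, "outcome": "exothermic"},
    {"reagent_a": "X", "reagent_b": "Y", "outcome": "stable"},  # no temperature recorded
]

usable = [r for r in records if has_full_context(r)]
print(len(usable))  # 2 of 3 records carry enough context to be trusted
```

A real pipeline would of course validate far more than field presence, but even this simple gate prevents the third record, which silently drops the 200-degree difference, from teaching the model that mixing X and Y is always safe.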
Multiple factors can affect chemical reactions, so it is crucial that machines be fed comprehensive data with context: when is a piece of information reusable, and when does it not apply? Just as doctors must understand how interactions between medicines could affect patients differently depending on medical condition and dosage, machine learning tools need contextual data to evaluate the likelihood that their calculations are correct. A repository of verified, trusted, cross-domain data must be the foundation of any machine learning system, so that its insights are accurate, safe, grounded in proven data and likely to hold up in the real world.