The time is fast approaching when we must embed technological advances into clinical practice. The Academy of Royal Medical Colleges (ARMC) in the United Kingdom notes in its recent report: “For AI to truly flourish, not only must IT be overhauled and made inter-operable, but the quality and extent of health data must be radically improved too”. We can help ensure data quality by treating data provenance as a core element of our analysis.
Beware of garbage in – garbage out
AI models depend on data, and more data often leads to better performance, up to a point. Beyond that point, quality counts for more than quantity, and high-quality data contributes to much better models. On the volume side, one analysis referencing US Department of Transportation studies reports: “Google collected far more data per car [than other companies] to feed a more advanced machine learning system, and its cars improved by 400% – an amazing jump in innovation, and more than ten times as much as cars utilizing fewer data.”
However, we have become very familiar with the possible effects of ‘garbage in – garbage out’ on the Internet. A 2018 study by MIT leaves us in no doubt that the impact and scale of misinformation are significant: “It seems to be pretty clear [from our study] that false information outperforms true information”. Unsupervised AI training processes do not have a ‘human in the loop’ to fend off dubious training data. Moreover, many AI systems are being shown to be very ‘brittle’ in the face of adversaries who supply intentionally ‘bad’ training data.
In the context of healthcare, ARMC recommends: “For those who meet information handling and governance standards, data should be made more easily available across the private and public sectors. It should be certified for accuracy and quality […] External critical appraisal and transparency of tech companies is necessary for clinicians to be confident that the tools they are providing are safe to use. In many respects, AI developers in healthcare are no different from pharmaceutical companies who have a similar arms-length relationship with care providers. This is a useful parallel and could serve as a template. As with the pharmaceutical industry, licensing and post-market surveillance are critical and methods should be developed to remove unsafe systems.”
As a concrete example, the work of Hannun et al. on arrhythmia detection shows that AI model performance depends strongly on label quality, yet the input labels that characterise the training data are only approximately 75% accurate. The safety ramifications of such poor label accuracy are sobering.
Where an AI system does not have a human in the loop, who mediates the training data? When machines talk directly to machines, there is no one between the input data, the model, and the answers.
How do we ensure the quality of AI capabilities?
How do we ensure that our data sets come from reliable sources, and are of credible authenticity and quality, especially when we may rely on totally automated – and autonomous – processes to obtain and assimilate that data? And how do we ensure that AI models resulting from training on such data are safe? How do we know that these models have been tried and tested? Should we seek independent validation to help ensure quality and safety? Similar questions have been asked of pharmaceuticals for half a century: are AI capabilities for use in healthcare so very different?
Provenance information is critical
Provenance information seeks to increase confidence in digital information and processes by providing contextual and circumstantial information about the production or discovery of a digital object, and about its history. “As the quantity of data that contributes to a particular result increases, keeping track of how different sources and transformations are related to each other becomes more difficult. This [challenge] constrains the ability to answer questions regarding a history of a result, such as: What were the underlying assumptions on which the result is based? Under what conditions does it remain valid? What other results were derived from the same data sources?”
Provenance information needs to provide answers to these (and other similar) questions, and this information needs to have demonstrable independence from the supplier to provide enough assurance of its validity.
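To make those questions concrete, here is a minimal sketch of how a provenance record might be kept alongside each data set, model or result. It is an illustration under stated assumptions, not a reference to any existing standard or product: the `ProvenanceRecord` class, its fields and the helper functions are all hypothetical names, and the example artefacts are made-up placeholders. Each record stores the assumptions it rests on and links to the records it was derived from, and its digest covers the digests of its parents, so a change anywhere upstream becomes visible downstream.

```python
import dataclasses
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record for one artefact (data set, model or result)."""
    name: str
    content: bytes            # the artefact itself, or a serialised form of it
    activity: str             # free-text description of how it was produced
    assumptions: tuple = ()   # conditions under which the artefact is valid
    parents: tuple = ()       # records this artefact was derived from

    @property
    def digest(self) -> str:
        """SHA-256 over the content, the metadata and the parents' digests."""
        payload = json.dumps({
            "name": self.name,
            "content": hashlib.sha256(self.content).hexdigest(),
            "activity": self.activity,
            "assumptions": list(self.assumptions),
            "parents": [p.digest for p in self.parents],
        }, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def sources(record):
    """Names of the original (parentless) sources a record depends on."""
    if not record.parents:
        return {record.name}
    found = set()
    for parent in record.parents:
        found |= sources(parent)
    return found


def all_assumptions(record):
    """Every assumption the record inherits from its whole history."""
    collected = set(record.assumptions)
    for parent in record.parents:
        collected |= all_assumptions(parent)
    return collected


def derived_from(source_name, records):
    """Names of the records that were derived from the named source."""
    return [r.name for r in records
            if r.name != source_name and source_name in sources(r)]


# Placeholder artefacts, invented purely for illustration.
recordings = ProvenanceRecord("recordings", b"raw signal bytes", "collected")
labels = ProvenanceRecord("labels", b"annotations", "annotated",
                          assumptions=("labels were independently reviewed",))
model = ProvenanceRecord("model", b"weights", "trained",
                         assumptions=("adult patients only",),
                         parents=(recordings, labels))
report = ProvenanceRecord("report", b"metrics", "evaluated", parents=(model,))

print(sorted(all_assumptions(report)))
print(derived_from("labels", [recordings, labels, model, report]))
```

A hash chain of this kind only shows that the history has not been altered since it was recorded. It says nothing about who recorded it, which is why independence from the supplier still has to be established by other means, such as a signature or registration with a third party.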
In the art market, provenance translates into value: “Nothing in the known universe, no item, object or quantity of material, has ever appreciated in value as fast as the Salvator Mundi”, whose value rose from USD 1,175 in 2005 to USD 450 million in 2017, following its attribution to Leonardo da Vinci. In the healthcare sector, provenance translates into safety as well as financial benefit.
Regulation of health AI provenance?
Pharmaceutical products have been regulated in the EU for half a century, a process accelerated in large part by the thalidomide disaster, which exemplified the need for evidence-based authorisation. The European Commission summarises the general position: “When applying for marketing authorisation, companies must provide documentation showing that the product is of suitable quality […] The manufacture or import of medicinal products itself […] is subject to manufacturing or import authorisation.”
The call for good provenance is also extending to the field of genetics. The World Health Organization (WHO) guidelines say: “The cloning procedure should be carefully documented, including the provenance of the original culture, the cloning protocol, and the reagents used.”
We urgently need to take key policy measures for the provenance of AI capabilities used in healthcare. For example, as a 2018 report from the UK Government Office for Science said concerning computer modelling: “Decision-makers need to be intelligent customers for models, and those that supply models should provide appropriate guidance to model users to support proper use and interpretation. This includes providing suitable model documentation detailing model purpose, assumptions, sensitivities, and limitations, and evidence of appropriate quality assurance … Government and the corporate sector need to consider how to govern and where necessary regulate the use of advanced models of complex systems”.
In the digital domain, patient safety depends not only on the provenance of the AI models and processes in use but also on the validity, security and privacy of the associated provenance data itself. Regulation must strike a balance between two opposing forces: the increased availability that comes from open sharing, and the safety, privacy and security ramifications of that sharing. That balance has to be found in a digital world where cyber-attacks are increasingly impactful and commercial interests seek to aggregate and control critical information.
Daryl Arnold – OCEAN PROTOCOL