Machine learning at the CSIRO has automated a faster way to spot emerging covid variants.
The powerful tool, however, is stymied by a lack of interoperability that would link genome data back to patient outcomes.
The new tool, VariantSpark, analyses the whole RNA genome of each variant, rather than monitoring changes to the spike protein alone as current methods do. CSIRO scientist Associate Professor Denis Bauer said this means it can account for small changes elsewhere in the genome.
“On their own these changes may not seem significant but when combined with other small changes can influence the way the virus behaves. Our approach was able to identify variants that could be monitored a week before they were flagged by health organisations – and a week is a long time when you’re trying to outsmart a pandemic,” Professor Bauer said.
The tool was programmed to provide hourly updates, which could enable rapid information sharing with public health decision makers and help hospitals prepare for increases in demand. However, a lack of interoperability is a major barrier.
“In order to be able to anticipate, say the hospital load, or whether vaccine is failing, you need to know immediately when it’s actually happening. We could do that if we had that interoperable data,” Professor Bauer said.
VariantSpark’s machine learning algorithm analysed the genomes of 10,000 covid samples, the largest number of samples ever analysed in this way. Covid patients with mild or no symptoms were compared with patients who had severe outcomes, including death. The variant that infected each individual was known, and VariantSpark was able to detect differences between the two study cohorts.
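The kind of cohort comparison described above can be illustrated with a toy sketch. This is not VariantSpark's algorithm, which the article does not detail; the sequences, the cohort lists and the function name `mutation_frequency_gap` are invented for illustration. The sketch simply ranks genome positions by how much more often they differ from a reference in the severe cohort than in the mild one.

```python
from collections import Counter

def mutation_frequency_gap(reference, mild, severe):
    """Rank positions by the gap in mutation frequency between two cohorts.

    reference: reference sequence (hypothetical toy data)
    mild, severe: non-empty lists of sequences aligned to the reference
    Returns (position, gap) pairs sorted by descending absolute gap.
    """
    def mutated_fraction(samples):
        counts = Counter()
        for seq in samples:
            for pos, (ref_base, base) in enumerate(zip(reference, seq)):
                if base != ref_base:
                    counts[pos] += 1
        # Positions never mutated get a fraction of 0.0
        return {pos: counts[pos] / len(samples) for pos in range(len(reference))}

    mild_frac = mutated_fraction(mild)
    severe_frac = mutated_fraction(severe)
    gaps = [(pos, severe_frac[pos] - mild_frac[pos]) for pos in range(len(reference))]
    return sorted(gaps, key=lambda item: abs(item[1]), reverse=True)

# Invented toy sequences, not real viral genomes
reference = "ACGUACGU"
mild = ["ACGUACGU", "ACGUACGA", "ACGUACGU"]
severe = ["ACCUACGU", "ACCUACGA", "ACCUACGU"]

ranking = mutation_frequency_gap(reference, mild, severe)
print(ranking[0])  # position with the largest cohort difference
```

In this toy data the change at position 2 appears only in the severe cohort, while the change at position 7 is equally common in both. A one-position-at-a-time ranking like this cannot capture the combined effect of several small changes, which is the behaviour Professor Bauer says the whole-genome approach is meant to account for.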
The research drew on the largest database of covid viral genomes, which stores nearly 13 million genome samples. Professor Bauer said it’s a staggering amount, but the problem is that the samples are not annotated with the outcome of the patient.
“There’s a huge problem that needs to be addressed. We could only actually analyse 0.3% of the samples because the rest of the data was missing. We really need to close the gap between the laboratory that does the genome sequencing and the healthcare system looking after the patient. That is a missed opportunity,” Professor Bauer said.
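The gap between sequencing laboratories and the healthcare system can be pictured as a failed join. The following is a minimal sketch, assuming a hypothetical shared `sample_id` field and invented records; the article does not describe the database's schema or how outcomes would actually be linked. A genome record can only enter a severity analysis if a matching outcome record exists.

```python
def link_outcomes(genome_records, outcome_records):
    """Attach patient outcomes to genome records that share a sample_id.

    Both inputs are lists of dicts; the field names are hypothetical.
    Returns (linked_records, usable_fraction).
    """
    outcomes = {rec["sample_id"]: rec["outcome"] for rec in outcome_records}
    linked = [
        {**rec, "outcome": outcomes[rec["sample_id"]]}
        for rec in genome_records
        if rec["sample_id"] in outcomes
    ]
    fraction = len(linked) / len(genome_records) if genome_records else 0.0
    return linked, fraction

# Invented records: four sequenced samples, only one with a known outcome
genomes = [
    {"sample_id": "S1", "lineage": "A"},
    {"sample_id": "S2", "lineage": "B"},
    {"sample_id": "S3", "lineage": "A"},
    {"sample_id": "S4", "lineage": "C"},
]
outcomes = [{"sample_id": "S2", "outcome": "severe"}]

linked, fraction = link_outcomes(genomes, outcomes)
print(linked, fraction)
```

Only the records that survive the join are usable, so the usable fraction is capped by how many samples carry outcome data, however large the genome database grows.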
Professor Bauer said interoperability is needed at a global level.
“If you do this in Australia, that’s great. But if there’s a variant wave coming from outside, we’re still not prepared,” she said.
VariantSpark was originally developed to find genes associated with conditions such as motor neuron disease, and was designed to process immense datasets. It was pivoted to microbial research, specifically covid, because it was “such a universal tool for understanding genomic function and consequences of genomic mutations,” Professor Bauer said.
The technology behind the powerful machine learning tool started with Google’s MapReduce compute acceleration strategy, which powers Google’s search engine and other big data frameworks.
“But the problem with MapReduce is that it’s fairly scripted in how it can send out tasks and how it receives it. It’s not very amenable to machine learning tasks where a lot of things are in memory and need to be iteratively improved,” Professor Bauer said.
The next generation, Apache Spark, offered a more memory-efficient way of distributing CPU tasks. Professor Bauer said that VariantSpark was built on top of Apache Spark to farm out CPU work horizontally across a large array of different hardware.
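The split, process and merge pattern behind this kind of horizontal scaling can be sketched in a few lines. This is an illustration only: it uses threads on a single machine as a stand-in for cluster nodes, it does not use Apache Spark's API, and the function names and sequences are invented. Each worker counts bases in its own partition (the map step), and the partial counts are then merged (the reduce step).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_bases(partition):
    """Map step: count bases within one partition of sequences."""
    counts = Counter()
    for seq in partition:
        counts.update(seq)
    return counts

def distributed_base_counts(sequences, n_workers=2):
    """Split sequences into partitions, count in parallel, merge the results."""
    # Round-robin split: worker i takes every n_workers-th sequence
    partitions = [sequences[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_bases, partitions))
    total = Counter()
    for partial in partial_counts:  # reduce step: merge partial counts
        total.update(partial)
    return total

# Invented toy sequences
sequences = ["ACGU", "AACG", "UUGC", "ACGU"]
totals = distributed_base_counts(sequences)
print(dict(totals))
```

Because each partition is processed independently, adding workers does not change the result, only how the work is divided. The limitation Professor Bauer describes arises when an algorithm needs many passes over the same data, which is where keeping partitions in memory between passes matters.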
The power of VariantSpark could also be applied to other viruses.
“It has the potential to become the international standard of disease surveillance. If we remove the lack of interoperability in the healthcare system then we could identify which mutations we really need to look out for; which mutations are truly dangerous to humans,” Professor Bauer said.