A multidisciplinary team of scientists has identified a novel way to store information in DNA, in this case “The Wizard of Oz” translated into Esperanto, with unprecedented efficiency and accuracy.
The method taps the data-storage capacity of intertwined DNA strands to encode and recover information in a compact and durable way.
The approach has been reported in a paper recently published in the Proceedings of the National Academy of Sciences.
“The key breakthrough is an encoding algorithm that allows accurate retrieval of the information even when the DNA strands are partially damaged during storage.”
Ilya Finkelstein, Study Author and Associate Professor, Department of Molecular Biosciences, University of Texas at Austin
Humans are creating information at exponentially higher rates than ever before, so there is a need to store more of it, more effectively, and in a form that lasts longer. Companies such as Microsoft and Google are among those investigating the use of DNA to store information.
“We need a way to store this data so that it is available when and where it’s needed in a format that will be readable. This idea takes advantage of what biology has been doing for billions of years: storing lots of information in a very small space that lasts a long time. DNA doesn’t take up much space, it can be stored at room temperature, and it can last for hundreds of thousands of years.”
Stephen Jones, Research Scientist, Department of Computer Science and Integrative Biology, University of Texas at Austin
Jones collaborated on the project with Finkelstein; Bill Press, a professor jointly appointed in computer science and integrative biology; and PhD alumnus John Hawkins.
DNA is around 5 million times more efficient than existing storage methods. In other words, a 1 mL droplet of DNA could store the same amount of data as two Walmarts full of data servers. Moreover, DNA storage requires neither permanent cooling nor hard disks, which are susceptible to mechanical failure.
However, one major disadvantage is that DNA is prone to errors, and errors in genetic code are quite different from errors in computer code. Errors in computer code tend to appear as blank spots in the code. By contrast, errors in DNA sequences tend to appear as insertions or deletions. The issue is that when something is added to or deleted from a DNA strand, everything after that point shifts, and no blank spot remains to flag where the error occurred.
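The difference between the two kinds of error can be illustrated with a short Python sketch. It assumes a naive mapping of two bits per base, which is chosen purely for illustration and is not the encoding used in the study; the function names and the sample message are likewise hypothetical.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11 (illustrative only)

def encode(data: bytes) -> str:
    """Turn each byte into four bases, two bits per base."""
    return "".join(
        BASES[(byte >> shift) & 3] for byte in data for shift in (6, 4, 2, 0)
    )

def decode(strand: str) -> bytes:
    """Read bases four at a time; a trailing partial group is dropped."""
    out = bytearray()
    for i in range(0, len(strand) - 3, 4):
        value = 0
        for base in strand[i:i + 4]:
            value = (value << 2) | BASES.index(base)
        out.append(value)
    return bytes(out)

def mismatches(a: bytes, b: bytes) -> int:
    """Count differing bytes, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

message = b"There is no place like home"
strand = encode(message)
pos = 20  # position of the simulated error

# Substitution: one base is replaced, the strand keeps its length.
swapped = BASES[(BASES.index(strand[pos]) + 1) % 4]
substituted = strand[:pos] + swapped + strand[pos + 1:]

# Deletion: one base is lost, so every later base shifts by one place.
deleted = strand[:pos] + strand[pos + 1:]

print("substitution:", decode(substituted))
print("deletion:    ", decode(deleted))
print("bytes wrong after substitution:", mismatches(decode(substituted), message))
print("bytes wrong after deletion:    ", mismatches(decode(deleted), message))
```

In this sketch the substitution corrupts a single byte, while the deletion leaves the text before the error intact and scrambles what follows, because every later group of four bases is read out of frame.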
Previously, when information was stored in DNA, each piece of data to be saved, for example a paragraph from a novel, would be repeated 10 to 15 times. To read the information, the repetitions would be compared with one another to weed out any insertions or deletions.
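The repetition approach can be sketched roughly as follows. This is a deliberately simplified Python illustration, not the procedure used by earlier DNA storage systems: it assumes the expected strand length is known, discards any copy whose length reveals an insertion or deletion, and takes a position-by-position majority vote over the rest. Real pipelines typically align the copies instead of discarding them. The function name and the sample reads are hypothetical.

```python
from collections import Counter
from typing import List

def consensus(reads: List[str], expected_length: int) -> str:
    """Majority vote per position over the reads that have the expected length."""
    # A read with an insertion or deletion has the wrong length; drop it.
    usable = [read for read in reads if len(read) == expected_length]
    if not usable:
        raise ValueError("no read has the expected length")
    # zip(*usable) yields one column of bases per position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*usable)
    )

original = "ACGTTGCAAGCT"
reads = [
    "ACGTTGCAAGCT",   # clean copy
    "ACGTTGCAGCT",    # one base deleted
    "ACGTTGCAAGCT",   # clean copy
    "ACGATGCAAGCT",   # substitution at index 3
    "ACGTTGGCAAGCT",  # one base inserted
    "ACGTTGCAAGCT",   # clean copy
    "ACGTTGCAAGCA",   # substitution at the last base
]

recovered = consensus(reads, expected_length=len(original))
print(recovered == original)
```

The cost of this approach is visible even in the toy example: seven copies are stored and read to recover a single 12-base sequence.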
“We found a way to build the information more like a lattice. Each piece of information reinforces other pieces of information. That way, it only needs to be read once.”
Stephen Jones, Research Scientist, Department of Computer Science and Integrative Biology, University of Texas at Austin
The language developed by the scientists also avoids segments of DNA that are susceptible to errors or difficult to read. The parameters of the language can also be adjusted to suit the type of information being stored. For example, a dropped word in a novel is not as serious as a dropped zero in a tax return.
The researchers demonstrated data retrieval from degraded DNA by subjecting their “Wizard of Oz” code to extreme humidity and high temperatures. Although the DNA strands were damaged by these severe conditions, all the information was still successfully decoded.
According to Hawkins, who was until recently with UT’s Oden Institute for Computational Engineering and Sciences, “We tried to tackle as many problems with the process as we could at the same time. What we ended up with is pretty remarkable.”
Journal reference:
Press, W. H., et al. (2020) HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Proceedings of the National Academy of Sciences. doi.org/10.1073/pnas.2004821117.