Via ars technica
-----
When it comes to data storage, efforts to get faster access grab most of
the attention. But long-term archiving of data is equally important,
and it generally requires a completely different set of properties. To
get a sense of why getting this right matters, consider the recently
revived NASA satellite: recovering the satellite's data relies on the
fact that a separate NASA mission still had an antiquated tape drive
compatible with the satellite's communication software.
One of the more unexpected technologies to receive some attention as
an archival storage medium is DNA. While it is incredibly slow to store
and retrieve data from DNA, we know that information can be pulled out
of DNA that's tens of thousands of years old. And there have been some
impressive demonstrations of the approach, like an operating system
being stored in DNA at a density of 215 petabytes per gram.
But that method treated DNA as a glob of unorganized bits—you had to
sequence all of it in order to get at any of the data. Now, a team of
researchers has figured out how to add something like a filesystem to
DNA storage, allowing random access to specific data within a large
collection of DNA. While doing this, the team also tested a recently
developed method for sequencing DNA that can be done using a compact USB
device.
Randomization
DNA holds data as a combination of four bases, so storing data in it
requires a way of translating bits into this system. Once the data is
translated, it's chopped up into smaller pieces (usually 100 to 150
bases long) and inserted between ends that make it easier to copy and
sequence. These ends also contain some information about where the data
resides in the overall storage scheme (i.e., these are bytes 197 to 300).
To restore the data, all the DNA has to be sequenced, the locational
information read, and the DNA sequence decoded. In fact, the DNA needs
to be sequenced several times over, since there are errors and a degree
of randomness involved in how often any fragment will end up being
sequenced.
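The translation and chunking steps can be sketched in code. This is an
illustrative model, not the paper's actual codec: it uses the simplest
possible mapping (two bits per base) and a made-up fragment size, and
omits the error-correction coding a real system needs.

```python
# Illustrative sketch of DNA data encoding (not the paper's actual codec):
# map every 2 bits to one of the four bases, then split the result into
# fragments, each paired with its offset in the overall stream.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}

def encode(bits: str) -> str:
    """Translate a bit string (even length) into a base sequence."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def fragment(sequence: str, size: int = 100):
    """Chop a long base sequence into fragments, each tagged with its
    offset so the original order can be restored after sequencing."""
    return [(offset, sequence[offset:offset + size])
            for offset in range(0, len(sequence), size)]

payload = encode("0111001001" * 40)   # 400 bits -> 200 bases
fragments = fragment(payload)          # two fragments, at offsets 0 and 100
```

Real schemes use denser, error-tolerant codes, but the principle is the
same: the payload carries its own address so order survives the shuffle.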
Adding random access to data would cut down significantly on the
amount of sequencing that would need to be done. Rather than sequencing
an entire archive just to get one file out of it, the sequencing could
be far more targeted. And, as it turns out, this is pretty simple to do.
As noted above, the data is packed between short flanking DNA
sequences, which make it easier to copy and sequence. There are lots of
potential sequences that can fit the bill in terms of making DNA easier
to work with. The researchers identified thousands of them. Each of
these can be used to tag the intervening data as belonging to a specific
file, allowing it to be amplified and sequenced separately, even if
it's present in a large mixture of DNA from different files. If you want
to store more files, you just have to keep different pools of DNA, each
containing several thousand files (or multiple terabytes). Keeping
these pools physically separated requires about a square millimeter of
space.
(It's possible to have many more of these DNA sequencing tags, but
the authors selected only those that should produce very consistent
amplification results.)
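In software terms, the flanking tags behave like file identifiers:
amplifying a file means selecting only the molecules whose tag matches.
A minimal model of that lookup, with entirely hypothetical tags and
payloads (the wet-lab analogue is selective PCR, not a list filter):

```python
# Hypothetical model of tag-based random access: each stored molecule
# carries a file tag in its flanking sequence. "Amplifying" a file means
# pulling out only the molecules whose tag matches, even though they sit
# in one large mixed pool.
pool = [
    ("TAGAC", "payload-for-file-1a"),
    ("CCGTA", "payload-for-file-2"),
    ("TAGAC", "payload-for-file-1b"),
]

def read_file(pool, tag):
    """Return only the payloads belonging to one file, leaving the rest
    of the pool untouched."""
    return [payload for t, payload in pool if t == tag]

print(read_file(pool, "TAGAC"))  # both file-1 fragments, nothing else
```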
The team also came up with a clever solution to one of the problems
of DNA storage. Lots of digital files will have long stretches of the
same bits (think of a blue sky or a few seconds of silence in a music
track). Unfortunately, DNA sequencing tends to choke when confronted
with a long run of identical bases, either producing errors or simply
stopping. To avoid this, the researchers created a random sequence and
used it to do a bit-flipping operation (XOR) with the sequence being
encoded. This breaks up long runs of identical bases while posing
minimal risk of creating new ones.
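The XOR trick is easy to demonstrate. A sketch, with an arbitrary seed
standing in for however the researchers generated their random stream:
XORing with a fixed pseudorandom mask scrambles long runs, and applying
the same mask again restores the original exactly.

```python
import random

def whiten(bits, seed=42):
    """XOR the data with a seeded pseudorandom bit stream. Long runs of
    identical bits (and hence identical bases) get broken up; XORing
    with the same stream a second time restores the original."""
    rng = random.Random(seed)
    mask = [rng.getrandbits(1) for _ in bits]
    return [b ^ m for b, m in zip(bits, mask)]

data = [0] * 16                   # a long run, like a patch of blue sky
scrambled = whiten(data)          # the run is broken up by the mask
assert whiten(scrambled) == data  # XOR is its own inverse
```

Because the mask is reproducible from the seed, nothing extra needs to
be stored alongside the data to undo the scrambling.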
Long reads
The other bit of news in this publication is the use of a relatively
new DNA sequencing technology that involves stuffing strands of DNA
through a tiny pore and reading each base as it passes through. The
technology for this is compact enough that it's available in a
palm-sized USB device. The technology had been pretty error-prone, but
it has improved enough that it was recently used to sequence an entire human genome.
While the nanopore technique has issues with errors, it has the
advantage of working with much longer stretches of DNA. So the authors
rearranged their stored data so it sits on fewer, longer DNA molecules
and gave the hardware a test.
The hardware had a strikingly high error rate, about 12 percent by
their measure, which suggests the system still needs to be adapted to
the DNA samples the authors prepared. Still, the errors were mostly
random, and the team was able to identify and correct them by
sequencing enough molecules so that, on average, each DNA sequence was
read 36 times.
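Because the errors are mostly random, reading each molecule many times
lets a simple majority vote recover the true sequence. A toy version of
that consensus step (real pipelines must also align reads and handle
insertions and deletions, which this sketch ignores):

```python
from collections import Counter

def consensus(reads):
    """Per-position majority vote across many noisy reads of the same
    sequence. With random errors and enough coverage, the correct base
    wins at each position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

reads = [
    "ACGTACGT",
    "ACGAACGT",   # one random error
    "ACGTACCT",   # another error, at a different position
    "ACGTACGT",
]
print(consensus(reads))  # ACGTACGT
```

This is why coverage matters: at 36 reads per sequence, a 12 percent
per-base error rate almost never produces a wrong majority at any one
position.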
So, with something resembling a filesystem and a compact reader, are
we moving close to the point where DNA-based storage is practical? Not
exactly. The authors point out the issue of capacity. Our ability to
synthesize DNA has grown at an astonishing pace, but it started from
almost nothing a few decades ago, so it's still relatively small.
Assuming a DNA-based drive could read a few kilobytes per second, the
researchers calculate that it would take only about two weeks to read
every bit of DNA that we could synthesize annually. Put
differently, our ability to synthesize DNA has a long way to go before
we can practically store much data.
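The back-of-the-envelope arithmetic is worth making concrete. Taking
2 KB/s as a stand-in for "a few kilobytes per second" (the exact figure
is an assumption here), a drive reading nonstop for two weeks covers
only a few gigabytes, which bounds annual synthesis capacity:

```python
# Rough capacity check, assuming a 2 KB/s read rate (the article only
# says "a few KB per second") and a two-week read window.
read_rate = 2 * 1024           # bytes per second (assumed)
two_weeks = 14 * 24 * 3600     # seconds
annual_synthesis = read_rate * two_weeks
print(annual_synthesis / 1e9)  # roughly 2.5 GB: tiny by storage standards
```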
Nature Biotechnology, 2018. DOI: 10.1038/nbt.4079 (About DOIs).