Via Ars Technica
 
-----
 
 When it comes to data storage, efforts to get faster access grab most of
 the attention. But long-term archiving of data is equally important, 
and it generally requires a completely different set of properties. To 
get a sense of why getting this right is important, just take the 
recently revived NASA satellite as an example—extracting anything from 
the satellite's data will rely on the fact that a separate NASA mission had an antiquated tape drive that could read the satellite's communication software.
 
One of the more unexpected technologies to receive some attention as 
an archival storage medium is DNA. While it is incredibly slow to store 
and retrieve data from DNA, we know that information can be pulled out 
of DNA that's tens of thousands of years old. And there have been some 
impressive demonstrations of the approach, like an operating system being stored in DNA at a density of 215 petabytes per gram.
 
But that method treated DNA as a glob of unorganized bits—you had to 
sequence all of it in order to get at any of the data. Now, a team of 
researchers has figured out how to add something like a filesystem to 
DNA storage, allowing random access to specific data within a large 
collection of DNA. While doing this, the team also tested a recently 
developed method for sequencing DNA that can be done using a compact USB
 device.
 
Randomization
 
DNA holds data as a combination of four bases, so storing data in it 
requires a way of translating bits into this system. Once a chunk of data 
is translated, it's chopped up into smaller pieces (usually 100 to 150 
bases long) and inserted in between ends that make it easier to copy and
 sequence. These ends also contain some information about where the data 
resides in the overall storage scheme (for example, "these are bytes 197 to 300").
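
To make the translation step concrete, here is a minimal sketch in Python. It assumes a plain two-bits-per-base mapping and a two-byte index in front of each payload. Neither is confirmed as the researchers' actual scheme, and the names (bytes_to_bases, make_strands, the primer strings) are hypothetical.

```python
# Hypothetical mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
# The real encoding used by the researchers may differ.
BASES = "ACGT"


def bytes_to_bases(data: bytes) -> str:
    """Translate each byte into four bases, two bits per base."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def bases_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_bases; expects a length that is a multiple of 4."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def make_strands(data: bytes, payload_bytes: int = 25,
                 fwd: str = "ACGTACGTAC", rev: str = "TGCATGCATG") -> list:
    """Chop data into pieces, prefix each with a 2-byte index, add flanking ends.

    27 bytes (2 index + 25 payload) become 108 bases, inside the
    100-to-150-base range mentioned above. The flanking strings are made up.
    """
    strands = []
    for index, start in enumerate(range(0, len(data), payload_bytes)):
        chunk = data[start:start + payload_bytes]
        address = index.to_bytes(2, "big")
        strands.append(fwd + bytes_to_bases(address + chunk) + rev)
    return strands


def read_strands(strands, fwd_len: int = 10, rev_len: int = 10) -> bytes:
    """Strip the flanking ends, read each index, and reassemble in order."""
    pieces = {}
    for strand in strands:
        body = bases_to_bytes(strand[fwd_len:len(strand) - rev_len])
        pieces[int.from_bytes(body[:2], "big")] = body[2:]
    return b"".join(pieces[i] for i in sorted(pieces))
```

Because every strand carries its own index, the strands can be read back in any order, which matters since sequencing returns fragments in no particular sequence.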
 
To restore the data, all the DNA has to be sequenced, the locational 
information read, and the DNA sequence decoded. In fact, the DNA needs 
to be sequenced several times over, since there are errors and a degree 
of randomness involved in how often any fragment will end up being 
sequenced.
 
Adding random access to data would cut down significantly on the 
amount of sequencing that would need to be done. Rather than sequencing 
an entire archive just to get one file out of it, the sequencing could 
be far more targeted. And, as it turns out, this is pretty simple to do.
 
As noted above, the data is packed between short flanking DNA 
sequences, which make it easier to copy and sequence. There are lots of
 potential sequences that can fit the bill in terms of making DNA easier
 to work with. The researchers identified thousands of them. Each of 
these can be used to tag the intervening data as belonging to a specific
 file, allowing it to be amplified and sequenced separately, even if 
it's present in a large mixture of DNA from different files. If you want
 to store more files, you just have to keep different pools of DNA, each
 containing several thousand files (or multiple terabytes). Keeping 
these pools physically separated requires about a square millimeter of 
space.
 
(It's possible to have many more of these DNA sequencing tags, but 
the authors selected only those that should produce very consistent 
amplification results.)
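
The tagging idea can be sketched as a lookup by flanking sequence. The Python below is only an illustration: amplify stands in for the lab amplification step with plain string matching, whereas real primers bind to complementary strands, and all names and sequences here are invented.

```python
def build_pool(files: dict, primers: dict) -> list:
    """Mix the strands of every file into one pool.

    files maps a file name to its list of payload sequences;
    primers maps a file name to its (forward, reverse) tag pair.
    """
    pool = []
    for name, payloads in files.items():
        fwd, rev = primers[name]
        pool.extend(fwd + payload + rev for payload in payloads)
    return pool


def amplify(pool: list, fwd: str, rev: str) -> list:
    """Stand-in for selective amplification.

    Keeps only strands flanked by the given tag pair and returns their
    payloads; every other strand in the pool is ignored.
    """
    return [
        strand[len(fwd):len(strand) - len(rev)]
        for strand in pool
        if strand.startswith(fwd) and strand.endswith(rev)
    ]


# Hypothetical tags and payloads, chosen only for the example.
primers = {
    "cat.jpg": ("ACGTAC", "TTGACA"),
    "song.mp3": ("GGATCC", "CCTAGG"),
}
files = {
    "cat.jpg": ["AAAA", "CCCC"],
    "song.mp3": ["GGGG"],
}
pool = build_pool(files, primers)
cat_payloads = amplify(pool, *primers["cat.jpg"])
```

Only the strands tagged for the requested file come back, so only those would need to be sequenced.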
 
The team also came up with a clever solution to one of the problems 
of DNA storage. Lots of digital files will have long stretches of the 
same bits (think of a blue sky or a few seconds of silence in a music 
track). Unfortunately, DNA sequencing tends to choke when confronted 
with a long run of identical bases, either producing errors or simply 
stopping. To avoid this, the researchers created a random sequence and 
used it to do a bit-flipping operation (XOR) with the sequence being 
encoded. This breaks up long runs of identical bases and poses a 
minimal risk of creating new ones.
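
A rough Python sketch of the XOR trick follows. It assumes the random sequence comes from a seeded pseudo-random generator so the same sequence can be regenerated when decoding. The article does not say how the researchers produced or stored theirs, and the two-bits-per-base mapping is likewise an assumption.

```python
import random

BASES = "ACGT"  # hypothetical 2-bits-per-base mapping


def keystream(seed: int, length: int) -> bytes:
    """Reproducible pseudo-random bytes; the seed must be kept for decoding."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(length))


def scramble(data: bytes, seed: int) -> bytes:
    """XOR data with the keystream. Applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, keystream(seed, len(data))))


def to_bases(data: bytes) -> str:
    """Translate bytes to bases, two bits per base."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)
    )


def longest_run(seq: str) -> int:
    """Length of the longest stretch of one repeated character."""
    best = run = 0
    prev = None
    for ch in seq:
        run = run + 1 if ch == prev else 1
        best = max(best, run)
        prev = ch
    return best


silence = bytes(200)                  # 200 zero bytes, like a silent stretch
scrambled = scramble(silence, seed=42)
restored = scramble(scrambled, seed=42)
```

Encoded directly, the 200 zero bytes would become a single run of 800 identical bases. After scrambling, the bases look random, and XORing with the same sequence again restores the original bytes.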
 
Long reads
 
The other bit of news in this publication is the use of a relatively 
new DNA sequencing technology that involves stuffing strands of DNA 
through a tiny pore and reading each base as it passes through. The 
technology for this is compact enough that it's available in a 
palm-sized USB device. The technology had been pretty error-prone, but 
it has improved enough that it was recently used to sequence an entire human genome.
 
While the nanopore technique has issues with errors, it has the 
advantage of working with much longer stretches of DNA. So the authors 
rearranged their stored data so it sits on fewer, longer DNA molecules 
and gave the hardware a test.
 
The device had an astonishingly high error rate: about 12 percent by their 
measure. This suggests that the system needs to be adapted to work with 
the DNA samples that the authors prepared. Still, the errors were mostly
 random, and the team was able to identify and correct them by 
sequencing enough molecules so that, on average, each DNA sequence was 
read 36 times.
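
Why reading each sequence many times helps can be shown with a simple majority vote. This Python sketch assumes errors are independent substitutions only. Real nanopore reads also contain insertions and deletions, which would need alignment before voting, and the researchers' actual correction method is not described here.

```python
import random
from collections import Counter

BASES = "ACGT"


def noisy_read(seq: str, error_rate: float, rng: random.Random) -> str:
    """Copy seq, swapping each base for a different one with probability error_rate.

    Substitution errors only; this is a simplifying assumption.
    """
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def consensus(reads: list) -> str:
    """Pick the most common base at each position across equal-length reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


rng = random.Random(0)
original = "".join(rng.choice(BASES) for _ in range(150))
reads = [noisy_read(original, 0.12, rng) for _ in range(36)]
recovered = consensus(reads)
```

With random errors at 12 percent, roughly 32 of the 36 reads agree on the right base at any given position, so the vote is rarely close.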
 
So, with something resembling a filesystem and a compact reader, are 
we moving close to the point where DNA-based storage is practical? Not 
exactly. The authors point out the issue of capacity. Our ability to 
synthesize DNA has grown at an astonishing pace, but it started from 
almost nothing a few decades ago, so it's still relatively small. 
Assuming a DNA-based drive would be able to read a few KB per second, 
the researchers calculate that it would take only about two weeks 
to read every bit of DNA that we could synthesize annually. Put 
differently, our ability to synthesize DNA has a long way to go before 
we can practically store much data.
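
For a sense of scale, the two figures above can be multiplied out. The Python below treats "a few KB per second" as anywhere from 1 to 5 KB/s and "about two weeks" as 14 days; both are assumptions made only for this back-of-the-envelope estimate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_read(rate_bytes_per_s: int, days: int) -> int:
    """Total bytes read at a constant rate over the given number of days."""
    return rate_bytes_per_s * days * SECONDS_PER_DAY


# Assumed range for "a few KB per second", using 1 KB = 1000 bytes.
for kb_per_s in (1, 3, 5):
    total = bytes_read(kb_per_s * 1000, days=14)
    print(f"{kb_per_s} KB/s for 14 days -> {total / 1e9:.2f} GB")
```

Under those assumptions the total comes to somewhere between about 1 and 6 gigabytes, which would put a year of DNA synthesis on the order of a few gigabytes of data.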
 
Nature Biotechnology, 2018. DOI: 10.1038/nbt.4079