By Emma Chory, Chemical Engineering Ph.D candidate at Stanford University School of Medicine:
In 1990, the Human Genome Project was born when a group of scientists set out to determine the sequence of bases (A’s, T’s, C’s, and G’s) that, when pieced together, make up the string of human DNA. Thirteen years and $3 billion later, we had a “complete” map of the 3.3 billion bases that comprise the human genome. Since 2003, the genomics era has taken off and the cost of sequencing a human genome has rapidly declined. Genomicists will proudly tout that reductions in the cost of sequencing have outpaced Moore’s Law, which describes a long-term trend in the computer hardware industry in which the number of transistors that can be fit on a chip doubles roughly every two years.
Since 2011, the cost of personal genomes has continued to decline. Just last year, Illumina announced its “HiSeq X” machine, which can sequence a human genome for only $1,000, less than the cost of a knee MRI. As a result, human genome sequencing can now be used not only to diagnose inherited diseases such as Huntington’s disease or to identify a predisposition to breast cancer, but also to trace ancestry, direct the course of treatment for cancer patients, and detect rare and previously undiagnosable medical conditions.
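To make the Moore’s Law comparison concrete, here is a rough back-of-the-envelope sketch. It assumes, as a simplification, that the $3 billion project budget can be treated as the price of the first genome, and it uses the $1,000 figure quoted above; the function names are purely illustrative.

```python
import math

def halvings_needed(start_cost, end_cost):
    """Number of times a cost must be cut in half to fall from start_cost to end_cost."""
    return math.log2(start_cost / end_cost)

def years_at_moore_pace(start_cost, end_cost, years_per_halving=2.0):
    """Years the same drop would take if cost halved once every years_per_halving years."""
    return halvings_needed(start_cost, end_cost) * years_per_halving

first_genome = 3e9   # assumption: the whole project budget counted as one genome
hiseq_x = 1e3        # the $1,000 genome quoted in the text

print(round(halvings_needed(first_genome, hiseq_x), 1))   # 21.5
print(round(years_at_moore_pace(first_genome, hiseq_x)))  # 43
```

Under these assumptions, a cost that merely halved every two years would have needed roughly four decades to fall this far, which is why the drop is described as beating Moore’s Law.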
It is natural to ask, “If sequencing a human genome is so cheap, what’s left to innovate in sequencing?”
The short answer to this question is that there certainly remains a need to improve the quality of the sequences in order to eliminate errors and false-positive results. Also, current sequencing methods give very little information about repetitive DNA regions, which are thought to make up over two-thirds of the human genome.
The long answer to this question requires understanding a key concept that has made sequencing affordable for human genomes, but not necessarily for the genomes of other important organisms like crops, bacteria, and wildlife. This critically important concept is known as a “reference genome”. Almost all current, patient-driven human genome sequencing technologies rely on a reference genome in order to piece together the millions of tiny DNA fragments, termed “reads”, generated by a sequencing machine. The reference is, for all intents and purposes, the same single genome that was produced by the Human Genome Project in 2003, which took over a decade and billions of dollars to generate. To put the utility of the reference genome into perspective, consider that piecing together a patient’s genome is much like putting together an expert-level puzzle of “Starry Night” that has no edges, some extra pieces thrown in, and a few paint strokes that differ from the original. Now consider assembling the same puzzle when, instead of having an iconic painting to use as a reference, all the pieces are blank. The latter takes much more brain power, more sophisticated planning, and more time. The former represents the massive impact that the reference genome has had on reducing the time and cost, and increasing the ease, of human genomics.
Even with the human reference genome to compare against, piecing reads together is non-trivial, computationally intensive, and error-prone. In fact, many of the major advances that have contributed to the rapidly decreasing cost of a human genome have been strictly computational and algorithmic improvements. The cost-effective sequencing technologies from major players (Life Technologies’ Ion Torrent and Illumina’s HiSeq) generate millions of very short reads very quickly and compile a complete sequence by comparing them to the reference, like looking at the picture on the front of a puzzle box.
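The puzzle-box idea can be illustrated with a toy sketch. The code below is not how any real sequencer's software works; it is a brute-force illustration, with hypothetical function names and made-up sequences, of placing short reads against a reference while tolerating a small number of mismatches, then taking a majority vote at each position. Production aligners rely on indexed data structures rather than a scan like this.

```python
from collections import Counter

def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best placement of read, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
    return best

def consensus(reads, reference, max_mismatches=1):
    """Place every read on the reference and majority-vote each base.

    Positions that no read covers are reported as 'N'.
    """
    votes = [Counter() for _ in reference]
    for read in reads:
        hit = map_read(read, reference, max_mismatches)
        if hit is None:
            continue  # read could not be placed; skip it
        pos, _ = hit
        for offset, base in enumerate(read):
            votes[pos + offset][base] += 1
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)

# Made-up example: the "patient" differs from the reference at one base.
reference = "ACGTTGCAAGGCTA"
patient = "ACGTTGCTAGGCTA"
# Simulate short, overlapping reads taken from the patient's DNA.
reads = [patient[i:i + 6] for i in (0, 2, 4, 6, 8)]

print(consensus(reads, reference) == patient)  # True
```

Each read is located by comparison with the reference, which is the "picture on the box". Without a reference, the reads would have to be ordered using only their overlaps with one another, a much harder problem when reads are short.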
Despite the major advances in sequencing human genomes, the genomes of most other organisms have remained a mystery because making reference genomes requires much longer reads, which are more difficult and expensive to generate. Few people in the 1990s, for example, would have supported spending $1 per base to create a reference genome for the honey badger (despite its ability to face off with lions, survive rattlesnake bites, and be a generally ravenous and impressive animal).
Fortunately for honey badger enthusiasts, there have been many parallel advancements in technologies that can generate longer reads, and this weasel’s reference genome was assembled just this year as a part of the “10,000 Genomes Project”. As a result of these advancements, scientists have begun to sequence the genomes of populations of healthy microbes in the human gut, the globally important crop bread wheat, bacterial pathogens in waste water, and plants and animals that can survive extreme conditions including drought, frost, and high altitudes. These genetic discoveries will enable the development of future staple crops that can feed a growing global population. Further, companies such as Oxford Nanopore (MinION), Pacific Biosciences (real-time RS II), and BioNano Genomics have begun to focus the core of their business on generating new, or de novo, genome sequences, which will advance similar causes.
For the future of human genomics, the major advancements to come relate to increasing efficiency, decreasing the cost of devices, improving computation, and applying lessons learned to patients in the clinic. For the other ~8.7 million species on the planet, there are still huge improvements to be made in whole genome assembly, as the quality of long reads still falls short of that of their short-read, next-generation counterparts. Regardless, the seemingly endless insights to be gained from sequencing new and diverse species will fuel continued investment in groundbreaking sequencing technologies.