By Emma Huang – CSIRO Mathematics, Informatics and Statistics
Imagine a future where doctors take a strand of your hair or a drop of your blood and tell you your DNA predicts a 78 per cent risk of developing heart disease. On the plus side, it also predicts exactly which treatments will work best for you. The genetic code is well enough understood that individual predictions and treatments based on genomics are universal.
In science fiction, this is a “not-too-distant” future. The 1997 film GATTACA, directed by Andrew Niccol, describes a world driven by genomics. At or before birth, doctors calculate your probability of being diagnosed with heart disease, neurological and psychological disorders, and even predict how tall you will be or whether you will need glasses.
For better or worse, this decides the course of your life. In this world a genome sequence is not only cheap, fast, and readily available – it’s also easy to interpret.
Human Genome Project
So let’s compare science fiction with science reality. The Human Genome Project was started in 1990 with aims including sequencing the human genome, identifying and mapping all genes, improving tools for data analysis, and addressing potential ethical, legal and social issues associated with genomics.
It took 13 years and US$3 billion to complete. Since then, sequencing has rapidly become both cheaper and faster.
Now, sequencing a full human genome costs under US$10,000, and takes only two days. It’s cost-effective enough to be performed in large studies and, within a few years, thousands of human genome sequences will be available.
If progress continues at this pace, widespread acquisition of genetic data may indeed be “not-too-distant”. But translating that data into knowledge through analysis and interpretation will take much longer.
Translating DNA into traits
How do we translate the 3 billion As, Cs, Ts and Gs of the human genome into a precise description of an individual’s risk for disease?
We need to identify which parts of the genome are associated with disease. More than 99.5 per cent of the genome is identical between any two humans, but that still leaves 15 million positions to search through, which is akin to finding needles in a haystack.
Many different methods exist, but essentially all of them look for genetic variants that are over-represented in individuals with a disease relative to those without it. These work well in cases where only one variant causes disease (such as Huntington’s disease).
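As a rough illustration of what "over-represented" means in practice, the sketch below runs a Pearson chi-square test on a two-by-two table of allele counts for a single variant. It is one simple approach among the many methods mentioned above, not the method of any particular study. The function name `allele_association` and all of the counts are made up for this example.

```python
import math

def allele_association(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi-square statistic, p-value, odds ratio).
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return chi2, p_value, odds_ratio

# Hypothetical counts: the variant allele is seen on 240 of 1,000 chromosomes
# from people with the disease, and on 180 of 1,000 from people without it.
chi2, p_value, odds_ratio = allele_association(240, 760, 180, 820)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}, odds ratio = {odds_ratio:.2f}")
```

A small p-value at one position is only a starting point. When millions of positions are tested, some will look over-represented by chance alone, so a genome-wide search needs a far stricter threshold than a single test would.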
Indeed, more than 3,000 traits already have a causal gene characterised and annotated. This success forms the basis for a public-health funded system in Australasia to screen newborns for about 30 rare diseases, including cystic fibrosis and hypothyroidism.
The similarities with newborn testing in GATTACA are clear, but in science reality the testing won’t tell you anything about your child’s risk for heart disease.
For this and other more common diseases (such as Alzheimer’s and diabetes) there are many different genetic factors influencing risk. If these act in pairs, the size of the search problem grows from millions of possibilities to more than a hundred trillion. Further, we often need to consider networks of genetic variants, where changes in one pathway have flow-on effects to other regions of the genome.
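The jump in scale can be checked with a few lines of arithmetic. This is a back-of-envelope sketch using only the article's round figures (3 billion bases, about 0.5 per cent differing between two people). The variable names are illustrative, and it assumes every pair of positions is a candidate.

```python
import math

GENOME_LENGTH = 3_000_000_000   # bases in the human genome (round figure)
DIFFERING_PER_THOUSAND = 5      # about 0.5 per cent of positions differ

# Integer arithmetic avoids floating-point rounding on large counts.
variable_positions = GENOME_LENGTH * DIFFERING_PER_THOUSAND // 1000

# Number of unordered pairs and triples of positions that could act together.
pairs = math.comb(variable_positions, 2)
triples = math.comb(variable_positions, 3)

print(f"single positions: {variable_positions:,}")
print(f"pairs:            {pairs:,}")
print(f"triples:          {triples:.2e}")
```

Under these assumptions there are 15 million single positions and a little over 112 trillion pairs. Moving to triples multiplies the count by millions again, which is why searching for networks of variants by brute force is not practical.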
Can we decode the DNA code?
There’s been some success in identifying individual variants affecting disease – at the end of 2011 more than 1,600 studies had reported regions of the genome associated with about 250 different traits. But these rarely account for the full spectrum of genetic effects expected for a given disease.
For human height, which, unlike many disorders, is easy to measure and highly influenced by genetic factors, the nearly 200 variants discovered account for only a small portion of the overall variation. Improving individual prediction will require researchers in biology, statistics, computational science and many other disciplines to work together for years to come.
An overview of the structure of DNA. Image: Michael Ströck
Even if we completely understood the language of the DNA sequence, that would be just the tip of the iceberg. DNA has many effects beyond those of the sequence itself (such as epigenetics), and identical twins, who are not identical people, show that DNA is only part of the story.
Genetics must be considered in the context of environment to improve prediction, because the environment plays a huge developmental role. A relatively simple example is the disease phenylketonuria, which can cause intellectual disability and seizures. This is a disease with a known genetic cause, and is in fact one of the roughly 30 diseases screened for at birth.
The environment is critical in the course of the disease, as patients given a strict diet can lead a normal life. For diseases such as diabetes and cancer, both genetic and environmental factors, as well as interactions between the two, must be considered in order to produce accurate models.
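One common way to write such a model down is a logistic regression with an interaction term, sketched below. It is a toy illustration only: the function `disease_risk` and every coefficient in `COEF` are invented for this example and are not estimates for phenylketonuria, diabetes, cancer or any other real disease.

```python
import math

def disease_risk(risk_alleles, exposure, coef):
    """Logistic risk model with a gene-by-environment interaction term.

    risk_alleles: copies of a risk variant carried (0, 1 or 2)
    exposure:     environmental exposure (0 = absent, 1 = present)
    """
    log_odds = (coef["intercept"]
                + coef["gene"] * risk_alleles
                + coef["environment"] * exposure
                + coef["interaction"] * risk_alleles * exposure)
    return 1 / (1 + math.exp(-log_odds))

# Invented coefficients: small separate effects, large combined effect.
COEF = {"intercept": -3.0, "gene": 0.2, "environment": 0.3, "interaction": 1.5}

for alleles in (0, 2):
    for exposure in (0, 1):
        risk = disease_risk(alleles, exposure, COEF)
        print(f"risk alleles={alleles}, exposure={exposure}: risk={risk:.1%}")
```

With these made-up numbers, carrying the variant or meeting the exposure alone changes the risk only slightly, while the combination raises it sharply. That is the pattern the diet example describes: the same genome can lead to very different outcomes depending on the environment.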
There are certainly success stories in personal genomics, and it has the potential to change medicine and society as a whole.
But even if the future is bright, we’re still a long way from making science fiction into science reality.