The original Human Genome Project started in 1990 and aimed to determine the entire sequence within 15 years. In 2000, a rough draft was announced to celebrate the international cooperation that was making this achievement possible. This draft was an impressive feat on its own, covering 83% of the genome.
In 2003, the human genome was announced to be ‘essentially complete’: 99% of the gene-rich DNA had been sequenced. The importance of noncoding DNA was not fully appreciated at the time, so most people didn’t realize that 8% of the DNA letters still hadn’t been determined.
Ridiculously, only about 1% of the human genome actually codes for proteins. Much of the rest helps regulate when and where those protein-coding genes are switched on. There are also countless stretches of nonsense letters that go on and on with seemingly no purpose. These regions are often called ‘junk DNA’.
An especially interesting example of junk DNA is the repeats. Instead of mutating randomly, some sections get over-replicated. The DNA polymerase enzymes slip whilst reading along the double helix, causing these repeats to grow and grow over generations. This process is implicated in several diseases, such as Huntington’s disease and some cancers.
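To make ‘repeat’ a bit more concrete, here is a minimal Python sketch that counts how many copies of a short unit sit back to back in a DNA string. The function name `longest_tandem_repeat` and the example sequence are invented for illustration; the unit CAG is used because it is the triplet whose expansion is associated with Huntington’s disease.

```python
import re

def longest_tandem_repeat(dna, unit):
    """Return the largest number of back-to-back copies of `unit` in `dna`."""
    # Find every uninterrupted run made only of copies of `unit`.
    runs = re.findall(f"(?:{re.escape(unit)})+", dna)
    # Convert each run's length in letters into a number of copies.
    return max((len(run) // len(unit) for run in runs), default=0)

# A made-up sequence with one run of 3 copies and one run of 2 copies.
example = "TTCAGCAGCAGAACAGCAGTT"
print(longest_tandem_repeat(example, "CAG"))  # 3
```

If slippage adds a copy to the longest run, the number returned simply goes up by one, which is all that ‘growing’ means here.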
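DNA is ‘read’ like Morse code, except that instead of having 2 letters (dot and dash) it has 4: A, T, G and C. This allows it to code for the amino acid alphabet in a shorter space. Each amino acid is spelled out by a sequence of 3 DNA bases called a codon. Because there are 64 possible codons but only 20 amino acids, most amino acids can be spelled by more than one codon.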
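As a rough illustration of the three-letter code, the Python sketch below reads a DNA string three letters at a time and looks each triplet (a ‘codon’) up in a table. The table is a small hand-picked subset of the standard genetic code rather than all 64 entries, and the names `CODON_TABLE` and `translate` are made up for this example.

```python
# A small subset of the standard genetic code, written as DNA triplets.
# The full table has 4 * 4 * 4 = 64 entries; only a few are listed here.
CODON_TABLE = {
    "ATG": "Met",                  # also the usual 'start' signal
    "GAA": "Glu", "GAG": "Glu",    # two spellings of the same amino acid
    "GTG": "Val", "GTT": "Val",
    "CAT": "His", "CAC": "His",
    "CTG": "Leu", "ACT": "Thr", "CCT": "Pro",
    "TAA": "Stop", "TAG": "Stop", "TGA": "Stop",
}

def translate(dna):
    """Read `dna` three letters at a time and return the amino acids it spells."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TABLE.get(codon, "???")  # '???' = not in our subset
        if amino_acid == "Stop":
            break  # a stop triplet ends the chain
        protein.append(amino_acid)
    return protein

print(translate("ATGGAAGAGTAA"))  # ['Met', 'Glu', 'Glu']
```

Note that GAA and GAG both come out as Glu: two different triplets can spell the same amino acid.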
Amino acids are the building blocks of protein. Each of them has slightly different properties, such as positive or negative charge and hydrophobicity. The genetic code is important because it specifies the order in which these building blocks should be linked together in a chain. Each chain folds into a specific 3D shape with an important function, and mutation of a single DNA letter can completely ruin this. For example, sickle cell anemia is caused by a single A->T mutation in a hemoglobin gene, leading to a glutamate amino acid being swapped out for valine.
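The sickle cell change can be replayed in a few lines of Python. The fragment below is modelled on the start of the beta-globin (hemoglobin) gene, but treat the exact letters, the position 19 and the helper names `translate` and `point_mutation` as illustrative assumptions rather than a reference sequence.

```python
# Minimal triplet table: only the triplets that appear in the fragment below.
CODONS = {
    "ATG": "Met", "GTG": "Val", "CAT": "His", "CTG": "Leu",
    "ACT": "Thr", "CCT": "Pro", "GAG": "Glu", "AAG": "Lys",
}

def translate(dna):
    """Translate `dna` three letters at a time using the table above."""
    return [CODONS[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)]

def point_mutation(dna, position, new_base):
    """Return a copy of `dna` with the letter at `position` replaced."""
    return dna[:position] + new_base + dna[position + 1:]

# Illustrative fragment; position 19 (counting from 0) is the A in a GAG triplet.
normal = "ATGGTGCATCTGACTCCTGAGGAGAAG"
sickle = point_mutation(normal, 19, "T")  # GAG becomes GTG

print("-".join(translate(normal)))  # Met-Val-His-Leu-Thr-Pro-Glu-Glu-Lys
print("-".join(translate(sickle)))  # Met-Val-His-Leu-Thr-Pro-Val-Glu-Lys
```

Only one letter out of 27 differs between the two strings, yet the chain it spells now carries a valine where a glutamate used to be.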
Despite being just simple repeating patterns, repeat regions were practically impossible to sequence with the techniques available during the 1990s. This is because those techniques (known as shotgun sequencing) involved digesting the DNA into tiny fragments and determining the bases bit by bit. The fragments were then put in order by finding where they overlapped. Unfortunately, fragments from inside a repeat all look the same, so the full sequence can look like utter nonsense:
“cccccc aaaaa once upon a time there aaaaaaaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbb ccccccccccccc was a man called bbbbbbbbbbbbbbbbbbbbbbbbbbb Steven bbbbbbbbbbbbbbbbbbbbbbbbbb aaaaaaaa”
The sentence is interrupted by 3 repeated stretches of ‘b’. Imagine that, when this sentence is cut up for shotgun sequencing, the cuts fall in the middle of the b repeats. You would have no way of knowing the order in which the pieces are supposed to fit together! You could end up with a sentence like this:
“cccccc aaaaa once upon a time there aaaaaaaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbbbbb Steven bbbbbbbbbbbbbbbbbbbbbbbbbbbbb ccccccccccccc was a man called bbbbbbbbbb”
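The same problem can be shown with a toy Python experiment. The sketch below builds two ‘genomes’ that differ only in the order of the pieces between the b repeats, cuts each into every possible fragment of a given size, and compares the piles. The sentences, the repeat length of 12 and the function name `fragments` are all assumptions made for the example, and real reads also contain errors that are ignored here.

```python
from collections import Counter

REPEAT = "b" * 12  # a repeat longer than the short fragments used below

# Two toy 'genomes': the same pieces, with the two middle ones swapped.
original = ("once upon a time " + REPEAT + "there was a man " + REPEAT
            + "called Steven " + REPEAT + "the end")
shuffled = ("once upon a time " + REPEAT + "called Steven " + REPEAT
            + "there was a man " + REPEAT + "the end")

def fragments(text, size):
    """Count every overlapping fragment of `size` letters in `text`."""
    return Counter(text[i:i + size] for i in range(len(text) - size + 1))

# Short fragments (10 letters) never reach across a whole repeat,
# so both sentences produce exactly the same pile of fragments.
print(fragments(original, 10) == fragments(shuffled, 10))  # True

# Long fragments (30 letters) span a repeat plus the text on both sides,
# so the two piles differ and the true order can be recovered.
print(fragments(original, 30) == fragments(shuffled, 30))  # False
```

With 10-letter fragments, no amount of clever overlapping can tell the two sentences apart, because the evidence is identical. Only fragments long enough to bridge a whole repeat carry the information about what sits on either side of it.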
Over the last few decades, more powerful sequencing techniques such as Oxford Nanopore have been developed. This has allowed these mysterious regions to be explored. The Telomere-to-Telomere (T2T) Consortium is a team of over 100 scientists who are working to fill in the gaps.
We have 2 copies of the genome in our cells, 1 from each parent. Keeping track of which copy of each chromosome came from which parent could be a massive problem and cause alignment errors. Therefore, this new sequencing project used cells from a special type of growth called a hydatidiform mole. This arises when an empty egg cell, lacking its own copy of the genome, is fertilized by a healthy sperm cell. The sperm’s DNA is then duplicated, so both copies of every chromosome are identical, which makes the genome much easier to sequence and assemble.
The downside of this method is that it leaves out one of the sex chromosomes, because a sperm carries either an X or a Y. They chose cells derived from an X sperm because the X chromosome is far bigger and would be difficult to obtain on its own. To cover the missing Y, Leonid Peshkin (a biologist at Harvard University) donated a Y chromosome sample from his own genome.
Oxford Nanopore sequencing was used to fill in gaps in the centromeres. These are patches of each chromosome that don’t contain any genes; instead they serve as handles for the spindle proteins to grab hold of during mitosis. A different technique called PacBio HiFi was used for many repeating sequences. The key thing that both techniques have in common is that they are ‘long-read’ and able to process huge fragments: tens to hundreds of thousands of letters at a time.
After this brilliant effort, all the gaps in the human genome have been filled except for 5. Only ~10 million letters remain to be sequenced.
Even after these tiny gaps are closed, the goal of the Human Genome Project is not quite fulfilled. There are many genes that vary significantly throughout the population. You will no doubt have observed this for superficial things like eye and hair colour, but more pernicious differences lurk below the surface.
Mutant versions of key signalling proteins are linked with an increased risk of diseases such as cancer and Alzheimer’s. Understanding how their genes vary between different people could be an important step in treating these devastating conditions. The Human Pangenome Project is working towards sequencing 350 different genomes from diverse populations.