Tim Hubbard claims he knows nothing about genetics. But he was drawn into the
high-stakes world of genomics by a job offer he couldn’t refuse. Hubbard had
been working on algorithms for predicting protein structures at the MRC Centre for
Protein Engineering in the United Kingdom when he noticed that the Sanger Institute in
Hinxton was looking to hire some new bioinformaticists. “I really wanted to
continue what I was doing,” he recalls. “But when I came to
interview, they said, ‘Well, that would be fine, but there’s also a
more senior position open. It would just involve looking after the annotation of the
human genome, which would hardly take up any of your time.’” Hubbard
hasn’t done any structure prediction since.
When he
arrived in 1997, Sanger was “a sequencing factory,” says Hubbard.
Scientists at the Institute were just wrapping up the worm sequence and were gearing up
to tackle the human genome. Then came Celera’s announcement that it, too, had
its eye on the prize. “Two days later,” says Hubbard, “the
Wellcome Trust doubled the amount of money we’d been given because it was
believed that the human genome must remain in the public domain.”
And so ensued a scramble to figure out how to make that happen.
“I thought that if we could work out a pipeline to annotate the data in real
time, we could put that information out there quickly,” says
Hubbard—a move he hoped would “lift the bar” by preventing
people from trying to claim patent rights on every string of nucleotides that looked
like it might contain an interesting human gene.
That automated
annotation system—which analyzed sequence data and flagged the potential
genes—evolved into Ensembl, a one-stop shopping source of information for
vertebrate and other eukaryotic genomes. “Ensembl was an incredibly ambitious
project and is now a vital resource for the community,” says David Haussler of
the University of California (UC), Santa Cruz. “Tim has strengthened that
system with his ability to see all the way from DNA to protein structure. Very few
people really understand the details of that entire spectrum. Tim is one of those rare
individuals. He’s done great work.”
That work
also catapulted Hubbard to the forefront of genome informatics. “Tim has taken
on a leadership role in organizing human genetic information and discussing how that
information can be used for different purposes, including questions of
healthcare,” says Steven Brenner of UC Berkeley. Hubbard has also taken over as the
central coordinator for all informatics at the Sanger. “If you look at what
the Sanger Center does that’s important, Tim has a major role in a lot of
those activities. So he’s really had a huge impact on the whole
field.”
PROTEIN PROPHECIES
When Hubbard was a boy, the fields he
influenced were on the family farm, although he tended to view things from an
engineer’s perspective. “In summer I had to fetch and carry bales of
hay,” he says. “But I was more interested in whether I could get a
better packing arrangement on the truck so I could carry more at once.”
As a graduate student at Birkbeck College, University of
London, Hubbard seriously flexed his engineering muscles as he attempted to
design—or redesign—a protein. He started with an eye-lens protein
called crystallin, as its structure was being solved in the lab at the time, and he
added what he thought would be a copper-binding site. “At that stage we
weren’t looking to make something useful,” Hubbard says.
“We just wanted to see what you could do.” In hindsight, adding a
metal binding site was a bit “outrageous,” because he was inserting
a cluster of charged amino acids into the middle of a highly structured protein.
“The histidines I put in probably wound up just flopping around the
surface,” says Hubbard. But he was able to synthesize the altered gene and
express his souped-up protein, which he called crystanova. “So I got a band on
a gel and my PhD.”
It was during his postdoctoral
fellowship in Japan in 1989 that Hubbard heard about a new center for protein
engineering being set up in association with the MRC Laboratory of Molecular Biology in
Cambridge. When he moved to the MRC, Hubbard decided that it might be easier to predict
protein structures, based on their amino acid sequences, than to learn about structure
by trying to design new polypeptides. As part of the process, he and his
colleagues, including Brenner and Alexey Murzin, who were then at the
MRC, built the first comprehensive database that classified proteins according
to their structural and evolutionary relatedness. Proteins from the same family often
share similar features. So checking the database, called SCOP, could help investigators
predict a protein’s structure by seeing what its relatives look like.
But SCOP did not solve the biggest practical problem facing
structure prediction: how to design a fair test to see if your algorithm works? If you
train your program on a structure that’s already known, how can you be sure
you didn’t arrive at the correct structure because you knew the answer in
advance? And if you attack a structure that hasn’t yet been solved, Hubbard
says, “you might have to wait 20 years to find out whether you got the answer
right.”
The solution was CASP: a competition in which
programmers unleash their algorithms on a set of protein sequences whose structures have
recently been solved but are kept secret until the meeting. “It was very
exciting,” says Haussler. “Like the Academy Awards. ‘And
the winner is…’”
Well, no one,
really. At least at the first CASP in 1994. “Basically everyone did
appallingly badly,” says Hubbard, who submitted his own predictions in CASP1
and then helped to organize CASP2 through CASP7. “The gap between how good we
thought we were and how good we really were was huge.” But the quality of the
predictions—and the number of participants—has since increased.
“Now everyone in the field has to take part in this meeting if they want to be
taken seriously,” says Hubbard.
CRYSTAL-BALL CODING
Predicting where genes are is just as challenging as predicting protein
structure—if not more so. “At least with structure prediction you
know what the real answer is: you do an x-ray crystallography study and you get the
structure,” says Hubbard. “But in the case of genomic sequence,
well, how many genes are there?” In his early days at the Sanger,
Hubbard tested out the gene-predicting algorithms of the day by scanning a 1.2-megabase
region around the BRCA2 gene. Because the region had been studied
extensively, he knew it housed eight genes with quite a lot of exons. He discovered that
even the best programs tended to overestimate the number of exons. “And if you
tried to predict whole gene structures, it was much, much worse,” he says. But
feeding the algorithms experimental data—for example, snippets of sequence
that were found to be expressed in living cells—made the predictions much more
accurate. “We have a pretty simple standard,” says Paul Flicek, a
colleague at the European Bioinformatics Institute (EBI), which shares a campus with
Sanger. “We want to get all the genes right, all of the time.”
By coupling computation with experimental data, Hubbard wrote a
program that assembled the first gene set for the human genome, which hit the Web in
1999. “Tim is a real hacker in the best sense of the word,” says
former student Jong Bhak, director of the Korean Bioinformation Center. “He
codes to solve problems. He works quickly and then moves on. He doesn’t show
off, doesn’t do anything unnecessary—he just gets the work
done.”
EBI’s Ewan Birney agrees.
“Tim has written some awful pieces of code that worked. Which is far better
than perfect bits of code that don’t work,” he says. “I
have fond memories of the horrendous system that originally ran Ensembl. The only person
who understood it was Tim. It was kind of hideous, but it worked.”
Hubbard says that in the future, gene-prediction programs will need to
become good enough to find genes without the aid of experimental data or
comparative genome analyses to guide them. “Because that’s
cheating,” he says. “For example, an RNA polymerase does not go and
look at the mouse genome when it’s working out whether to transcribe a
particular stretch of human sequence. But that’s what many of our algorithms
do now.” Instead, he says that annotation programs should take an RNA
polymerase–eye-view of the sequence, modeling the biology closely enough to
accurately locate and assess the activity of genes. As we move into an era of personal
genomics, such an approach will be necessary for predicting the effect that a certain
SNP variant might have on gene function. He and his team have had some early success,
producing a transcription start-site predictor that nails about half the genes in a
genome sequence with very few false positives.
Hubbard also
spends quite a bit of time working on issues of open access and the economics of
innovation. “Governments are spending all this money for research and then not
maximizing its value because they’re not investing enough in making sure
people can access and reuse that data,” says Hubbard, who has discussed these
issues at meetings of the Organisation for Economic Cooperation and Development (OECD)
and the World Health Organization. Much of this work he does in his spare time.
“Other people go fishing,” laughs Birney. “Tim likes to
reform international patent law and go to UN conferences to discuss how open-access
agreements should be arranged to maximize the way science gets translated into
meaningful outcomes.”
Those outcomes, of course,
include potential improvements in the diagnosis and treatment of disease, which makes
the issue more urgent and more fraught. “If you look at the health
implications of all the work being done in genomics, the opportunities are tremendous
and the obstacles are staggering—and a lot of those are political,”
says Haussler. “I just have the ultimate respect for Tim, as he’s
willing to move through those political hurdles and try to get things to
happen.”
“In a way, Tim’s
contribution to the scientific endeavor is a very interesting one and rather different
from most scientists,” says EBI director Janet Thornton. “Although
he’s had a hand in producing many of the big genome publications, his unique
input lies in his broad perspective, his sense of fairness, and his openness to new
ideas. His diplomatic efforts have really been fundamental in making these large-scale,
collaborative genomics projects work—and in making the data available so that
the science can be put to good use for biology and medicine around the world.”
“A lot of things can be done by one person with a
computer,” adds Flicek. “If the Internet age taught us anything,
it’s taught us that.”