EVENT: Big Data for Better Medicine
UCSC has built the Cancer Genomics Hub (CGHub) for the US National Cancer Institute to hold data for all major NCI projects. To date it has served more than 15 petabytes of data to the research community. Cancer is exceedingly complex, with thousands of subtypes involving an immense number of different combinations of mutations. The only way we will understand it is to gather DNA data from many thousands of cancer genomes, so that we have the statistical power to distinguish recurring combinations of mutations that drive cancer progression from “passenger” mutations that occur by random chance. Currently, with the exception of a few international research projects, most cancer genomics research takes place in research silos, with little opportunity for data sharing. If this trend continues, we will lose an incredible opportunity.
Soon cancer genome sequencing will be widespread in clinical practice, making it possible in principle to study as many as a million cancer genomes. For these data to also advance our understanding of cancer, we must soon begin moving data into a network of compatible global cloud storage and computing systems, and design mechanisms that allow genome and clinical data to be used in research with appropriate patient consent. The Global Alliance for Genomics and Health was created to address this problem. Our Data Working Group is designing the future of large-scale genomics for cancer and other diseases. This is an opportunity we cannot turn away from.
David Haussler develops new statistical and algorithmic methods to explore the molecular function, evolution, and disease process in the human genome, integrating comparative and high-throughput genomics data to study gene structure, function, and regulation. He is credited with pioneering the use in genomics of hidden Markov models (HMMs), stochastic context-free grammars, and discriminative kernel methods. As a collaborator on the international Human Genome Project, his team posted the first publicly available computational assembly of the human genome sequence on the Internet on July 7, 2000. His team subsequently developed the UCSC Genome Browser, a web-based tool that is used extensively in biomedical research and serves, along with the European Ensembl platform, virtually all large-scale vertebrate genomics projects, including NHGRI’s ENCODE project, the 1000 Genomes Project, and NCI’s TCGA. As the first designated Trusted Partner of the NIH, he built the CGHub database to hold NCI’s cancer genome data and is a co-founder and organizing member of the Global Alliance for Genomics and Health (GA4GH), a coalition of the top research, health care, and disease advocacy organizations that have taken the first steps to standardize and enable secure sharing of genomic and clinical data.