By Charles H. Weaver, MD, and Dayna Deuter
January 2005
Introduction
Successful mapping of the human genome has led to an explosion of research in medical genomics, a field with increasingly important implications for clinical oncology. Advances in molecular biology research techniques have allowed characterization of DNA sequences, identification and quantification of transcribed DNA through microarray analysis of RNA, and identification and quantification of proteins, the end products of the genome. These three disciplines within molecular biology are referred to as genomics, transcriptomics, and proteomics, respectively.
An increasing understanding of the molecular basis of disease has generated valuable tools for clinical diagnosis and treatment of cancer. Currently, medical oncologists have the ability to identify genetic determinants of cancer that allow for diagnosis of some pre-symptomatic, at-risk individuals. Also, this growing body of knowledge has identified genetic determinants that influence drug response and has led to development of targeted therapies.
Background--Molecular Genetics
DNA and Genes
The importance of DNA lies not only in its role in heredity, but also in its ability to control the formation of structural and functional substances in the cell. DNA is a double-stranded helical molecule. Each strand is composed of nitrogenous bases (two purines: adenine and guanine, and two pyrimidines: thymine and cytosine) on a backbone of deoxyribose sugar and phosphoric acid. The two strands are held together by hydrogen bonds between complementary bases (adenine pairs with thymine, guanine with cytosine). The genetic code is contained in the base sequence of the DNA strand in the form of triplets, which are three successive bases. The successive triplets ultimately determine the sequence of amino acids in a protein molecule, and thus the structure and function of the proteins synthesized in the cytoplasm of the cell.
DNA is organized into genes, which are long segments of DNA that include protein-coding regions (exons) and non-coding regions (introns). Genes are defined as the basic unit of heredity. The structure of DNA, and thus of genes, remains relatively constant through cell division because of replication. Just prior to cell division, both strands of DNA are used as templates for the synthesis of new complementary strands.
Gene Expression
The information contained in a given gene is expressed and regulated through a complex system that involves transcription of the DNA sequence contained in the gene into an RNA molecule, processing of that RNA, translation of the RNA into proteins, and posttranslational and cotranslational modifications of the proteins.1
The genetic code is transferred from the DNA in the cell nucleus to the cytoplasm, where protein synthesis takes place, by way of messenger RNA (mRNA). mRNA is an unpaired, single-stranded molecule containing codons that exactly complement the code triplets of genes. Notably, mature mRNA does not contain introns, the non-coding regions of genes; these are removed during RNA processing. mRNA is synthesized in a process called transcription. During transcription, the two strands of the DNA molecule separate temporarily and one is then used as a template for synthesis of the RNA molecule.
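The flow from DNA triplets to mRNA codons to amino acids can be sketched in a few lines of Python. This is a minimal illustration, not a full model of gene expression: the template sequence is hypothetical, the codon table is only a small excerpt of the standard genetic code, and RNA processing (intron removal) is ignored.

```python
# Base pairing used when RNA is built on a DNA template strand.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Small excerpt of the standard genetic code (mRNA codon -> amino acid).
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCA": "Ala",
    "UGG": "Trp", "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def transcribe(template_strand: str) -> str:
    """Return the mRNA complementary to a DNA template strand (read 3'->5')."""
    return "".join(DNA_TO_RNA[base] for base in template_strand)

def translate(mrna: str) -> list[str]:
    """Read successive codons and return amino acids up to the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein

template = "TACAAACCGCGTACCATT"  # hypothetical template strand
mrna = transcribe(template)
print(mrna)             # AUGUUUGGCGCAUGGUAA
print(translate(mrna))  # ['Met', 'Phe', 'Gly', 'Ala', 'Trp']
```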
Genomics
Genomics is defined as the comprehensive study of whole sets of genes, gene products, and their interactions. Perhaps the highest profile accomplishment in genomics is the mapping of the human genome. Currently, the human genome is considered to be in “finished form”, consisting of approximately 3 billion base pairs per haploid set of chromosomes with fewer than 1 error per 10,000 bases of DNA.2 The Human Genome Project has identified approximately 30,000-70,000 genes and many intergenic DNA sequences. The human genome is composed of approximately 1% exons, 24% introns, and 75% intergenic DNA. More than half of the total genome (55%) is intergenic DNA composed of repetitive sequences.1 Thus, the human genome is made up predominantly of non-coding sequences.
DNA sequencing entails mapping the nucleotide chain. This is accomplished with the use of specific chemical analogues of nucleotides (dideoxynucleotides), which are capable of terminating chain extension at specific nucleotide bases. Synthesis of the complementary strand of DNA is initiated, then halted at specific nucleotide bases, resulting in the synthesis of DNA fragments with lengths that vary by 1 nucleotide. These fragments are radiolabeled or tagged with fluorescent dyes and then separated by size with gel electrophoresis, allowing identification of the specific nucleotide at the end of each successive fragment.1
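The logic of reading a sequence from chain-terminated fragments can be illustrated with a toy simulation. It assumes an idealized reaction in which every possible fragment length is produced exactly once and each fragment's labeled terminal base is known. The strand and function names are hypothetical.

```python
import random

def terminated_fragments(new_strand: str) -> list[str]:
    """Every prefix of the newly synthesized strand, each ending at a terminator."""
    return [new_strand[:n] for n in range(1, len(new_strand) + 1)]

def read_sequence(fragments: list[str]) -> str:
    """Order fragments by size (as a gel does) and read each terminal base."""
    by_size = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in by_size)

strand = "GATTACAGGC"        # hypothetical newly synthesized strand
fragments = terminated_fragments(strand)
random.shuffle(fragments)    # fragments are mixed together in the reaction tube
print(read_sequence(fragments) == strand)  # True
```

Because the fragments differ in length by one nucleotide, sorting by size is enough to recover the base order regardless of how the fragments were mixed.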
Because the human genome is so large and so much of it is composed of repetitive and non-coding sequences, a method for generating only expressed sequences has been developed. Expressed sequence tags (ESTs) are produced by reverse-transcribing mRNA back into DNA. Thus, ESTs represent only those DNA sequences that are expressed and therefore affect protein synthesis.3
Laboratory Methods used for Molecular Genetic Testing
DNA testing investigates alterations in a gene that result in disease. Finding a disease-causing mutation can confirm a suspected clinical diagnosis, identify a disease carrier, or reveal a pronounced genetic predisposition to disease. Several techniques have been developed recently:
Polymerase chain reaction (PCR): PCR is an in vitro laboratory method that is useful for genetic testing for disease and detecting minimal residual disease because it amplifies a segment of DNA from a small sample, making it detectable. With PCR, relatively small sequences of known DNA can be replicated into millions of copies over a short period of time.
This method requires four principal components: 1) the sample DNA, 2) an ample supply of nucleotides, 3) a heat-stable polymerase enzyme, which is responsible for copying DNA, and 4) primers, short sequences of nucleotides that flank the DNA fragment of interest and signal the polymerase to begin replication of the specific DNA segment.
PCR is a three-step process, with each step occurring at a different temperature. The sample DNA is first heated to approximately 90ºC in order to separate the two paired DNA strands. Once the strands are separated, the sample is cooled to approximately 40ºC, a temperature that allows the primers to hybridize to their complementary sequences on the target DNA. Lastly, DNA replication occurs at approximately 70ºC, the temperature at which the heat-stable DNA polymerase is most active. This cycle is repeated 20 to 30 times, resulting in approximately 1 million-fold or greater amplification of the DNA fragment of interest.1
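The amplification figure follows from repeated doubling. The short sketch below works through the arithmetic, assuming each cycle multiplies the number of target copies by (1 + efficiency). The `efficiency` parameter is an illustrative assumption; an efficiency of 1.0 means perfect doubling, which real reactions only approximate.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Copies of the target after a number of cycles, assuming a constant per-cycle efficiency."""
    return initial_copies * (1 + efficiency) ** cycles

for cycles in (20, 25, 30):
    print(cycles, f"{pcr_copies(1, cycles):,.0f}")
# 20 1,048,576
# 25 33,554,432
# 30 1,073,741,824
```

Under ideal doubling, 20 cycles yield roughly a million copies per starting molecule and 30 cycles roughly a billion; lower efficiencies reduce these figures.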
Reverse transcription (RT)-PCR: Detecting and quantifying specific mRNA from biological samples is another application of genomic research that utilizes a specialized laboratory method called reverse transcription (RT)-PCR. In RT-PCR, traditional PCR is combined with reverse transcription to amplify mRNA, and therefore the DNA sequences that are expressed. Recent uses of RT-PCR in clinical oncology include detection of lymph node micrometastases in prostate cancer and bone metastases in breast cancer.1
Restriction enzyme analysis: Restriction endonucleases are enzymes that cut DNA at specific sites, each marked by a short recognition sequence, typically 4 to 8 nucleotides long. Restriction digestion of DNA is the key to recombinant technology and gene mapping. Locating a disease gene is the first step toward cloning the gene itself.
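As an illustration of site-specific digestion, the sketch below cuts a sequence wherever a recognition site occurs. It uses the well-known enzyme EcoRI, which recognizes GAATTC and cuts after the G, as the default. The input sequence is hypothetical, and only one strand is modeled, so the staggered cut on the opposite strand is not shown.

```python
def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Split a single-strand sequence at every occurrence of a recognition site.

    cut_offset is the position within the site where the cut falls
    (1 for EcoRI, which cuts between G and AATTC).
    """
    fragments = []
    start = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])
    return fragments

dna = "TTGAATTCGGCCGAATTCAA"  # hypothetical sequence with two EcoRI sites
print(digest(dna))  # ['TTG', 'AATTCGGCCG', 'AATTCAA']
```

The lengths of the resulting fragments are what a gel would resolve, which is why changes that create or destroy a site alter the fragment pattern.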
Bioinformatics in Genomic Research
The field of bioinformatics has developed in response to the enormous amount of genomic information that has resulted from research in molecular biology. Bioinformatics is the application of information science, computer science, mathematics, and other associated technologies to biology. Completion of the human genome mapping was advanced by approximately 3 years due to the contribution of bioinformatics.
Another significant contribution that bioinformatics has made to genomic research is the development and maintenance of nucleotide sequence databases. These databases are valuable because they provide critical tools for the prediction of sequence and structure for a given nucleotide sequence. Accurate structural information is critical when trying to locate new genes because coding regions only account for 1%-2% of the human genome. The structure information helps accurately predict the location of these coding regions. Also, databases with genomic information for many organisms are useful for comparing gene sequences and predicting function. Because many genes are repeated in different species, their function can be more easily studied in small organisms, such as yeast, then extrapolated to organisms with larger genomes that are more difficult to study.4
There are three major databases of publicly available information: GenBank, EMBL, and the DNA Data Bank of Japan. The format of each varies substantially, but they contain identical sequence information, including both DNA and cDNA sequences. GenBank is the National Institutes of Health's genetic sequence database. Currently, GenBank contains more than 17 billion bases from more than 100,000 species. It is an uncurated database and, as a result, is highly redundant.5
Genomics in Clinical Oncology
Advances in genomics have led to promising applications in clinical oncology. In particular, the growing body of knowledge regarding the molecular basis of disease has generated valuable tools for diagnosis and treatment of various cancers. Molecular diagnostics has proved useful for:
Identifying presymptomatic at-risk individuals
Determining optimal treatment for patients who have cancers with a genetic component
Predicting drug response
Developing targeted therapies.
Identifying presymptomatic at-risk individuals: Some useful markers of cancer susceptibility have been identified. However, the appropriateness of genetic-genomic screening for cancer in asymptomatic patients involves critical ethical issues and depends on test sensitivity, specificity, and feasibility.
Perhaps the most commonly recognized genetic predictors of cancer are the BRCA1 and BRCA2 tumor suppressor genes, the mutation of which confers a lifetime risk for breast and ovarian cancer of 50-85% and 15-45%, respectively. Genetic testing is also used to identify individuals with either of two hereditary colorectal cancer syndromes, hereditary non-polyposis colorectal cancer and familial adenomatous polyposis, and to guide screening and predictive genetic testing in at-risk family members. Also, an increased risk for bladder cancer can be detected in cigarette smokers who possess a mutation of N-acetyltransferase, the enzyme responsible for detoxifying aromatic amines, known carcinogens found in tobacco smoke.6
Determining optimal treatment: Knowing the genomic constitution of patients with malignancy may substantially affect clinical practice. Patients with mutations in cancer susceptibility genes have an unusually high risk of developing more than one primary malignancy; thus, the treatment strategy for these individuals may differ from that recommended for an individual without a genetic predisposition. In addition, molecular markers can refine risk assessment. For example, HER2/neu status has been useful for distinguishing patients at higher risk of breast cancer recurrence.6
Predicting drug response: The effect of genetic variation on drug metabolism is an area of intense research. Currently, tests are clinically available to determine genetic variation in two key enzymes involved in the metabolism of many drugs: cytochrome P-450 (CYP) and methyltransferase. In the future, pharmacogenomic assays may be available to determine whether a particular patient is likely to be a responder or a nonresponder to treatments such as cholesterol-lowering agents, antipsychotic or antihypertensive medications, and chemotherapeutic agents.6
Developing targeted therapies: An increased understanding of molecular genetics has allowed insight into the molecular basis of various malignancies. This has led to the development of targeted therapies based on the molecular difference between malignant and normal cells. For example, the development of a targeted therapy for chronic myeloid leukemia (CML) resulted from identification of the molecular basis of the disease. Investigations determined that the basis of CML is a reciprocal translocation between the long arms of chromosomes 9 and 22. The gene resulting from this fusion produces tyrosine-kinase activity that causes activation of various signaling pathways, ultimately leading to alteration in the proliferative and survival properties of CML cells.6 Recently, the Food and Drug Administration approved an oral drug, Gleevec®, for the treatment of CML that was specifically designed as a selective inhibitor of this tyrosine kinase activity.
In addition, the discovery of the HER2/neu oncogene has led to the targeted treatment, Herceptin®, for cancers that overexpress HER2. Herceptin® is a monoclonal antibody which targets the protein product of the HER2/neu oncogene and inhibits proliferation of these tumor cells. The FDA has approved Herceptin® as a single-agent treatment of metastatic breast cancer that is characterized by HER2 overexpression. Treatment with Herceptin® has produced superior median overall survival rates in patients with metastatic breast cancer, compared with other second-line chemotherapy.6
Transcriptomics--Microarray Technology
Much of the knowledge that has been gained in genomics has been due to DNA microarray analysis. Microarray analysis examines gene expression at the level of mRNA transcripts. Microarray data is generated through the interaction, by base pairing, between cDNA samples, called targets, and DNA molecules derived from comprehensive public databases, called probes. The abundance of specific targets represents the amount of mRNA from the original sample. This abundance is detected by the amount of base-pairing which occurs with the probes. Genes that are up-regulated will result in more mRNA, and thus more base-pairing in the microarray analysis, and vice versa for genes that are down-regulated.
The objectives of analyzing microarray data are three-fold:
To identify altered genes or biochemical pathways associated with particular diseases
To identify new molecular classes of disease (class discovery)
To predict diagnosis and classification of unknown samples (class prediction)
Through DNA microarray analysis, an up-regulated or down-regulated gene can be identified across a population of patients with the same disease. Once identified, this molecular basis of disease may be targeted in the development of new drugs. Constructing a gene profile for a disease from the evaluation of many samples leads to class discovery, and subsequently, more accurate diagnosis of new samples, known as class prediction.
For example, microarray analysis has contributed to oncology by expanding understanding of the molecular basis of B-cell non-Hodgkin's lymphoma (BCNHL), acute leukemia, and breast cancer. Considerable knowledge regarding the pathology of BCNHL has been gained by comparing gene expression patterns of diseased and normal tissue. Two morphologically distinct disease categories display equally distinct gene expression profiles. Establishment of these expression profiles may help in accurately classifying new cases of BCNHL (class prediction). In the case of acute leukemia, distinct gene expression patterns were generated that correlated highly with acute lymphocytic leukemia (ALL) and acute myeloid leukemia (AML). Using these profiles, 29 of 34 new cases of leukemia were correctly predicted. Identification of distinct gene expression profiles in BRCA1- and BRCA2-associated breast cancer suggests different pathways of disease pathogenesis and provides clues that promote further understanding of the cause of breast cancer.7
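Class prediction can be sketched as assigning a new sample to whichever known class has the most similar average expression profile. The example below uses a nearest-centroid rule as a simple stand-in; it is not the method used in the studies cited above. The expression values are invented for illustration.

```python
import math

def centroid(profiles: list[list[float]]) -> list[float]:
    """Average expression of each gene across the samples of one class."""
    return [sum(values) / len(values) for values in zip(*profiles)]

def predict(sample: list[float], centroids: dict[str, list[float]]) -> str:
    """Return the class whose centroid is closest (Euclidean distance) to the sample."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))

# Hypothetical training profiles: expression of three genes per sample.
training = {
    "ALL": [[9.0, 2.0, 5.0], [8.0, 3.0, 5.0]],
    "AML": [[2.0, 9.0, 5.0], [3.0, 8.0, 5.0]],
}
centroids = {label: centroid(profiles) for label, profiles in training.items()}

print(predict([7.5, 3.5, 5.2], centroids))  # ALL
print(predict([3.0, 7.0, 4.8], centroids))  # AML
```

In this toy data the third gene is uninformative because it is expressed equally in both classes, which is one reason gene selection usually precedes classification.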
Steps in Microarray Analysis
While different methods for microarray analyses are utilized, each consists of five basic steps: 1) preparation of the target, 2) hybridization, 3) scanning, 4) normalization and 5) computational analysis.
In the initial step, cDNA is synthesized by reverse transcription from RNA that has been extracted from both a test and a reference sample.
During hybridization, the targets are incubated with probes on a solid support called a microarray chip, which consists of a rectangular grid of “spots” to which the probes are attached. The targets are labeled with fluorochromes or radioactive isotopes so that they can be detected after hybridization.
Once hybridization is complete, scanners are used to detect and digitally image the fluorescence from array spots with successful probe-target linkage.
Because raw signal intensities may vary between individual chips from many patients or experiments, individual chip intensity must be adjusted to a common standard, or normalized. For example, subtraction of background noise is a common normalization method that is applied to all samples. Normalization makes it possible to compare gene expression profiles from many patients or experiments.
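A minimal sketch of normalization follows. Background subtraction comes from the description above; rescaling each chip to a shared median is an additional assumption made here to show how chips of different overall brightness can be brought to a common standard. The intensity values and the `target_median` parameter are hypothetical.

```python
from statistics import median

def normalize(raw: list[float], background: list[float],
              target_median: float = 1000.0) -> list[float]:
    """Subtract per-spot background, then rescale the chip to a common median."""
    corrected = [max(signal - noise, 0.0) for signal, noise in zip(raw, background)]
    scale = target_median / median(corrected)
    return [value * scale for value in corrected]

# Two hypothetical chips measuring the same sample at different overall brightness.
chip_a = normalize([1200, 2200, 3200], [200, 200, 200])
chip_b = normalize([650, 1150, 1650], [150, 150, 150])
print(chip_a)  # [500.0, 1000.0, 1500.0]
print(chip_b)  # [500.0, 1000.0, 1500.0]
```

After normalization the two chips report the same values, so any remaining differences between chips can be attributed to biology rather than to scanning conditions.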
The final step in a microarray experiment is computational analysis. The thousands of raw data points that result from microarray analyses are essentially unintelligible unless aggregated across populations in order to identify similarities and differences in expression, which in turn yield genetic markers of disease and molecular targets for intervention. For example, the gene expression profiles of normal and diseased tissue can be compared to identify genes that are differentially expressed. Also, clustering of similarly expressed genes may generate a pattern (profile) that may be useful in the separation of distinct classes or stages of disease.7
Bioinformatics in the Analysis of Microarray Data
Bioinformatic tools that have been most commonly used to analyze microarray data include gene filtering and two types of gene-grouping techniques: hierarchical clustering and self-organizing maps.
Gene-filtering: Gene-filtering reveals unusual patterns of expression in abnormal or diseased tissue compared to healthy tissue, including over-expression or under-expression of genes.8
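One simple way to filter genes is by fold change between diseased and healthy tissue. The sketch below keeps genes whose mean expression differs by at least two-fold in either direction. The two-fold threshold, the gene names, and the values are illustrative assumptions, and real analyses typically add a statistical test.

```python
import math
from statistics import mean

def filter_genes(expression: dict[str, tuple[list[float], list[float]]],
                 min_abs_log2_fold: float = 1.0) -> dict[str, float]:
    """Return genes whose diseased/normal log2 fold change meets the threshold."""
    selected = {}
    for gene, (normal, diseased) in expression.items():
        log_fold = math.log2(mean(diseased) / mean(normal))
        if abs(log_fold) >= min_abs_log2_fold:
            selected[gene] = log_fold
    return selected

# Hypothetical normalized intensities: (normal samples, diseased samples).
expression = {
    "GENE_A": ([100, 120, 80], [400, 380, 420]),   # over-expressed in disease
    "GENE_B": ([200, 210, 190], [205, 195, 200]),  # unchanged
    "GENE_C": ([800, 760, 840], [190, 210, 200]),  # under-expressed in disease
}
print(filter_genes(expression))  # {'GENE_A': 2.0, 'GENE_C': -2.0}
```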
Hierarchical clustering: Hierarchical clustering imposes a hierarchy of aggregation as similarities in expression patterns are identified. This hierarchy provides a platform from which to sort samples according to clinical features, such as certain diseases or conditions that may demonstrate differential expression compared to healthy subjects. The limitation of hierarchical clustering is that errors may occur when unrelated clusters are forced to converge.
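The aggregation process can be sketched with a small agglomerative procedure that repeatedly merges the two most similar clusters. Average linkage with Euclidean distance is assumed here as one common choice; the gene names and expression values are hypothetical.

```python
import math
from itertools import combinations

def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    distances = [math.dist(profiles[a], profiles[b])
                 for a in cluster_a for b in cluster_b]
    return sum(distances) / len(distances)

def hierarchical_clustering(profiles):
    """Merge the closest pair of clusters until one remains; return the merge order."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges

# Hypothetical expression of four genes across two samples.
profiles = {"g1": [1.0, 1.1], "g2": [1.1, 1.0], "g3": [5.0, 5.2], "g4": [5.1, 5.0]}
for step in hierarchical_clustering(profiles):
    print(step)
# (('g1',), ('g2',))
# (('g3',), ('g4',))
# (('g1', 'g2'), ('g3', 'g4'))
```

The final merge joins two clearly dissimilar groups only because the procedure always continues to a single cluster, which illustrates the limitation noted above.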
Self-organizing maps: Self-organizing maps are useful for exploratory analysis of data that lacks uniformity and contains a high percentage of irrelevant information. This method imposes a partial structure on the data. The resulting vectors represent the gene expression groupings, which can also be sorted according to clinical features or diagnoses.8
Proteomics
While much knowledge has been gained from microarray analysis, study at the level of the end product of the genome, the proteome, exposes more details than study of transcribed DNA. Ideally, proteomics characterizes and quantifies all proteins in a specific cell type under a specific set of environmental conditions. Analysis at the level of the protein is necessary because the abundance, activity, and function of proteins are not exclusively determined by gene expression. For example, virtually all eukaryotic proteins undergo post-translational modifications, such as the cleavage or the addition of a chemical group, sugar, or lipid after protein synthesis. These modifications often have functional consequences.
Analyzing proteins and defining a cell’s proteome is considerably more complex than genomic analysis in several ways. Although the genome is relatively constant, the proteome is in constant flux. There is a complex interplay of molecular events that occur between gene activation and synthesis of functional proteins. The global protein complement of a given cell varies with changes in the physiological state of the cell, such as activation of specific cellular signaling pathways, position in the cell cycle, and its ambient environment, including drug exposure. Furthermore, the structure of proteins is more complex and diverse than DNA structure. While DNA consists of 4 basic building blocks, there are 22 unmodified and many more modified amino acids which make up proteins.
Methods for Proteome Analysis
Currently, the preferred methods for proteome investigation are protein separation by two-dimensional PAGE and protein identification and characterization by mass spectrometry.
Two-dimensional PAGE (2-D PAGE): 2-D PAGE separates proteins first by charge (isoelectric point [pI]), then according to molecular mass. While this is the traditional method for protein separation, it does have some limitations. First, the quality and relevance of results from 2-D PAGE depend heavily on experimental design and sample preparation. Also, 2-D gels exhibit variability in observed results, making comparative analyses between laboratories and with archived databases difficult. Finally, low-abundance proteins, membrane proteins, very acidic or very basic proteins, and proteins sequestered in organelles tend to be poorly represented relative to their abundance in the cell. Recent advances in protein separation allow for more reproducible results, enhanced solubilization of membrane proteins, availability of 2-D image analysis software, and automation and high-throughput analysis of protein expression.
Mass Spectrometry (MS): MS is an analytical technique that provides highly accurate molecular mass measurements, and is the method of choice for protein identification and characterization. Protein identification involves matching experimentally derived protein attributes, such as molecular mass, pI, and amino acid sequence, against those predicted from the translation of genomic or cDNA sequences in databases. Currently peptide mass fingerprinting, a type of MS, is the most commonly used method for rapid and sensitive protein identification.
Peptide Mass Fingerprinting (PMF): In PMF, proteins of interest are first isolated by 2-D PAGE, then cleaved in gel or on membranes by enzymatic or chemical methods. For example, the enzyme trypsin is commonly used for cleavage. Next, the masses of the resulting peptides, which reflect the specific amino acid sequence of an individual protein, are measured in a mass spectrometer. Finally, proteins from a nonredundant database are theoretically cleaved (in silico) using the same cleavage rules that were applied to the protein in question, and the predicted peptide masses are compared with the measured ones. A ranking or score is then calculated to provide a measure of the fit between the sample peptide masses and those predicted from the database. The most likely identity of an unknown protein is the candidate with the largest number of matched peptide masses, or “hits”.
The number of candidate proteins is decreased when sample information is maximized. Specifying additional attributes, other than experimental peptide masses and enzymes used to digest the proteins, will increase the probability of accurate protein identification. Some additional attributes include molecular weight and/or pI of the protein, known posttranslational or potential modifications, protein N- and C-terminal sequence tags, and minimum number of matching peptides required for a protein to be a suggested possible match. However, the correct protein may be missed if the search is overly constrained.5
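The matching procedure can be sketched as follows. This is a simplified illustration under several assumptions: the residue masses are approximate monoisotopic values that should be checked against a reference table, the two database sequences are invented, the 0.5 Da tolerance is arbitrary, and the score is a plain hit count rather than the statistical scoring used by real search tools. Trypsin is modeled with its usual rule of cleaving after lysine (K) or arginine (R) unless the next residue is proline (P).

```python
import re

# Approximate monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini

def tryptic_digest(protein: str) -> list[str]:
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def peptide_mass(peptide: str) -> float:
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER

def count_hits(observed: list[float], protein: str, tolerance: float = 0.5) -> int:
    """Number of observed masses matched by a predicted peptide of the protein."""
    predicted = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(m - t) <= tolerance for t in predicted) for m in observed)

def rank_candidates(observed, database, tolerance=0.5):
    scores = {name: count_hits(observed, seq, tolerance) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical database and a simulated measurement of protein_X's peptides.
database = {"protein_X": "GASKVLTRMFEK", "protein_Y": "WWHKPPYRNNQK"}
observed = [peptide_mass(p) + 0.01 for p in tryptic_digest(database["protein_X"])]
print(rank_candidates(observed, database))  # [('protein_X', 3), ('protein_Y', 0)]
```

Adding constraints such as an expected molecular weight or pI would correspond to discarding database entries before scoring, which narrows the candidate list but risks excluding the correct protein if the constraint is wrong.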
Proteomic Databases
A critical component of accurate protein identification through PMF is the availability of non-redundant proteomic databases of amino acid sequences. Bioinformatics makes a significant contribution to proteomic research through the development and maintenance of a vast number of proteome databases. Swiss-Prot is a major database of highly curated protein sequences. It is distinguished from other databases by its high degree of annotation, minimal redundancy, and integration with approximately 60 other databases through cross-referencing. For each sequence entry, the annotation describes function(s), posttranslational modification(s), domains and sites, secondary and quaternary structure, similarities to other proteins, disease(s) associated with deficiencies in the protein, sequence conflicts, and variants.5,8
Software for Screening Peptides
Powerful and comprehensive software tools have been developed for the screening of peptides for PMF. These tools help identify posttranslational modifications and peptides that have resulted from nonspecific chemical or enzymatic cleavage of proteins. Some of these software applications are available online (http://us.expasy.org/tools/#ptm). Among mass spectrometry techniques, matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) MS is emerging as the preferred method for generating the peptide mass data that these tools screen, because it produces predominantly singly charged ions, making results easy to interpret. In addition, it has a large mass range, making it possible to analyze whole proteins and other polymers. MALDI-TOF MS also has a short analysis time, high sensitivity, and high mass accuracy.8
References
1. Tefferi A, Wieben ED, Dewald GW, et al. Primer on Medical Genomics Part II: Background Principles and Methods in Molecular Genetics. Mayo Clinic Proceedings 2002;77:785-808.
2. Wieben ED. Primer on Medical Genomics Part VII: The Evolving Concept of the Gene. Mayo Clinic Proceedings 2003;78:580-587.
3. The National Museum of Health. Genomics: Sequence, sequence, sequence. www.accessexcellence.org. Accessed July 7, 2003.
4. The National Museum of Health. Genomics: Evolution, the original Xerox machine. www.accessexcellence.org. Accessed July 7, 2003.
5. Pardanani A, Wieben ED, Spelsberg TC, Tefferi A. Primer on Medical Genomics Part IV: Expression Proteomics. Mayo Clinic Proceedings 2002;77:1185-1196.
6. Ansell SM, Ackerman MJ, Black JL, et al. Primer on Medical Genomics Part VI: Genomics and Molecular Genetics in Clinical Practice. Mayo Clinic Proceedings 2003;78:307-317.
7. Tefferi A, Bolander ME, Ansell SM, et al. Primer on Medical Genomics Part III: Microarray Experiments and Data Analysis. Mayo Clinic Proceedings 2002;77:927-940.
8. Elkin PL. Primer on Medical Genomics Part V: Bioinformatics. Mayo Clinic Proceedings 2003;78:57-64.