Pure Sciences

Modeling Smad domains and their interaction with Smurf-1, c-Ski and DNA promoter motif to design inhibitory compounds

Transforming Growth Factor-beta (TGF-beta) superfamily members are known for regulating a wide array of cellular processes such as growth, differentiation, proliferation, and apoptosis. Downstream of TGF-beta signaling there are important growth and differentiation factors known as Smad proteins, which carry out the TGF-beta responsive signaling and elicit various responses once inside the nucleus. The goal of this dissertation is to explore the available structural data for some of the molecules involved in the TGF-beta signaling process and to apply state-of-the-art molecular modeling, docking and virtual screening tools and techniques to gain insight into the TGF-beta signaling pathway. This study mainly concentrates on the interaction of Smad proteins with the DNA promoter motif and with the other proteins, c-Ski and Smurf-1, with which they interact in the signaling process. Initially, the MH1 domains of mammalian Smad proteins were modeled based on the known crystal structure of the Smad3 MH1-DNA complex (PDB ID: 1OZJ), followed by modeling of the interaction pose of the MH1 domain of BMP-regulated Smads (Smad1/5/8) with their corresponding DNA sequence motif 5'-GCCG-3'. In this work the key residues of the MH1 domain of Smad1/5/8 interacting with the GCCG motif were identified. To investigate further, the solvent-accessible contact area of key residues and the binding energies of the modeled Smad1/5/8 MH1 with the GCCG DNA motif and the GTCT DNA motif were computed. A higher free energy of binding for Smad1/5/8-MH1 complexed with the nonspecific GTCT DNA motif compared to the GCCG motif confirmed the high specificity of Smad1/5/8 for the GCCG motif, indicating that these Smads may not bind GTCT DNA. Further, a homology modeling approach was followed to build the Smad binding domain of c-Ski, a proto-oncoprotein that acts as a co-repressor in Smad-mediated TGF-beta signaling. Various protein-protein docking methods were applied to study the interactions between the modeled c-Ski domain and the Smad3-MH2 domain. 
Knowledge of biochemical data, contacts observed between key residues, and solvent accessibility calculations for residues of both proteins in our top models were applied to select the four most favored complexes of Smad3-Ski. These complexes can be used to design small molecule inhibitors antagonizing c-Ski binding, which may lead to anti-cancer drug design by appropriately regulating the Smad3-Ski interaction. Besides homology modeling and docking, this thesis work also includes virtual screening of small molecule databases to identify high-scoring lead molecules against the Smad4 binding site of c-Ski and also against the Smad binding site of Smurf1 (a key regulator of BMP-regulated Smad proteins). Two widely used structure-based high-throughput virtual screening methods, GLIDE and GOLD, were applied for the molecular docking studies. The previously identified active site of the Smurf1-WW2 domain, which is known to interact with the PPXY motif in Smad1/5, was targeted in order to inhibit the ubiquitination of the bone-inducing Smads by Smurf1. Because BMP signaling is inhibited by strong Smad4-Ski binding, in the subsequent studies we focused on designing small molecule inhibitors of c-Ski at Smad4 interacting sites. Both virtual screening experiments aim at developing a simple, safe and cost-effective bone-inducing drug that inhibits c-Ski binding to Smad4 and Smurf1 binding to Smad1/5 to increase BMP responsiveness.

Perhaps You will be interested in these papers

Statistical learning and Behrens-Fisher distribution methods for heteroscedastic data in microarray analysis

The aim of the present study is to identify the genes that are differentially expressed between two conditions and to use them to predict the class of new samples from microarray data. Microarray data analysis poses many challenges to statisticians because of its high dimensionality and small sample size, dubbed the “small n large p problem”. Microarray data have been extensively studied by many statisticians and geneticists. They are commonly assumed to follow a normal distribution with equal variances in the two conditions, but this is not true in general. Since the number of replications is very small, the sample estimates of the variances are not appropriate for testing. Therefore, we consider a Bayesian approach to approximate the variances in the two conditions. Because the number of genes to be tested is usually large and the test is repeated thousands of times, there is a multiplicity problem. To remove the defect arising from multiple comparisons, we use the False Discovery Rate (FDR) correction. When the hypothesis test is applied repeatedly, gene by gene, to several thousand genes, there is a great chance of selecting false genes as differentially expressed, even when the significance level is set very small. For the test to be reliable, the probability of selecting true positives should be high. To control the false positive rate, we have applied the FDR correction, in which the p-value of each gene is compared with its corresponding threshold. A gene is then said to be differentially expressed if its p-value is less than the threshold. We have developed a new method of selecting informative genes based on the Bayesian version of the Behrens-Fisher distribution, which assumes unequal variances in the two conditions. 
Since the assumption of equal variances fails in most situations and equal variance is a special case of unequal variance, we have tried to solve the problem of finding differentially expressed genes in the unequal variance case. We have found that the developed method selects the genes that are actually expressed in the simulated data, and we have compared this method with recent methods such as Fox and Dimmic's t-test method and Tusher and Tibshirani's SAM method, among others. The next step of this research is to check whether the genes selected by the proposed Behrens-Fisher method are useful for the classification of samples. Using the genes selected by the proposed method, which combines the Behrens-Fisher gene selection method with other statistical learning methods, we have found better classification results. The reason is the method's capability of selecting genes based on both prior knowledge and the data. In the case of microarray data, due to the small sample size and the large number of variables, the variance estimates obtained from the sample are not reliable, in the sense that they are not positive definite and not invertible. We have therefore derived the Bayesian version of the Behrens-Fisher distribution to remove that insufficiency. The efficiency of this method has been demonstrated by applying it to three real microarray data sets and calculating the misclassification error rates on the corresponding test sets. Moreover, we have compared our results with some of the other popular methods found in the literature, such as the Nearest Shrunken Centroid and Support Vector Machine methods. We have studied the classification performance of different classifiers before and after taking the correlation between the genes into account. The classification performance improved significantly once the correlation was accounted for. The classification performance of the different classifiers has been measured by the misclassification rates and the confusion matrix. 
The other problem in the multiple testing of a large number of hypotheses is the correlation among the test statistics. We have taken the correlation between the test statistics into account. If there were no correlation, the shape of the normalized histogram of the test statistics would not be affected. As shown by Efron, the degree of correlation among the test statistics either widens or shrinks the tails of the histogram of the test statistics. Thus the usual rejection region obtained from the significance level is not sufficient. The rejection region should be redefined according to the degree of correlation. The effect of the correlation on selecting the appropriate rejection region has also been studied.
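
The abstract describes an FDR correction in which each gene's p-value is compared with its own threshold, but it does not name the procedure. As a hedged illustration only, the following minimal sketch assumes the Benjamini-Hochberg step-up procedure; the function name `benjamini_hochberg`, the level `q` and the toy p-values are assumptions made for this example and do not come from the dissertation.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True where the hypothesis is rejected.

    Assumed procedure: Benjamini-Hochberg step-up at FDR level q.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Toy per-gene p-values (hypothetical, not taken from any data set).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
toy_flags = benjamini_hochberg(toy_p, q=0.05)
```

Each ordered p-value gets its own threshold `(rank / m) * q`, which matches the description of a per-gene threshold. Note that this procedure assumes independent or positively dependent tests, so it does not address the correlation issue raised in the last paragraph of the abstract.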

Literature based Bayesian analysis of gene expression data

Microarray technologies are rapidly becoming tools for examining genome-wide expression profiles under different experimental conditions. However, identification of differentially expressed (DE) genes using microarrays remains challenging. Due in part to confounding technical problems and the daunting statistical challenges of resolving them, numerous biological questions remain unresolved. Numerous statistical feature selection methods can be applied to microarray data, but each method generally results in a distinct gene set. Thus, it is difficult to determine which approach generates biologically important gene sets. In this thesis, we developed a novel method to determine the conceptual relationships between genes directly from MEDLINE abstracts using Latent Semantic Indexing (LSI), a variant of the vector space model of information retrieval. We utilized the LSI-derived gene-to-gene literature similarity values to calculate a literature-derived P-value (LPv) for the functional cohesion of various gene sets. The sensitivity and robustness of the LPv method were evaluated using >6000 Gene Ontology classes and a manually selected group of 50 genes. Based on the results, we demonstrate that the false discovery rate (FDR) can be assessed by applying functional coherence among genes. These findings motivate our use of the local false discovery rate (fdr) to identify differentially expressed genes. By integrating LSI-derived gene-gene functional relationships into a spatially correlated normal mixture model generated by Gaussian Markov random fields (GMRF), we show that we can gain more statistical power with the local fdr. Consequently, we are able to identify new genes that were undetected by methods that exist in the current biostatistics and bioinformatics literature.

Systems theory for pharmaceutical drug discovery

Biological networks comprise thousands of interacting components, and these networks have complicated patterns of feedback and feed-forward motifs. It is practically impossible to use intuition to determine whether simultaneously modifying multiple pharmaceutical targets produces a good therapeutic response. Even when a drug is discovered that is safe in humans and highly effective against its target, the medical effect on the disease may be underwhelming. This provides a strong impetus for developing a systems theory for pharmaceutical drug discovery. This thesis discusses system-theoretic tools that are useful for drug discovery. The first class of tools discussed is system identification tools, and case studies of parametric modeling are given. A new statistical system identification procedure that exploits the geometric and hierarchical structure of many biological (and engineering) systems is presented, and this new procedure is applied to engineering and biological systems. The second class of tools discussed is a new set of target selection tools. Given mathematical models of biological networks, these tools select a set of targets for pharmaceutical drugs. The targets are selected to achieve good medical outcomes for patients by reducing the effect of diseases on pathways while ensuring that the targets do not too adversely affect healthy cells. The ultimate goal of the work presented in this thesis is to create a framework that can be used to rationally select new drug targets and to create personalized medicine treatments tailored to the particular phenotypic behavior of an individual’s disease.

Evolution and dynamic behavior of transfer RNA in the first two steps of translation

In protein synthesis, a key component of the cellular machinery is transfer RNA (tRNA). This small nucleic acid is crucial to the maintenance of the genetic code because it discriminately binds the messenger RNA codon at the ribosome and adds the cognate amino acid to the growing polypeptide chain. The role of tRNA as an adaptor molecule has been understood for decades, but details about the charging of tRNA with cognate amino acids prior to entering the ribosome are still emerging. Aminoacyl-tRNA synthetases (aaRSs) are enzymes that recognize specific tRNAs and amino acids from the cellular pool and facilitate the charging of the correct amino acids on tRNAs. Following aminoacylation, tRNAs dissociate from the aaRSs and bind the elongation factor Tu (EF-Tu) for delivery to the ribosome. The recognition of specific tRNA species by the aaRSs, EF-Tu, and other enzymes along the translation pathway is based on sets of highly conserved nucleotides within different groups of tRNA species. Previous work to identify these recognition elements has focused on experimental studies of single organisms. Here, bioinformatic analyses are used to predict recognition elements for groups of tRNA organized by domain of life and specificity. Shannon entropy differences between evolutionary profiles of tRNA domain/specificity groups and the representatives of all tRNA species reveal the uniquely conserved nucleotides within each tRNA domain/specificity, consistent with experiment. Comparative analysis of consensus sequences for these evolutionary profiles is used to locate tuning elements, also consistent with experiment. The discriminator base and the G53-C63 base pair are identified as conserved in several tRNA domain/specificities, particularly among Archaea. Both sets of predictions expand on the current knowledge of recognition elements, providing suggestions for new mutation studies. 
AaRS-tRNA complex formation and the aminoacylation reaction are well characterized through many high-resolution crystal structures and biochemical assays, but dissociation of the charged tRNA with subsequent binding to EF-Tu is not well understood. Using molecular modeling and molecular dynamics simulations, the effects of protonation states and the presence or absence of substrates and EF-Tu on tRNA release are explored. Using multiple dynamics and energetics analyses, the migration of protons from the 3' end of the tRNA and the alpha-ammonium group on the charging amino acid is shown to accelerate tRNA dissociation. The presence of AMP has only a minimal effect. Further, pKa calculations predict that Glu41, a conserved residue binding the alpha-ammonium group of the charging amino acid, is part of a proton relay system for releasing the charging amino acid upon transfer. This system is conserved in both structure and sequence across homologous aaRSs and may represent a universal handle for binding and releasing the charging amino acid. Addition of EF-Tu to the aaRS-tRNA complex stimulates tRNA dissociation. Knowledge of the exit strategies leads to a greater understanding of tRNA dynamics between the first two steps of translation.
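
The Shannon entropy comparison mentioned in the first half of this abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so the code below assumes the simplest version: per-position entropy of an aligned group of sequences, subtracted from the per-position entropy of the full set. The function names and the four toy sequences are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_profile(alignment):
    """Per-position entropy for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*alignment)]


def entropy_difference(group, everything):
    """Entropy of the full set minus entropy of the group, per position.

    A large positive value marks a position that varies across all
    sequences but is conserved within the group.
    """
    group_h = entropy_profile(group)
    all_h = entropy_profile(everything)
    return [a - g for g, a in zip(group_h, all_h)]


# Toy aligned sequences (hypothetical, far shorter than a real tRNA).
all_seqs = ["GCAU", "GGAU", "GCCU", "GUGU"]
one_group = ["GCAU", "GCCU"]
toy_diff = entropy_difference(one_group, all_seqs)
```

In this toy case the second position stands out: it is variable across the full set but identical within the group, which is the kind of signal a uniquely conserved nucleotide would produce under this assumed formulation.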

Statistical models for haplotyping complex human diseases with a family-based design

It has long been recognized that many human diseases involve the action of multiple genes and nongenetic factors and also show strong correlation among relatives. Because of this complexity, genetic mapping with a family-based design (including parents and offspring) is particularly needed for identifying the genes involved in human diseases and their inheritance. In this dissertation, I explore several fundamental aspects of family data in constructing the linkage disequilibrium map of the human genome and fine mapping disease genes. A library of statistical models has been derived to estimate and test the pattern of gene segregation in a natural population and the genetic effects of haplotypes on complex diseases. Because the genetic information of interest to population and biomedical genetic studies cannot be observed directly, a series of mixture models, proven powerful for solving missing data problems, have been built within the family design. These models generate a number of testable hypotheses about the genetic control of complex diseases. Specifically, this dissertation presents solutions to genetic and statistical problems in the following ways: 1) Construct a multilocus population and multilocus quantitative genetic model with SNP data: The models proposed allow the test of high-order disequilibria on the diversity of a natural population and of crossover interference on the transmission of genes during meiosis. By tracing the path of gene transmission from different parents, the models provide a way of quantitatively testing genetic imprinting effects on human diseases. 2) Develop a new approach for estimating linkage disequilibria at the zygote level: The family design is capable of separating the diplotypes that form the same heterozygote and, thereby, of estimating gametic and non-gametic disequilibria and trigenic and quadrigenic disequilibria. 
The new approach relaxes the Hardy-Weinberg equilibrium assumption for a population and extends the concept of linkage disequilibrium mapping to any nonequilibrium population. 3) Derive a series of closed forms for the EM algorithm: These algorithms are shown to be robust for estimating population genetic parameters (including haplotype frequencies and linkage disequilibria of various orders), gene transmission parameters (including the recombination fractions and crossover interference), and quantitative genetic parameters (including additive, dominant, and imprinting effects of haplotypes). The accuracy and precision of parameter estimates are investigated through simulation studies. The dissertation provides a handful of state-of-the-art technologies for genetic mapping of human diseases with commonly used family-based designs. These technologies, coupled with empirical and laboratory studies, will help to predict the occurrence and progression of a disease using the information about its underlying genes and biological pathways.
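
To make the idea of a closed-form EM update for haplotype frequencies concrete, here is a minimal sketch of the classic two-locus case. It is a deliberate simplification and not the dissertation's model: it assumes unrelated individuals, two biallelic loci (alleles A/a and B/b) and Hardy-Weinberg equilibrium, whereas the abstract describes family-based designs that relax that assumption. The function names and genotype counts are hypothetical.

```python
def em_haplotype_freqs(genotype_counts, max_iter=200, tol=1e-10):
    """Estimate AB, Ab, aB, ab haplotype frequencies by EM.

    genotype_counts maps (n_A, n_B) to a number of individuals, where n_A
    and n_B in {0, 1, 2} count copies of alleles A and B. Only the double
    heterozygote (1, 1) has an ambiguous phase: AB/ab or Ab/aB.
    """
    n = sum(genotype_counts.values())
    # Haplotype counts contributed by genotypes whose phase is known.
    base = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
    for (n_a, n_b), count in genotype_counts.items():
        if (n_a, n_b) == (1, 1):
            continue
        if n_a == 1:  # heterozygous at locus 1, homozygous at locus 2
            second = "B" if n_b == 2 else "b"
            base["A" + second] += count
            base["a" + second] += count
        else:  # homozygous at locus 1
            first = "A" if n_a == 2 else "a"
            base[first + "B"] += count * n_b
            base[first + "b"] += count * (2 - n_b)
    n_dh = genotype_counts.get((1, 1), 0)
    freqs = {h: 0.25 for h in base}
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is AB/ab.
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: closed-form update from expected haplotype counts.
        counts = dict(base)
        counts["AB"] += n_dh * w
        counts["ab"] += n_dh * w
        counts["Ab"] += n_dh * (1 - w)
        counts["aB"] += n_dh * (1 - w)
        new_freqs = {h: c / (2 * n) for h, c in counts.items()}
        change = max(abs(new_freqs[h] - freqs[h]) for h in freqs)
        freqs = new_freqs
        if change < tol:
            break
    return freqs


def linkage_disequilibrium(freqs):
    """Gametic disequilibrium D = p_AB * p_ab - p_Ab * p_aB."""
    return freqs["AB"] * freqs["ab"] - freqs["Ab"] * freqs["aB"]


# Hypothetical genotype counts: 4 AABB, 4 AaBb and 2 aabb individuals.
example_freqs = em_haplotype_freqs({(2, 2): 4, (1, 1): 4, (0, 0): 2})
example_d = linkage_disequilibrium(example_freqs)
```

The E-step treats the unknown phase of double heterozygotes as missing data, which is the role the mixture models play in the abstract; the M-step is a closed-form ratio of expected counts, so no numerical optimizer is needed.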

Functional mapping of dynamic systems

The dynamic pattern of viral load in a patient’s body critically depends on the host’s genes. For this reason, the identification of the genes responsible for virus dynamics, although difficult, is of fundamental importance for designing an optimal drug therapy based on a patient’s genetic makeup. Here, we present a differential equation (DE) model for characterizing specific genes or quantitative trait loci (QTLs) that affect viral load trajectories within the framework of a dynamic system. The model is formulated with the principle of functional mapping, originally derived to map dynamic QTLs, and implemented with a Markov chain process. The DE-integrated model enhances the mathematical robustness of functional mapping, its quantitative prediction of the temporal pattern of genetic expression, and therefore its practical utility and effectiveness for gene discovery in clinical settings. The model was used to analyze simulated data for viral dynamics, with the aim of investigating its statistical properties and validating its usefulness. With the increasing availability of genetic polymorphism data, the model will have great implications for probing the molecular genetic mechanisms of virus dynamics and disease progression. This thesis consists of five chapters. In Chapter 1 we briefly summarize the importance of studying dynamic systems from the genetic viewpoint. In Chapter 2 we develop the general framework for virus dynamic models. In Chapter 3 we focus on drug resistance, with parameters having a Bayesian structure. Chapter 4 discusses the EM algorithm for the mixture models used in Chapters 2 and 3. Some useful results are given with rigorous mathematical proofs, which guarantee the correctness of the algorithm. In the final chapter, Chapter 5, we discuss ongoing research and future work.

Analysis of high-throughput biological data: Some statistical problems in RNA-seq and mouse genotyping

The many areas of research of high-throughput computational biology provide endless opportunities for methodological contributions by statisticians. In this thesis, we present results in two main areas, one just emerging and one well-established. In Part I of this thesis, we present new results related to the analysis of high-throughput sequencing data. The last year or so has seen the emergence of many new technologies aimed at enabling the massively parallel sequencing of many molecules of DNA simultaneously. This technological leap forward has enabled scientists to conduct exciting experiments that were impossible with previous technologies, and statisticians are being flooded with new data to analyze. We focus on two analytical problems related to new short-read sequencing technologies, each aimed at a different aspect of the goal of quantifying gene expression using sequencing. First, we present a new method aimed at determining which gene a particular sequence fragment originated from, in order to obtain better unbiased estimates of gene expression. Second, we develop a new empirical Bayes test statistic aimed at measuring differential gene expression between two samples which have been sequenced. Both problems combine fundamental statistical concepts with cutting-edge biology research. Part II of this thesis focuses on genetic analysis of the mouse model organism, a more established area of both biological and statistical inquiry. We present an analysis of the performance of a high-throughput microarray in measuring genotype information in a pooled set of mice, for the purposes of detecting a disease-carrying mutation locus. This problem combines relatively new technological advances with classical theories of linkage analysis.

A hierarchical spherical radial quadrature algorithm for multilevel GLMMs, GSMMs, and gene pathway analysis

The first part of my thesis is concerned with estimation for longitudinal data using generalized semi-parametric mixed models and multilevel generalized linear mixed models for a binary response. Likelihood-based inference is hindered by the lack of a closed-form representation. Consequently, various integration approaches have been proposed. We propose a spherical radial integration based approach that takes advantage of the hierarchical structure of the data, which we call the 2 SR method. Compared to Pinheiro and Chao's multilevel Adaptive Gaussian quadrature [37], our proposed method has an improved time complexity, with the number of function evaluations scaling linearly in the number of subjects and in the dimension of random effects per level. Simulation studies show that our approach has similar or better accuracy compared to Gauss-Hermite Quadrature (GHQ) and better accuracy compared to PQL, especially in the variance components. The second part of my thesis is concerned with identifying differentially expressed gene pathways/gene sets. We propose a logistic kernel machine to model the gene pathway effect with a binary response. Kernel machines were chosen since they account for gene interactions and clinical covariates. Furthermore, we established a connection between our logistic kernel machine and GLMMs, allowing us to use ideas from the GLMM literature. For estimation and testing, we adopted Clarkson's spherical radial approach [6] to perform the high-dimensional integrations. For estimation, our performance in simulation studies is comparable to or better than Bayesian approaches at a much lower computational cost. As for testing of the genetic pathway effect, our REML likelihood ratio test has increased power compared to a score test for simulated non-linear pathways. 
Additionally, our approach has three main advantages over previous methodologies: 1) our testing approach is self-contained rather than competitive, 2) our kernel machine approach can model complex pathway effects and gene-gene interactions, and 3) we test for the pathway effect adjusting for clinical covariates. Motivation for our work is the analysis of an Acute Lymphocytic Leukemia data set where we test for the genetic pathway effect and provide confidence intervals for the fixed effects.

Improving statistical methods in biological pathway analysis

The integrated analysis of genetic and biological pathway data is crucial to the understanding of systems biology. The components of biological processes are affected by the levels of expression of genes, which control the production of proteins. The presence or absence of specific proteins leads to disruptions in metabolic or signaling pathways, which affect the stability of an organism's biological system. Identifying where differentially expressed genes control the behavior of reactions or signals in pathways is a computationally and biologically complex statistical challenge. A plethora of statistical methods are available to quickly ascertain which genes studied in an experiment are differentially expressed (DE) between varying biological conditions. DE genes can then be mapped to biological pathways or networks to discover where they influence reactions and signals in pathways. Statistical methods developed in attempts to make such discoveries determine where DE genes are over-represented in pathways, but unfortunately do not generally acknowledge the structure of these pathways. This omission of biologically relevant information is a crucial mistake made by the statisticians who develop the methods and the biologists who use them. Over-representation methods also exhibit a sample size bias in that small p-values are not easily obtained for pathways involving few genes. Additionally, valuable information is lost when gene p-values, z-scores, fold changes, or other measures are divided into dichotomous groups. The direction and magnitude of a gene measure provide more evidence of true differential expression and should be included in any pathway analysis. This dissertation summarizes several existing pathway analysis methods, including over-representation methods, methods that utilize other available gene measures, and methods that incorporate pathway structure. 
Because the pathways analyzed in most methods are subjectively defined by their starting and stopping points, and thus may contribute to incorrect results, a reconstruction of pathways is proposed. A new method called Weighted-Averages for Reconstructed Pathways (Path WeAveRs) is introduced for use on the newly rearranged “pathways”. The statistical performance and biological relevance of Path WeAveRs are investigated in comparison with other methods. Results indicate that Path WeAveRs produces biologically meaningful pathways and is a viable alternative to existing pathway analysis methods.
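
The over-representation methods criticized in this abstract are commonly implemented as a one-sided hypergeometric test. The sketch below illustrates that baseline, not Path WeAveRs itself, which the abstract does not specify; the function name, argument names and gene counts are assumptions made for the example.

```python
from math import comb


def overrepresentation_p_value(n_genes, n_de, pathway_size, de_in_pathway):
    """P(X >= de_in_pathway) for X ~ Hypergeometric.

    n_genes:       genes measured in the experiment
    n_de:          genes called differentially expressed
    pathway_size:  measured genes that belong to the pathway
    de_in_pathway: DE genes that belong to the pathway
    """
    upper = min(n_de, pathway_size)
    total = comb(n_genes, n_de)
    # Sum the upper tail; comb returns 0 for impossible configurations.
    tail = sum(
        comb(pathway_size, k) * comb(n_genes - pathway_size, n_de - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 genes, 3 DE, a 4-gene pathway holding 2 DE genes.
toy_p_value = overrepresentation_p_value(10, 3, 4, 2)
```

The sample size bias mentioned in the abstract follows from the formula: for a pathway with very few genes, even the most extreme outcome (every pathway gene is DE) leaves the tail probability bounded away from zero, so such pathways can rarely reach a small p-value. The test also uses only the DE/non-DE dichotomy and ignores pathway structure, which are the other two limitations the abstract names.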
