| Title | Efficient computation of regularities in strings and applications |
| Abstract | The first part of the thesis is the development of space- and time-efficient algorithms (RPT) for nonextendible (NE) and supernonextendible (SNE) repeats, shown to be more efficient than previous methods based on tests using different real data sets. In particular, we describe four variants of a new fast algorithm RPT1 that, based on suffix array construction, computes all the complete NE repeats in a given string x whose length (period) p ≥ pmin, where pmin ≥ 1 is a user-specified minimum. RPT1 uses 5n bytes of space directly, but requires the LCP array, whose construction needs 6n bytes. The variants RPT1-3 and RPT1-4 execute in O(n) time independent of alphabet size and are faster than the two other algorithms previously proposed for this problem. To provide a basis of comparison for RPT1, we also describe a straightforward algorithm RPT2 that computes complete NE repeats without any recourse to suffix arrays and whose total space requirement is only 5n bytes; however, this algorithm is slower than RPT1. Furthermore, we describe new fast algorithms RPT3 for computing all complete SNE repeats in x. Of these, RPT3-2 executes in Θ(n) time independent of alphabet size, and is thus asymptotically faster than the methods previously proposed. We conclude with a brief discussion of applications to bioinformatics and data compression. The second part of the thesis deals with the issue of finding the NE multirepeats in a set of N strings of average length n under various constraints. A multirepeat is a repeat that occurs at least m times (m ≥ 2) in each of at least q ≥ 1 strings in a given set of strings. We show that RPT1 can be extended to locate the multirepeats based on the investigation of the properties of the multirepeats and various strategies. We describe algorithms to find complete NE multirepeats, first with no restriction on “gap length” (that is, the gap between occurrences of the multirepeat), then with bounded gaps. 
For the first problem, we propose two algorithms with worst-case time complexities O(Nn + α log₂ N) and O(Nn + α) that use 9Nn and 10Nn bytes of space, respectively, where α is the alphabet size. For the second problem, we describe an algorithm with worst-case time complexity O(RNn) that requires approximately 10Nn bytes, where R is the number of multirepeats output. We remark that if we set the min and max constraints on gaps equal to zero in this algorithm, we can find all repetitions (tandem repeats) in arbitrary subsets of a given set. We demonstrate that our algorithms are faster, more flexible and much more space efficient than algorithms recently proposed for this problem. Finally, the third part of the thesis provides a convenient framework for comparing the LZ factorization algorithms which are used in the computation of regularities in strings rather than in the traditional application to text compression. LZ factorization is the computational bottleneck in numerous string processing algorithms, especially in regularity studies, such as computing repetitions, runs, repeats with fixed gap, branching repeats, sequence alignment, local periods, and data compression. Since 1977, when Ziv and Lempel described a kind of string factorization useful for data compression, there has been a succession of algorithms proposed for computing “LZ factorization”. In particular, there have been several recent algorithms proposed that extend the usefulness of LZ factorization, especially to the computation of runs in a string x. We choose these algorithms, analyze each one separately, and compare some of their important aspects, for example, additional space required and handling mechanism. We also address their output format differences and some special features. We then provide a complete theoretical comparison of their time and space efficiency. 
We conduct intensive testing of both time and space performance and analyze the results carefully to determine the situations in which these algorithms perform best. (Abstract shortened by UMI.) |
| Category | Pure Sciences |
| Subject | Computer Science |
| Pages | 178 |
| Language | English |
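The abstract says that RPT1 is built on a suffix array and an LCP array. The following is a minimal sketch of that general idea only, not the thesis's RPT1: it finds every substring of a fixed length p that occurs at least twice, and it does not filter for nonextendibility. The function names (`suffix_array`, `lcp_array`, `repeats_of_length`) are hypothetical, and the suffix array is built by naive sorting rather than by a linear-time construction.

```python
def suffix_array(x):
    # Naive construction by sorting suffixes; fine for a sketch,
    # but far slower than the constructions a real implementation would use.
    return sorted(range(len(x)), key=lambda i: x[i:])


def lcp_array(x, sa):
    # Kasai-style computation: lcp[r] is the length of the longest common
    # prefix of the suffixes starting at sa[r - 1] and sa[r]; lcp[0] = 0.
    n = len(x)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and x[i + h] == x[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def repeats_of_length(x, p):
    # Map each substring of length p that occurs at least twice
    # to the sorted list of its starting positions.
    sa = suffix_array(x)
    lcp = lcp_array(x, sa)
    n = len(x)
    out = {}
    r = 1
    while r < n:
        if lcp[r] >= p:
            start = r - 1
            # Extend over the maximal run of adjacent suffixes sharing p letters.
            while r < n and lcp[r] >= p:
                r += 1
            out[x[sa[start]:sa[start] + p]] = sorted(sa[start:r])
        else:
            r += 1
    return out


if __name__ == "__main__":
    print(repeats_of_length("abaababa", 2))
```

Adjacent suffixes in the suffix array that share at least p letters form one group, so a single left-to-right scan of the LCP array is enough to collect every repeated substring of length p together with its positions.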
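The third part of the abstract concerns LZ factorization. As a rough illustration, the sketch below computes one common variant by brute force: each factor is the longest prefix of the remaining text that also starts at an earlier position (overlap with the factor itself is allowed), or a single new letter if there is none. The choice of variant, the function name `lz_factorize`, and the output format `(start, length, source)` are assumptions made for this example; the algorithms compared in the thesis are far more efficient and may differ in output format.

```python
def lz_factorize(x):
    # Returns a list of (start, length, source) triples.
    # source is the earlier position the factor copies from,
    # or None when the factor is a letter not seen before.
    n = len(x)
    factors = []
    i = 0
    while i < n:
        best_len, best_pos = 0, None
        for j in range(i):
            # Length of the match between x[j:] and x[i:]; j + k < i + k < n,
            # so indexing x[j + k] is always in range.
            k = 0
            while i + k < n and x[j + k] == x[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            factors.append((i, 1, None))
            i += 1
        else:
            factors.append((i, best_len, best_pos))
            i += best_len
    return factors


if __name__ == "__main__":
    x = "abaababa"
    print([x[s:s + length] for s, length, _ in lz_factorize(x)])
```

This brute-force version takes quadratic time or worse, which is one way to see why the abstract calls LZ factorization a computational bottleneck and why faster algorithms for it are worth comparing.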