A string is one of the most fundamental and important structures in computing, bioinformatics and mathematics. Generally, we define a string as a sequence of simple elements letters) drawn from a set or alphabet. Unlike regular strings, an indeterminate string has not a single letter, but possibly a set of letters in each positionï¼› we call them indeterminate letters. An indeterminate string can essentially be viewed as a string of sets. Indeterminate strings have potentially many applications. In bioinformatics, a set of sequences with segregating sites [CGIï¼‹07] can naturally be represented by an indeterminate string. In human languages, two words or characters might have exactly the same meaning even though they come from two different languages and have different Unicode index. A search engine might make use of this fact and increase its effectiveness in outputting results. In this thesis we investigate existing results and propose new efficient algorithms that process indeterminate strings. The topic we study spans all three kinds of patterns generic, intrinsic and specific). We hope our research on indeterminate strings will contribute to the field of stringology as well as bioinformatics and information retrieval. Periodicity properties of indeterminate strings Generic patterns). First of all, the periodicity of indeterminate strings is different from conventional strings. For example, the periodicity lemma [FW65] is one of the most fundamental properties in the field of string regularity. Yet it does not hold directly for indeterminate strings. Iliopoulos et al. [IMM ï¼‹03] studied string periodicity for dont care letters. Berstel et al. [BB99] and Blanchet-Sadri et al. [BH02, Bla04b] studied periodicity properties for partial words. We propose new proofs and efficient algorithms based on elementary Euclidian-style recursion [SW09b]. Pattern-matching algorithms of indeterminate strings based on letter comparison Specific patterns). Exact pattern-matching algorithms have been extensively studied in the last three decades and many algorithms exist. However, not too many known algorithms work for indeterminate pattern-matching, except for some algorithms that use bit-array technology such as the ShiftAnd algorithm [Dom68, WM92, BYG92] and its derivative BNDM [NR98]. However as we will see, even these algorithms can only handle some special cases of indeterminate pattern-matching. We present several new efficient and more general indeterminate pattern-matching algorithmsï¼› including iBMS [HSW05, HSW08], iFJS [HSW06] and an adaptive hybrid algorithm [SWY08]. Pattern-matching algorithm of indeterminate strings based on convolution Specific patterns). We also present a new algorithm to perform pattern-matching on some special versions of indeterminate pattern-matching using the convolution method. Using techniques such as the Fast Fourier Transform, this method is totally different from the pattern-matching algorithms based on letter comparison discussed above. Prefix array of indeterminate strings Intrinsic patterns). Next we describe a new data structure called the prefix array [ML84, CHL01, CHL07, SW08] for both regular and indeterminate strings. We first show how to construct a boolean matrix to store all the borders and therefore all the periods) of all the prefixes of a string, and then show how to compress this matrix into a space-efficient prefix array without loss of useful information. We design an efficient algorithm to compute the prefix array of a regular or indeterminate string. We also study the opposite problem: given an integer array, how to identify whether it is a valid prefix array of some string, and construct such a string with minimum lexicographical order. Squares & runs of indeterminate strings Generic patterns). We also describe a new and promising data structure proposed by Blanchet-Sadri et al. that stores all the squares repeating substrings) in an indeterminate string. We present an improved efficient way of computing this table. We then show how this table enables us to retrieve all the weak) runs of an indeterminate string. Abstract shortened by UMI.)
How to get this paper's electronic documents?
1, Click the "Buy Now" button to complete the online payment
2, Download the paper's electronic document from the successful payment return page/Or the system will send this paper's electronic document to your E-Mail within 24 hours
|Favorite||ADD TO FAVORITE|
Perhaps You will be interested in these papers
2012-03-08 On generalizations of Gowers norms