Analysis of sibling time series data: alignment and difference detection
    Jennifer Listgarten, Ph.D. thesis, University of Toronto, 2006.

    The full thesis PDF is available here (large file, 60 MB); alternatively, download the table of contents and Chapter 1 first.



    Related Code: Matlab code for the EM-CPM can be found here, along with some LC-MS data sets used in my thesis and papers.

    Thesis Abstract:

    Many practical problems over a wide range of domains require synthesizing information from time series data. Two distinct, yet related, problems in time series data are those of alignment and difference detection. These tasks may be coupled together so that a solution to one is difficult without a solution to the other.

    We introduce a unified, probabilistic approach to the problems of alignment and of alignment with difference detection. This approach takes the form of a class of models called 'Continuous Profile Models' for simultaneously analyzing sets of sibling time series -- those which contain shared sub-structure, but which may also differ. In this type of generative model, each time series belonging to one class is generated as a noisy transformation of a single 'latent trace' in the model. A latent trace can be viewed as an underlying, noiseless representation of the set of observable time series belonging to one class, and is learned from the data. If multiple classes of data exist, then one latent trace per class is learned, and these are aligned to each other during inference. The latent traces lie at the core of this class of models, and provide the basis for alignment and difference detection.
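    To make the generative idea concrete, the following is a minimal, illustrative sketch, not the model formulation from the thesis. It assumes a hypothetical two-peak latent trace and draws each observed series by walking monotonically through that trace (a simple time warp), applying one global intensity scale per series, and adding Gaussian noise. The function names, the step-size warp, the uniform scale range, and the noise level are all assumptions chosen for illustration.

```python
import math
import random


def make_latent_trace(length):
    """Hypothetical noiseless latent trace: two Gaussian-shaped peaks."""
    return [
        math.exp(-((t - 0.3 * length) ** 2) / (2 * (0.04 * length) ** 2))
        + 0.6 * math.exp(-((t - 0.7 * length) ** 2) / (2 * (0.05 * length) ** 2))
        for t in range(length)
    ]


def sample_sibling(latent, n_obs, rng, max_step=3, noise_sd=0.02):
    """Draw one observed series as a noisy, warped, rescaled copy of `latent`.

    All distributions here are illustrative assumptions, not the thesis's choices.
    """
    scale = rng.uniform(0.5, 2.0)  # one global intensity scale for this series
    pos = 0
    series, path = [], []
    for _ in range(n_obs):
        # Monotone time warp: advance 1..max_step latent positions per
        # observation, clamped to the last latent position.
        pos = min(pos + rng.randint(1, max_step), len(latent) - 1)
        path.append(pos)
        series.append(scale * latent[pos] + rng.gauss(0.0, noise_sd))
    return series, path, scale


rng = random.Random(0)  # fixed seed so the sketch is reproducible
latent = make_latent_trace(200)
siblings = [sample_sibling(latent, 100, rng) for _ in range(3)]
```

    In this sketch each sibling series shares the peaks of the latent trace but differs in timing and intensity. Alignment corresponds to recovering each series' `path` and `scale`, together with the latent trace itself, from the observed series alone.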

    Our approach to alignment has several benefits over traditional approaches. It provides a principled method for finding parameters in the model, such as the reference template and error/distance function, rather than specifying these in an a priori and/or ad hoc manner. It simultaneously aligns all data in one go, rather than aligning them in a greedy, incremental fashion. It corrects scaling of signal intensity while performing alignment. Additionally, the probabilistic framework allows us the option of using fully Bayesian inference, if desired, so that we may gauge uncertainty in our model parameters, integrate out model parameters, and avoid cross-validation, which can be problematic with limited data. Lastly, the CPM is the first model, to our knowledge, to tackle the simultaneous problems of alignment and difference detection.

    We focus on liquid chromatography-mass spectrometry (LC-MS) proteomics data to examine and demonstrate our methods, although the methods are not confined to this domain.


    Supervisors: Sam Roweis and Radford Neal
    Other committee members: Brendan Frey, Andrew Emili
    External examiner: Kevin Murphy