Obtain the entire thesis PDF here (very large file, 60 MB), or download
the table of contents and Chapter 1 first.
Related Code:
MATLAB code for the EM-CPM can be found here, along with some
LC-MS data sets used in my thesis and papers.
Thesis Abstract:
Many practical problems over a wide range of domains require
synthesizing information from time series data. Two distinct, yet
related, problems in time series data are those of alignment and
difference detection. These tasks may be coupled together so that
a solution to one is difficult without a solution to the other.
We introduce a unified, probabilistic approach to the problems of
alignment and of alignment with difference detection. This
approach takes the form of a class of models called
'Continuous Profile Models' for simultaneously analyzing sets of
sibling time series -- those which contain shared sub-structure, but
which may also differ. In this type of generative model, each time
series belonging to one class is generated as a noisy transformation
of a single 'latent trace' in the model. A latent trace can be
viewed as an underlying, noiseless representation of the set of
observable time series belonging to one class, and is learned from the
data. If multiple classes of data exist, then one latent trace per
class is learned, and these are aligned to each other during
inference. The latent traces lie at the core of this class of models,
and provide the basis for alignment and difference detection.
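The generative idea described above can be illustrated with a minimal sketch. It is not the thesis implementation (the related code is in MATLAB) and it makes several simplifying assumptions that do not come from the thesis: the latent trace is a hand-made curve rather than a learned one, time warping is a forward walk with jumps of 1, 2, or 3 latent steps, each series gets one global intensity scale, and the noise is Gaussian. The names `latent_trace` and `sample_series` are hypothetical.

```python
import math
import random


def latent_trace(length):
    """Hypothetical latent trace: two smooth peaks on a flat baseline."""
    return [
        math.exp(-((t - 0.3 * length) ** 2) / (2 * (0.03 * length) ** 2))
        + 0.6 * math.exp(-((t - 0.7 * length) ** 2) / (2 * (0.05 * length) ** 2))
        for t in range(length)
    ]


def sample_series(trace, n_obs, scale, noise_sd, rng):
    """Sample one observed series as a noisy, warped, rescaled copy of `trace`.

    The hidden time state advances by 1, 2, or 3 latent steps per observation
    (an assumption made for this sketch), so the map from observed time to
    latent time never moves backwards.
    """
    last = len(trace) - 1
    state = 0
    states, series = [], []
    for _ in range(n_obs):
        states.append(state)
        # Emit the latent value at the current time state, rescaled and noised.
        series.append(scale * trace[state] + rng.gauss(0.0, noise_sd))
        # Move forward along the latent trace, clamping at its final point.
        state = min(state + rng.choice([1, 2, 3]), last)
    return states, series


rng = random.Random(0)
z = latent_trace(400)
# Three "sibling" series from one class: same latent trace, different
# warps, intensity scales, and noise draws.
replicates = [
    sample_series(z, 200, scale=rng.uniform(0.8, 1.2), noise_sd=0.02, rng=rng)
    for _ in range(3)
]
```

Alignment then amounts to running this process in reverse: given only the observed series, infer the hidden time states (and scales) that map each series onto the shared latent trace, while also estimating the trace itself.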
Our approach to alignment has several benefits over traditional
approaches. It provides a principled method for finding
parameters in the model, such as the reference template and
error/distance function, rather than specifying these in an
a priori and/or ad hoc manner. It simultaneously
aligns all data in one go, rather than aligning them in a greedy,
incremental fashion. It corrects scaling of signal intensity while
performing alignment. Additionally, the probabilistic framework
allows us the option of using fully Bayesian inference, if
desired, so that we may gauge uncertainty in our model parameters,
integrate out model parameters, and avoid cross-validation, which can
be problematic with limited data. Lastly, the CPM is the first model,
to our knowledge, to tackle the simultaneous problems of alignment and
difference detection.
We focus on liquid chromatography-mass spectrometry (LC-MS) proteomics
data to examine and demonstrate our methods, although the methods are
not confined to this domain.
Supervisors:
Sam Roweis and Radford Neal
Other committee members:
Brendan Frey, Andrew Emili
External examiner:
Kevin Murphy