Abstract:
With advances in science and technology, data has grown in both sample size
and dimensionality. High-dimensional data arise in areas such as genomics, text
analysis, image retrieval, and bioinformatics. One of the major problems in
handling such data is that not all features are equally important. Hence,
feature engineering, feature selection, and feature reduction are considered
important pre-processing tasks that discard redundant and irrelevant features
while preserving the prominent features of the data as much as possible. In
practice, feature selection often improves the accuracy of downstream machine
learning tasks, including clustering and classification.
In this thesis, we aim to devise novel and robust feature selection mechanisms
for diverse application domains, with a special focus on high-dimensional
biological data such as gene expression and single-cell transcriptomic data. We
develop a series of feature selection techniques with structure-aware data
sampling at their core. We adopt several concepts from statistics (e.g. the
copula and its variant), information theory (entropy), and advanced machine
learning (variational graph autoencoder, generative adversarial network, and its
variant) to design feature selection models for high-dimensional and noisy
data. The proposed models perform well in both supervised and unsupervised
settings, even when the sample size is very small. Important outcomes from all
the proposed methods are discussed in their respective chapters. Moreover, we
provide an overall discussion of the applicability of the methods, along with a
brief account of their shortcomings. We also offer suggestions for overcoming
these shortcomings, which point to directions for future improvement of the
devised methods.