Abstract:
The problem of clustering aims to partition unlabeled data so as to reveal the natural affinities between data instances. Modern learning algorithms must be designed to handle larger datasets, which can also be high dimensional. While acquiring more features and instances can be beneficial, noisy and irrelevant features can obfuscate the true structure of the data, and distance metrics can fail in high dimensions. To address these challenges, complex mathematical structures can be used to model different aspects of a problem; however, they can also lead to algorithms with high computational costs, making them infeasible for larger datasets.
Among existing classes of clustering methods, we focus on center-based clustering, whose methods generally have low computational costs that scale linearly with dataset size. We identify several factors that influence how effective center-based clustering methods can be.
Estimating the number of clusters remains a challenge; we study existing approaches spanning a wide range of computational costs, and propose two low-cost approaches based on two possible definitions of a cluster. Selecting a suitable distance metric for clustering is another important factor: we incorporate a kernel metric into a center-based clustering method and investigate its performance in the presence of a large number of clusters. Feature selection and feature extraction methods exist to identify which features help estimate the clusters; we focus on sparse clustering methods and propose an approach with significantly lower computational cost that selects features while clustering. The nature of the identified clusters also matters: hard clustering methods assign each data point to exactly one discrete cluster, whereas soft clustering methods allow a point to be assigned to more than one cluster, so overlapping clusters can be identified. We propose a multi-objective evolutionary fuzzy clustering method that can identify partitions at different degrees of overlap.
Finally, clustering in fully unsupervised conditions has an inherent limitation: the method must explore a wide solution space with no guidance. Some additional supervision can bias the method toward clustering solutions that better fit a dataset. This motivates us to propose a transfer clustering method that learns a multiple-kernel metric in a weakly supervised setting, and then transfers the learned metric to cluster a dataset in an unsupervised manner.
Weak supervision requires far less effort to provide than full supervision, while drastically boosting clustering performance. We therefore recommend weakly supervised clustering as a promising new direction for overcoming the inherent limitations of identifying clusters in a purely unsupervised manner.