Online Public Access Catalogue (OPAC)
Library, Documentation and Information Science Division

“A research journal serves that narrow borderland which separates the known from the unknown”

- P.C. Mahalanobis



On efficient center-based clustering: from unsupervised learning to clustering under weak supervision / Avisek Gupta

Material type: Text
Publication details: Kolkata: Indian Statistical Institute, 2021
Description: 175 pages
DDC classification:
  • 23 000SA.072 G977
Contents:
1 Introduction to center-based clustering -- 2 Fast automatic estimation of the number of clusters from the minimum inter-center distance for k-means clustering -- 3 On the unification of k-harmonic means and fuzzy c-means clustering problems under kernelization -- 4 Improved efficient model selection for sparse hard and fuzzy center-based clustering -- 5 Fuzzy clustering to identify clusters at different levels of fuzziness: an evolutionary multi-objective optimization approach -- 6 Transfer clustering using multiple kernel metrics learned under multi-instance weak supervision -- 7 Conclusion
Production credits:
  • Guided by Prof. Swagatam Das
Dissertation note: Thesis (Ph.D.) - Indian Statistical Institute, 2021
Summary: The problem of clustering aims to partition unlabeled data so as to reveal the natural affinities between data instances. Modern learning algorithms need to be designed to be applicable to larger datasets that can also be high-dimensional. While acquiring more features and instances can be beneficial, the addition of noisy and irrelevant features can obfuscate the true structure of the data; distance metrics can also fail in high dimensions. To address these challenges, complex mathematical structures can be used to model different aspects of a problem; however, they can also lead to algorithms with high computation costs, making those algorithms infeasible for larger datasets. Among existing classes of clustering methods, we focus on the class of center-based clustering, which in general consists of methods with low computation costs that scale linearly with the size of the dataset. We identify different factors that influence how effective center-based clustering methods can be. Estimating the number of clusters is still a challenge, for which we study existing approaches with a wide range of computation costs, and propose two low-cost approaches based on two possible definitions of a cluster. Selecting a suitable distance metric for clustering is also an important factor. We incorporate a kernel metric in a center-based clustering method and investigate its performance in the presence of a large number of clusters. Feature selection and feature extraction methods exist to identify which features can help estimate the clusters. We focus on sparse clustering methods and propose a significantly lower-computation approach to simultaneously select features while clustering. Another important factor is the nature of the clusters identified.
Hard clustering methods identify discrete clusters, whereas soft clustering methods allow soft assignments of data points to more than one cluster, thereby allowing overlapping clusters to be identified. We propose a multi-objective evolutionary fuzzy clustering method that can identify partitions at different degrees of overlap. Clustering under fully unsupervised conditions comes with a serious limitation. Instead of exploring a wide solution space completely unsupervised, some additional supervision can bias the method toward clustering solutions that better fit a dataset. This motivates us to propose a transfer clustering method that learns a multiple-kernel metric in a weakly supervised setting, and then transfers the learned metric to cluster a dataset in an unsupervised manner. Less effort is required to provide weak supervision than full supervision, while clustering performance is drastically boosted. We recommend weakly supervised clustering as a promising new direction for overcoming the inherent limitations of identifying clusters in an unsupervised manner.
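The center-based clustering the abstract refers to is exemplified by k-means (Lloyd's algorithm), the subject of chapter 2: alternate between assigning each point to its nearest center and moving each center to the mean of its assigned points. The sketch below is a minimal illustration in pure Python, not code from the thesis; it assumes a naive first-k initialization, where practical implementations would use k-means++ or repeated random restarts.

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest center and recomputing each center as its cluster mean."""
    centers = list(points[:k])  # naive init; k-means++ is preferred in practice
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: assignments can no longer change
            break
        centers = new_centers
    return centers, clusters

# Two well-separated blobs; k-means recovers one center per blob.
data = [(0.0, 0.0), (0.1, 0.2), (-0.2, 0.1),
        (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
centers, clusters = kmeans(data, k=2)
```

Both steps are linear in the number of points, which is the low, dataset-size-linear cost the abstract attributes to center-based methods; the harder question the thesis addresses is how to choose k, e.g. from quantities such as the minimum inter-center distance across runs with different k.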


Includes bibliographical references



Library, Documentation and Information Science Division, Indian Statistical Institute, 203 B T Road, Kolkata 700108, INDIA
Phone no. 91-33-2575 2100, Fax no. 91-33-2578 1412, ksatpathy@isical.ac.in