Abstract:
Privacy-preserving computation is of utmost importance in a cloud computing environment,
where a client often needs to send sensitive data over untrusted networks to servers that
offer computing services. Sharing the raw data, or an abstract representation of a labelled
or unlabelled dataset, on cloud platforms can expose sensitive information to an adversary.
For example, in an emotion classification task on text, an adversary-agnostic abstract
representation of the text data may eventually allow an adversary to identify demographic
attributes of the authors, such as their gender and age. The leakage of sensitive
information from the data may take place due
to eavesdropping over the network or malware residing at the server. Privacy-preserving
computation workflows aim to prevent such leakage of sensitive information by introducing
a suitable encoding transformation on sample data points. Such an encoding strategy has
two objectives. First, it should be difficult to reconstruct the original data without
knowledge of the encoding strategy and its parameters. Second, the
computational results obtained using the encoded data should not be substantially different
from those obtained using the same data in its original form. Standard encoding mechanisms,
such as locality sensitive hashing (LSH), cater to the first objective of a
privacy-preserving computation workflow, but may not always adequately satisfy the second.
In this thesis, we focus on the second objective. The computational tasks that we consider
are supervised classification and K-means clustering, the latter of which has been widely
used for various data mining jobs. We address the problem of privacy-preserving
computation on these two tasks in three different ways.
First, we propose a new variant of the K-means algorithm which is capable
of privacy preservation in the sense that it takes binary encoded data as input, and does not
require access to the data in its original form at any stage of the computation. The proposed
strategy is capable of producing the required number of clusters which are sufficiently close
to the respective clusters computed from the original non-encoded data. On image or text
data, the results of the proposed strategy are comparable to or better than those of the
standard K-means clustering algorithm.
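
As a rough illustration of clustering that only ever touches binary codes, the sketch below runs a K-means-style loop in Hamming space. It is not the variant proposed in this thesis, whose details are not given in this abstract. The sign-random-projection encoder, the per-bit majority-vote centroid update, and the names `encode`, `hamming` and `binary_kmeans` are all assumptions made for illustration.

```python
import random


def encode(points, n_bits, seed=0):
    """Assumed encoder: sign random projection, one bit per random hyperplane."""
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    return [
        tuple(1 if sum(w * x for w, x in zip(plane, p)) >= 0 else 0 for plane in planes)
        for p in points
    ]


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


def binary_kmeans(codes, k, n_iter=20, seed=0):
    """K-means-style clustering whose only input is a list of binary codes."""
    rng = random.Random(seed)
    centroids = rng.sample(codes, k)
    labels = [0] * len(codes)
    for _ in range(n_iter):
        # Assignment step: nearest centroid in Hamming distance.
        labels = [min(range(k), key=lambda c: hamming(code, centroids[c])) for code in codes]
        # Update step: per-bit majority vote over the members of each cluster.
        new_centroids = []
        for c in range(k):
            members = [code for code, lab in zip(codes, labels) if lab == c]
            if not members:
                # Keep the old centroid when a cluster receives no points.
                new_centroids.append(centroids[c])
                continue
            new_centroids.append(
                tuple(1 if 2 * sum(bits) >= len(members) else 0 for bits in zip(*members))
            )
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

In this sketch the server would receive only the output of `encode`, so the raw vectors and the random hyperplanes stay with the client. How closely the resulting clusters match those of standard K-means depends on the number of bits and on the data, and is not something the sketch guarantees.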
Second, we explore a deep metric learning approach to learn a parameterized
encoding transformation with the objective of maximizing the alignment of the clusters
obtained in the encoded space with those obtained from the original data. To this end,
we train a weakly supervised deep network using triplets constructed from the output of a
clustering algorithm on a subset of the non-encoded data. Our weakly supervised
approach yields a more effective encoding than approaches in which the
encoding process is agnostic of the clustering objective.
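
To make the triplet construction concrete, the following minimal sketch samples (anchor, positive, negative) triplets from cluster labels and evaluates a standard triplet margin loss on encoded vectors. The sampling scheme, the squared Euclidean distance, the margin value, and the names `build_triplets`, `sq_dist` and `triplet_loss` are assumptions for illustration. The abstract does not specify the network or the loss actually used.

```python
import random


def build_triplets(labels, n_triplets, seed=0):
    """Sample (anchor, positive, negative) index triplets from cluster labels.

    The positive shares the anchor's cluster; the negative comes from another one.
    """
    rng = random.Random(seed)
    by_cluster = {}
    for idx, lab in enumerate(labels):
        by_cluster.setdefault(lab, []).append(idx)
    # An anchor needs at least one other member in its own cluster.
    usable = [lab for lab, members in by_cluster.items() if len(members) >= 2]
    if not usable or len(by_cluster) < 2:
        return []
    triplets = []
    for _ in range(n_triplets):
        lab = rng.choice(usable)
        anchor, positive = rng.sample(by_cluster[lab], 2)
        other = rng.choice([l for l in by_cluster if l != lab])
        negative = rng.choice(by_cluster[other])
        triplets.append((anchor, positive, negative))
    return triplets


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(encoded, triplets, margin=1.0):
    """Mean hinge loss max(0, d(a, p) - d(a, n) + margin) over all triplets."""
    if not triplets:
        return 0.0
    total = 0.0
    for a, p, n in triplets:
        gap = sq_dist(encoded[a], encoded[p]) - sq_dist(encoded[a], encoded[n])
        total += max(0.0, gap + margin)
    return total / len(triplets)
```

The supervision is weak in the sense that `labels` would come from a clustering run on a subset of the original data rather than from human annotation. An encoder trained to lower this loss is pushed to keep same-cluster points closer together than different-cluster points in the encoded space.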
Finally, we propose a universal defense mechanism against malicious attempts to steal
sensitive information from data shared on cloud platforms. More specifically, our proposed
method employs an informative-subspace-based multi-objective approach to produce a
sensitive-information-aware encoding of the data representation. A number of experiments
conducted on both standard text and image datasets demonstrate the ability of our proposed
approach to reduce the effectiveness of the adversarial task without substantially degrading the
effectiveness of the primary task itself.