Author(s): Shreya Jain | Samta Gajbhiye
Journal: International Journal of Soft Computing & Engineering
ISSN 2231-2307
Volume: 2;
Issue: 1;
Start page: 392;
Date: 2012;
Keywords: Data Mining | Text Mining | Clustering | K-Means Clustering | Silhouette plot
ABSTRACT
Clustering is a powerful technique for large-scale topic discovery from text. It involves two phases: first, feature extraction maps each document or record to a point in a high-dimensional space; then, clustering algorithms automatically group the points into a hierarchy of clusters. To improve the efficiency and accuracy of mining tasks on high-dimensional data, the data must therefore be pre-processed by an efficient dimensionality reduction method. Cluster analysis has recently become a popular data analysis method in a number of areas. K-Means is a well-known partitioning-based clustering technique that attempts to find a user-specified number of clusters, each represented by its centroid. In this paper, a k-means algorithm is used to cluster the data sets; it outputs k disjoint clusters, each with a concept vector, which is the centroid of the cluster normalized to have unit Euclidean norm. We also analyze different sets of k-values to obtain better performance from the k-means clustering algorithm.
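
The procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes documents are already available as numeric feature vectors, that cluster assignment uses cosine similarity to unit-norm concept vectors, and that candidate k-values are compared by mean silhouette width computed with Euclidean distance. The names `spherical_kmeans`, `mean_silhouette`, and the toy `docs` data are hypothetical and chosen only for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, iters=20, seed=0):
    """Return (labels, concept_vectors); assumes cosine-similarity assignment."""
    rng = random.Random(seed)
    points = [normalize(d) for d in docs]
    # Assumed initialization: k distinct documents chosen at random.
    concepts = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins the most similar concept vector.
        labels = [max(range(k), key=lambda j: dot(p, concepts[j])) for p in points]
        # Update step: cluster centroid rescaled to unit Euclidean norm.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous concept vector
                centroid = [sum(col) / len(members) for col in zip(*members)]
                concepts[j] = normalize(centroid)
    return labels, concepts


def mean_silhouette(points, labels):
    """Average silhouette width over all points, using Euclidean distance."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy term-count vectors (hypothetical data, three loosely separated topics).
docs = [[5, 1, 0], [4, 0, 1], [0, 3, 1], [1, 4, 0], [0, 1, 6], [1, 0, 5]]
unit_docs = [normalize(d) for d in docs]

# Compare several candidate k-values by mean silhouette width.
for k in (2, 3, 4):
    labels, concepts = spherical_kmeans(docs, k)
    print(k, round(mean_silhouette(unit_docs, labels), 3))
```

A higher mean silhouette width suggests tighter, better separated clusters, so under these assumptions the k with the largest score would be the preferred choice. Because the result depends on the random initialization, several seeds would normally be tried for each k.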