Academic Journals Database
Disseminating quality controlled scientific knowledge

Web Page Clustering using Latent Semantic Analysis

Author(s): Lalit A. Patil | S M. Kamalapur | Dhananjay Kanade

Journal: International Journal of Computer Applications
ISSN 0975-8887

Volume: iccia;
Issue: 6;
Date: 2012;

Keywords: Canonical Correlation Analysis | probabilistic latent semantic analysis | term-frequency | Web page clustering

Web mining techniques such as clustering help organize web content into appropriate subject-based categories so that efficient search and retrieval become manageable. Traditional web page clustering typically uses only the page content (usually the page text) in an appropriate feature vector representation, such as bag-of-words or term frequency/inverse document frequency (TF-IDF), and then applies standard clustering algorithms (e.g., K-means, suffix tree clustering, query directed clustering). However, the advent of social bookmarking websites, such as StumbleUpon and Delicious, has led to a huge amount of user-generated content associated with web pages. For example, users can provide captions for images on the internet, or tags for web pages and other media content they regularly browse. Such user-generated content can therefore provide useful information in various forms, such as metadata, or in more explicit ways, such as tags. In multi-view learning, the features can be split into two subsets, each of which alone is sufficient for learning. Here, as with other unsupervised learning algorithms, multiple views of the data can often help in extracting better features. Canonical Correlation Analysis (CCA) is an unsupervised feature extraction technique for finding dependencies between two (or more) views of the data by maximizing the correlations between the views in a shared subspace. But the drawback of CCA is that it gives ... The first approach is based on an annotation-based probabilistic latent semantic analysis (PLSA) over document-word and tag-word co-occurrence matrices.
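
The text-only baseline mentioned in the abstract (a TF-IDF feature vector per page, followed by K-means) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the whitespace tokenization, the smoothed IDF formula, the use of cosine similarity, the deterministic farthest-first seeding, the sample pages, and the function names (`tfidf_vectors`, `cosine`, `kmeans`) are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, so no term ever gets a zero weight.
        vec = {
            t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iterations=10):
    """K-means with cosine similarity; returns one cluster label per vector."""
    # Assumption: farthest-first seeding, chosen only to keep the run deterministic.
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        farthest = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(dict(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each page goes to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        # Update step: each centroid becomes the normalized sum of its members.
        for j in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == j:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:
                norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
                centroids[j] = {t: w / norm for t, w in total.items()}
    return labels


# Hypothetical page texts: two about football, two about cooking.
pages = [
    "football match goal team score",
    "team wins football league match",
    "recipe pasta tomato sauce cooking",
    "cooking recipe for tomato soup",
]
labels = kmeans(tfidf_vectors(pages), k=2)
print(labels)
```

A tag-based view of the same pages could be vectorized in the same way from the tag text; how the two views are then combined is the subject of the approaches the abstract goes on to describe.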