Academic Journals Database
Disseminating quality controlled scientific knowledge

Automated Online News Content Extraction

Author(s): B. A. Ojokoh

Journal: International Journal of Computer Science Research and Application
ISSN 2012-9564

Volume: 02
Issue: 03
Start page: 02
Date: 2012

Keywords: Online news | Information extraction | RSS feeds | Title | HTML | Document Object Model | Search Engine

With the growth of the Internet and related tools, the volume of online resources has grown exponentially. Paradoxically, this growth has made finding, extracting and aggregating relevant information more difficult. Finding and browsing news is now one of the most important Internet activities. This paper presents a hybrid method for extracting the content of online news articles. The method combines RSS feeds with HTML Document Object Model (DOM) tree extraction. The approach is simple, and it effectively addresses the problems of heterogeneous news layouts and changing content that affect many existing methods. Experimental results on selected news sites show that the approach extracts news article content automatically, effectively and consistently. The proposed method can also be adopted for other news sites.
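
The abstract does not spell out the algorithm, so the following Python sketch only illustrates the general idea of pairing an RSS feed with a walk over the HTML element tree. It is not the author's implementation. The function and class names (parse_rss_items, ParagraphCollector, extract_article_text), the sample feed and page, and the heuristic of keeping the container with the most paragraph text are all assumptions made for illustration. Only the Python standard library is used, and the markup is assumed to be reasonably well formed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def parse_rss_items(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 string."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        items.append((title, link))
    return items


class ParagraphCollector(HTMLParser):
    """Groups <p> text by the nearest enclosing container element."""

    # Hypothetical choice of tags treated as layout containers.
    CONTAINERS = {"body", "div", "article", "section", "td"}

    def __init__(self):
        super().__init__()
        self.stack = []    # ids of the containers that are currently open
        self.blocks = {}   # container id -> list of paragraph strings
        self.next_id = 0
        self.in_p = False
        self.buf = []

    def flush_paragraph(self):
        """Store the paragraph being read, if any, under its container."""
        if self.in_p:
            text = " ".join("".join(self.buf).split())
            if text:
                key = self.stack[-1] if self.stack else -1
                self.blocks.setdefault(key, []).append(text)
        self.in_p = False
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self.flush_paragraph()
            self.stack.append(self.next_id)
            self.next_id += 1
        elif tag == "p":
            self.flush_paragraph()  # tolerate a missing </p>
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()
        elif tag in self.CONTAINERS:
            self.flush_paragraph()
            if self.stack:
                self.stack.pop()

    def handle_data(self, data):
        if self.in_p:
            self.buf.append(data)


def extract_article_text(html):
    """Return the paragraphs of the container holding the most text."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector.flush_paragraph()
    if not collector.blocks:
        return ""
    best = max(collector.blocks.values(),
               key=lambda paragraphs: sum(len(p) for p in paragraphs))
    return "\n".join(best)


# Made-up sample data; a real run would download the feed and each linked page.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example News</title>
<item><title>Council approves new bridge</title>
<link>http://news.example/bridge</link></item>
</channel></rss>"""

SAMPLE_PAGE = """<html><body>
<div id="nav"><p>Home</p><p>Sport</p></div>
<div id="story"><p>The council voted on Monday to approve the bridge.</p>
<p>Construction is expected to begin next year.</p></div>
<div id="footer"><p>Contact us</p></div>
</body></html>"""

if __name__ == "__main__":
    for title, link in parse_rss_items(SAMPLE_RSS):
        print(title, link)
    print(extract_article_text(SAMPLE_PAGE))
```

In this sketch the feed supplies the title and the article URL, which do not depend on page layout, while the tree walk locates the body text on the linked page. A production extractor would likely need a more robust HTML parser and a better scoring rule than raw text length.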