Gigafida and slWaC: topic comparison

Author(s): Nataša Logar Berginc | Nikola Ljubešić

Journal: Slovenščina 2.0 : Empirične, Aplikativne in Interdisciplinarne Raziskave
ISSN 2335-2736

Volume: 1
Issue: 1
Start page: 78
Date: 2013

Keywords: Slovenian language | reference corpus | Web corpus | topic modeling

The article examines two issues: (a) how texts from the Internet are incorporated into existing reference corpora, considered alongside the existence of dedicated web corpora, and (b) the two most recent corpora of Slovenian texts: the Gigafida corpus, which consists mainly of printed texts and to a lesser extent of web texts, and the slWaC corpus, which is compiled entirely from web texts. Similarities and differences between the two corpora are first identified using topic modelling, and the same method is then applied to the individual taxonomic categories of the Gigafida corpus. The first part of the analysis shows that compilers of reference corpora are still inconsistent in how they incorporate Internet texts into corpora that are meant to give an overall picture of a language. When compilers do decide to include web texts, the range of genres included is generally broad. The second part of the analysis shows significant thematic variation between the Gigafida and slWaC corpora and identifies the most typical themes covered by each of the six parts of the Gigafida corpus.