Academic Journals Database
Disseminating quality controlled scientific knowledge

Open-Source Boundary-Annotated Qur’an Corpus for Arabic and Phrase Breaks Prediction in Classical and Modern Standard Arabic Text

Author(s): SAWALHA, M.S. | BRIERLEY, C. | ATWELL, E.

Journal: Journal of Speech Sciences
ISSN 2236-9740

Volume: 2;
Issue: 2;
Start page: 175;
Date: 2012;

Keywords: phrase break prediction | prosodic annotation | Tajwid recitation | N-gram and HMM taggers | boundary-annotated and PoS-tagged Qur’an

ABSTRACT
A phrase break classifier is needed to predict natural prosodic pauses in text to be read out loud by humans or machines. To develop phrase break classifiers, we need a boundary-annotated and part-of-speech tagged corpus. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener; manual annotation must be done by an expert linguist. For Arabic, there are no existing suitable resources. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwid (recitation) mark-up in the Qur’an which we then interpret as additional text-based data for computational analysis. This mark-up is prescriptive, and signifies a widely-used recitation style, and one of seven original styles of transmission. Here we report on version 1.0 of our Boundary-Annotated Qur’an dataset of 77430 words and 8230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. We then use this dataset to train, test, and compare two probabilistic taggers (trigram and HMM) for Arabic phrase break prediction, where the task is to predict boundary locations in an unseen test set stripped of boundary annotations by classifying words as breaks or non-breaks. The preponderance of non-breaks in the training data sets a challenging baseline success rate: 85.56%. However, we achieve significant gains in accuracy with a trigram tagger, and significant gains in performance recognition of minority class instances with both taggers via the Balanced Classification Rate metric. This is initial work on a long-term research project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
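
The abstract contrasts plain accuracy with the Balanced Classification Rate on a class-imbalanced task. The sketch below illustrates that contrast. It assumes the common definition of BCR as the mean of sensitivity (recall on breaks) and specificity (recall on non-breaks); the abstract does not spell out the formula, and the function names, label strings, and toy label sequences are hypothetical, not data or code from the paper.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of words whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def majority_baseline_accuracy(gold):
    """Accuracy obtained by always predicting the most frequent gold label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)


def balanced_classification_rate(gold, predicted, positive="break"):
    """Mean of sensitivity and specificity (assumed definition of BCR)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    tn = sum(1 for g, p in pairs if g != positive and p != positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Toy labels only (hypothetical): six non-breaks followed by two breaks.
gold = ["non-break"] * 6 + ["break"] * 2

# A degenerate tagger that never predicts a break.
always_non_break = ["non-break"] * 8

# A tagger that finds one of the two breaks at the cost of one false alarm.
some_tagger = ["non-break"] * 5 + ["break", "break", "non-break"]

baseline = majority_baseline_accuracy(gold)
acc_degenerate = accuracy(gold, always_non_break)
acc_tagger = accuracy(gold, some_tagger)
bcr_degenerate = balanced_classification_rate(gold, always_non_break)
bcr_tagger = balanced_classification_rate(gold, some_tagger)

print(f"majority baseline accuracy: {baseline:.4f}")
print(f"accuracy: degenerate={acc_degenerate:.4f} tagger={acc_tagger:.4f}")
print(f"BCR:      degenerate={bcr_degenerate:.4f} tagger={bcr_tagger:.4f}")
```

On these toy labels both taggers reach the same accuracy, yet only the second one recovers any breaks, and BCR is the figure that separates them. This is why a metric of this kind is useful when non-breaks dominate the data.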