Nordic Journal of African Studies 19(1): 1–22 (2010)
Finite State Methods in Morphological Analysis of Runyakitara Verbs
Fridah KATUSHEMERERWE
Makerere University, Uganda
&
Thomas HANNEFORTH
Potsdam University, Germany
Until now, there has been no automatic analyser and generator for the word forms of Runyakitara. In this paper, we present a computational model for grammatical Runyakitara verbs. This model, RUNYAGRAM, is based on freely available open-source finite-state methods and, in particular, the fsm2 interpreter. It captures the morphotactic structures with non-recursive context-free grammars supported by fsm2, and the morphophonological alternations with a finite composition of commonly used context-dependent string rewriting rules. Their combination results in a finite state transducer that can be exported and used on numerous software development platforms. The resulting transducer is an important building block that can be employed in comprehensive morphological analysers, syntactic parsers, spell-checkers, text-to-speech synthesizers, and machine translation systems. Currently, 86% of the verb forms are recognized. It is possible to increase the coverage or, alternatively, to adapt the approach of the RUNYAGRAM system to related languages.
Keywords: morphological analysis, finite state methods, Runyakitara verb.
1. INTRODUCTION
One of the core enabling technologies required in natural language processing applications is a morphological analyzer. It is an established fact in computational linguistics that a morphological analyzer is a starting point for many natural language processing applications (Pretorius & Bosch, 2003; Yona & Wintner, 2005).
Computational morphology deals with automatic word-form recognition and generation. The general challenges posed by a computational morphological analyzer, as described by Pretorius and Bosch (2003), are twofold:
• Morphemes that make up words cannot combine at random, but are restricted to certain combinations and orders. A morphological analyzer needs to know which combinations of morphemes (morphotactics) are valid.
• Morphemes may be realized in different ways depending on their context. A morphological analyzer needs to recognize the morphophonological changes between lexical and surface forms (morphophonological alternation).
Automatic morphological analyzers and generators must take into consideration the above issues.
Comprehensive morphological analyzers are available for well-documented languages such as English, Swedish, German, Arabic, and Finnish (Karttunen & Beesley, 2005:77). Considerable work has also been done in applying finite state methods to Bantu languages: the Kiswahili morphological analyzer (Hurskainen, 1992; 1996; 2004), the Zulu analyzer prototype (Pretorius & Bosch, 2003), Lingala verb morphology (Karttunen, 2003), Ekegusii verb morphology (Elwell, 2005), Kinyarwanda (Muhirwe & Trosterud, 2008), and Setswana verb morphology (Pretorius, Berg, & Pretorius, 2009).
However, given that there are more than five hundred Bantu languages, almost all of them remain untreated. Although Bantu languages are classified as largely agglutinative and exhibit significant inherent structural similarity, they differ substantially in their phonological features, which implies that each Bantu language requires an independent morphological analyzer.
Runyakitara is one of those under-resourced Bantu languages with no computational morphology. Bernsten (1998) splits Runyakitara into four major dialects: Runyankore, Rukiga, Runyoro, and Rutooro. Guthrie (1967) groups these four dialects into two languages belonging to the Narrow Bantu branch of the Niger-Congo family, Nyankore-Kiga (E.13) and Nyoro-Ganda (E.11). For the purposes of this paper, Runyakitara is taken to mean the two major language clusters mentioned above, Runyoro-Rutooro and Runyankore-Rukiga, denoted by R-R in the following.
Runyakitara is spoken by approximately six and a half million (6,500,000) people in nineteen districts of Western Uganda. As a major language in Uganda and in some parts of Tanzania and the Democratic Republic of Congo, R-R deserves computational attention: it has a large number of speakers, it is a language of the media in western Uganda (two regular newspapers – one online), and it has a rich history and culture that should be preserved. Besides, the language is a medium of instruction at the lower levels of primary education in Western Uganda, and we shall consider how computational efforts may add value to language education. The morphology of the verb in R-R, as has been stressed by other Bantu researchers (Hurskainen, 1992; Elwell, 2005), is one of the complex morphological systems known, which means that it needs special attention.
2. RUNYAKITARA VERB MORPHOLOGY AND THE COMPUTATIONAL CHALLENGE
A verb in a typical Bantu language will take on many prefixes and suffixes. The Runyakitara verb morphology poses the following challenges to computational modeling: a. number of morphemes, b. morpheme order, c. morpheme combination, d. allomorphs, and e. vowel harmony. These are discussed in the sub-sections below.
2.1 NUMBER OF MORPHEMES INVOLVED
The Bantu verb template described in many studies suggests about 8 to 15 morpheme slots, as follows:
Table 1. Bantu Verb Template (Nurse & Philippson, 2003).
The above generic template raises many questions if one considers it with respect to R-R morphology: what is considered a morpheme on the template? If the verb extension (in Slot 7) is a morpheme, does it mean that such extensions as causative, applicative, passive, etc. are allomorphs of the same morpheme? This and many other questions prompted us to devise an R-R verb template that caters more specifically for the number of morphemes present in the language.
There are many morphemes involved in the formation of R-R verbs; therefore, it is important to expand the template. These can be broadly classified as prefixes (morphemes left of Slot 0), root (Slot 0), and suffixes (morphemes after Slot 0). The following template shows the morphemes involved in the formation of Runyakitara verbs:
Table 2. Runyakitara Verb Template.
Note: Slot 0 represents the root; to the right of Slot 0 are the suffixes to the root. Slot 1 is for verb extensions: Ca – causative, Apl – applicative, Rec – reciprocal, Pas – passive, Int – intensive, Stat – stative, Rev – reversive. Slot 2 represents the verb end: Ind – indicative, subj – subjunctive, past – past tense. Slot 3 indicates post-final morphemes: pf1 – post-final 1; pf2 – post-final 2. To the left of Slot 0: -1 Asp – aspect; -2 – object pronouns; -3 – tense/aspect markers; -4 Ng2 – negative 2; -5 Sp – subject prefix; -6 Asp – aspect; -7 Ng1 – negative 1. For a more detailed description and examples, refer to Appendix A.
Runyakitara has the typical characteristics of template morphology as outlined by Spencer (1991), who observes that template morphology poses a computational challenge. According to Spencer, template morphology is a morphological system in which a verb stem or root takes obligatory affixes as well as a set of optional affixes. The combinations of morphemes make automatic analysis difficult because one first has to sort out which affixes fit the root to form specific verb forms.
Adding to the number of morphemes involved, subject and object markers show agreement with the noun classes in question. When the subject is not included, they serve as subject and object pronouns. These markers appear as prefixes to the verb root. R-R has eighteen (18) noun classes; therefore, subject and object pronouns add up to 18 in each case. In addition, R-R is a type 3 language according to the classification given by Maho (2007), which means that it allows two or more objects in a construction. Evidence from Runyakitara shows that the language can have a double object construction, that is, a verb can carry markers for both the direct and the indirect object in the same construction. An example is mu-mu-n-kwat-ire (you grab/hold him for me), where mu-n marks the double objects representing him and me. This adds to the number of morphemes, which is large enough to pose a challenge.
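The slot ordering described in the note to Table 2 can be illustrated with a minimal Python sketch. This is not the RUNYAGRAM grammar or fsm2 code; the names `SLOT_LABELS`, `REPEATABLE_SLOTS`, and `is_valid_order` are hypothetical, and the decision to let only the object slot (-2) and the extension slot (1) repeat is an assumption drawn from the double object example above and from the fact that several extensions can follow one root.

```python
from typing import List, Tuple

# Slot numbers and labels follow the note to Table 2 (labels abbreviated here).
SLOT_LABELS = {
    -7: "Ng1", -6: "Asp", -5: "Sp", -4: "Ng2", -3: "T/A",
    -2: "Obj", -1: "Asp", 0: "Root", 1: "Ext", 2: "VerbEnd", 3: "PostFinal",
}

# Assumption: only object markers and verb extensions may occur more than once.
REPEATABLE_SLOTS = {-2, 1}


def is_valid_order(morphs: List[Tuple[str, int]]) -> bool:
    """Check that (morph, slot) pairs respect the left-to-right slot order."""
    slots = [slot for _, slot in morphs]
    if any(slot not in SLOT_LABELS for slot in slots):
        return False
    if slots.count(0) != 1:  # exactly one root is required
        return False
    for prev, cur in zip(slots, slots[1:]):
        if cur < prev:
            return False  # a morpheme appears to the left of its slot
        if cur == prev and cur not in REPEATABLE_SLOTS:
            return False  # this slot may be filled only once
    return True


# mu-mu-n-kwat-ire: subject prefix, two object markers, root, past verb end
double_object = [("mu", -5), ("mu", -2), ("n", -2), ("kwat", 0), ("ire", 2)]
print(is_valid_order(double_object))
```

A check of this kind only covers slot order; it says nothing about which morphemes may co-occur, which is the harder problem discussed in the following sub-sections.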
2.3 MORPHEME COMBINATION
Although some studies have been carried out on the combination of morphemes in Bantu languages (Hayman, 2007), limited research is available on Runyakitara morpheme combination, specifically with reference to verb extensions. As noted by Hayman (2007), verb extensions are difficult to analyze mainly because they have various functions, are numerous, and often occur in long sequences. Runyakitara has seven (7) verbal extensions, which can be added to the root individually or in combination. For example, one can have a verb with verb extensions such as:
reeb-a (see), reeb-es-a (see with), reeb-an-a (see each other), reeb-w-a (be seen), reeb-es-an-a (make each other to see), reeb-an-is-a (make to see each other), reeb-es-an-is-ibw-a (be made to make them see each other).
In the examples above, es, an, w, is, and ibw are all verb extensions playing different roles. The causative morphs es and is occupy different positions in these examples, and no available study has established which verbal extensions can combine in Runyakitara or the order in which they can follow one another.
2.2 MORPHEME ORDER
Although the Bantu verb template is presumed to present a fixed order of morphemes, and provides Slot 4 in Table 1, for example, as a slot for tense/aspect markers, some morphemes in Runyakitara violate this order. Specific cases are progressive ni, reflexive e, and past ire. As indicated on the Runyakitara template in Table 2, ni comes before the subject marker in the construction, while other tense/aspect markers follow the subject marker, e.g.
ni-ba-mu-reeb-a (they are seeing him)
ba-ka-mu-reeb-a (they saw him [last year or some months back])
ba-mu-reeb-ire (they saw him [yesterday])
In the above verb constructions, ni, ka, and ire are tense/aspect markers but appear in different positions with respect to the root.
Also, the order of verb extensions on the template is not necessarily the order in which they are constructed. That is to say, extensions can attach to verbs depending on the argument structure, so there is no fixed order in which they are supposed to appear in the construction of the verb. For example, the verb reeb-a (see) yields reeb-es-a (see with), reeb-an-a (see each other), reeb-es-an-a (make each other to see), reeb-an-is-a (make … to see each other), reeb-an-is-ibw-a (be made to make … see each other), reeb-er-a (see for), and reeb-er-an-a (see for each other).
All this indicates that there is a lot of flexibility regarding which morphemes precede and follow one another, because is and es are both causatives.
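Since no fixed extension order has been established, an analyzer has to accept any sequence of known extensions between root and final vowel. The following Python sketch illustrates this with a simple backtracking segmenter. It is an illustration only, not the method used in RUNYAGRAM: the inventory `EXTENSIONS` contains just the morphs that appear in the examples above, the category labels are inferred from the glosses, and the function name `segment_extensions` is hypothetical.

```python
from typing import Dict, List

# Extension morphs taken from the examples above; labels inferred from glosses.
EXTENSIONS: Dict[str, str] = {
    "es": "Caus", "is": "Caus", "an": "Rec",
    "w": "Pas", "ibw": "Pas", "er": "Apl",
}


def segment_extensions(stem: str, root: str, final: str = "a") -> List[List[str]]:
    """Return every way to split the material between root and final vowel
    into a sequence of known extensions (an empty list means no analysis)."""
    if len(stem) < len(root) + len(final):
        return []
    if not (stem.startswith(root) and stem.endswith(final)):
        return []
    middle = stem[len(root):len(stem) - len(final)]
    results: List[List[str]] = []

    def walk(rest: str, found: List[str]) -> None:
        if not rest:
            results.append(found)
            return
        for ext in EXTENSIONS:
            if rest.startswith(ext):
                walk(rest[len(ext):], found + [ext])

    walk(middle, [])
    return results


for form in ["reeba", "reebesana", "reebanisa", "reebesanisibwa"]:
    print(form, segment_extensions(form, "reeb"))
```

Because the segmenter imposes no ordering constraints, it accepts both reeb-es-an-a and reeb-an-is-a; it would also accept sequences that speakers reject, which is why the permitted combinations still need to be established empirically.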
2.4 ALLOMORPHY
Runyakitara has various allomorphs, that is, different realizations of the same morpheme. A case in point is the causative morpheme, which has several different realizations [es/is/iz/s/y]. The applicative, passive, stative, and reversive morphemes are no exception. All these pose a challenge to computational modeling.
2.5 VOWEL HARMONY
Katamba (1984) analyses vowel harmony of verb extensions in Luganda, a language closely related to Runyakitara. His analysis, which classifies the vowels involved in harmony as mid and nonmid, establishes the existence of vowel harmony in the language but does not help much when it comes to formalizing morphemes for computational purposes. The reason is that it is difficult to identify the location of mid and nonmid vowels in the string. The suggestion by Morris and Kirwan (1972) to use the penultimate syllable is useful here. The penultimate syllable is the syllable preceding the final one, so it makes the vowel in question easy to locate. For example, in the word bo-ro-go-ta (flow of water), the penultimate syllable is ‘go’. This led to the observation that when the vowel of the penultimate syllable is e or o, the causative extension is es. On the other hand, when the vowel of the penultimate syllable is a, i, or u, the causative extension is is or iz. The same applies to the applicative, intensive, and stative.
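The penultimate-syllable rule can be sketched as a small Python function. This is a minimal illustration under stated assumptions, not the rewrite rules used in RUNYAGRAM. It assumes that the base form is stem plus final vowel, so that the vowel of the penultimate syllable is the last vowel of the stem. The names `last_vowel` and `causative_allomorph` are hypothetical, and the choice between is and iz is not modeled (the sketch always returns is for nonmid vowels).

```python
from typing import Optional

VOWELS = set("aeiou")
MID_VOWELS = {"e", "o"}


def last_vowel(stem: str) -> Optional[str]:
    """Return the last vowel of the stem, i.e. the vowel of the penultimate
    syllable once a final vowel such as -a is added (assumption)."""
    for ch in reversed(stem):
        if ch in VOWELS:
            return ch
    return None


def causative_allomorph(stem: str) -> str:
    """Pick 'es' after a mid vowel (e, o) and 'is' after a, i, or u."""
    vowel = last_vowel(stem)
    if vowel is None:
        raise ValueError(f"no vowel found in stem {stem!r}")
    return "es" if vowel in MID_VOWELS else "is"


for stem in ["reeb", "reeban", "borogot", "kwat"]:
    print(f"{stem}-{causative_allomorph(stem)}-a")
```

Under this reading, the different causative morphs in reeb-es-a and reeb-an-is-a are both predicted: the stem reeb ends in a mid vowel, while reeb-an ends in a nonmid vowel. Whether harmony really applies to the extended stem in this way is an assumption of the sketch.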
Given the nature of Runyakitara morphology, it was important to carefully select a formalization approach appropriate to its structure. A phrase structure grammar was therefore identified as suitable for handling the concatenative nature of Runyakitara morphology. The rules proposed by Selkirk (Spencer, 1991) were applied, written as W+A for suffixing and A+W for prefixing. It was clear, however, that the rules Selkirk proposed only account for the concatenative nature of morphology. They are nonetheless helpful for Runyakitara concatenative morphology. It was therefore important to also find a way of handling morphophonological and orthographic processes.
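The two rule shapes, W+A for suffixing and A+W for prefixing, can be illustrated with a small Python sketch that builds a word as a binary tree. This is only an illustration of the concatenative idea, not the fsm2 grammar; the function names `suffix`, `prefix`, `morphemes`, and `surface` are hypothetical, and the derivation order chosen for the example is an assumption.

```python
from typing import List, Tuple, Union

# A word is either a bare morph or a pair built by one of the two rules.
Tree = Union[str, Tuple["Tree", "Tree"]]


def suffix(word: Tree, affix: str) -> Tree:
    """Rule W -> W + A: attach an affix to the right of a word."""
    return (word, affix)


def prefix(affix: str, word: Tree) -> Tree:
    """Rule W -> A + W: attach an affix to the left of a word."""
    return (affix, word)


def morphemes(tree: Tree) -> List[str]:
    """Read the morphs off the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return morphemes(left) + morphemes(right)


def surface(tree: Tree) -> str:
    """Join the morphs with hyphens, as in the examples in the text."""
    return "-".join(morphemes(tree))


# ni-ba-mu-reeb-a (they are seeing him), built from the root outwards
stem = suffix("reeb", "a")
word = prefix("ni", prefix("ba", prefix("mu", stem)))
print(surface(word))
```

Rules of this shape only concatenate morphs; changes such as the es/is alternation have to be handled by a separate set of rewrite rules applied to the concatenated string.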