Developing a Hybrid Morphological Analyzer for Low-Resource Languages

Document Type

Article

Publication Title

Applied Sciences (Switzerland)

Abstract

Morphological analysis is a fundamental preliminary task for Natural Language Processing (NLP) applications involving speech and language. Kannada, a low-resource language of the Dravidian family, is highly agglutinative and morphologically rich, and dataset development for it is progressing rapidly in response to the growing demand for NLP tools. This study presents a hybrid approach that integrates rule-based and Transformer-based techniques, aiming to maximize their strengths while minimizing their respective limitations. Analyzing inflections in Kannada is challenging because of this morphological richness; to address the problem, 85 paradigms are created using Apertium's Lttoolbox. A Transformer model is then trained on the generated nominal data to produce morphological analyses for out-of-vocabulary inflections. The hybrid approach extends easily to new words as they are added to the dictionary. On a test set of Kannada inflections, the approach obtains a precision of 0.924, a recall of 0.925, and an F1 score of 0.925. The main contributions are rule extraction for paradigm design at the word level; morphological analysis of nouns, verbs, adjectives, pronouns, and indeclinables on a benchmark dataset; and morphological analysis generation using the Transformer architecture.
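
The dictionary-first, model-second flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names make_hybrid_analyzer, TOY_LEXICON, and toy_fallback are hypothetical, the lexicon entries are placeholder strings rather than real Kannada data, and the fallback function only stands in for a trained Transformer model.

```python
from typing import Callable, Dict, List, Tuple


def make_hybrid_analyzer(
    lexicon: Dict[str, List[str]],
    fallback: Callable[[str], List[str]],
) -> Callable[[str], Tuple[List[str], str]]:
    """Build an analyzer that tries a rule-based lexicon before a model.

    `lexicon` is a hypothetical stand-in for the output of a compiled
    paradigm dictionary; `fallback` is a hypothetical stand-in for a
    trained sequence model that handles out-of-vocabulary inflections.
    """

    def analyze(word: str) -> Tuple[List[str], str]:
        known = lexicon.get(word)
        if known:
            # The word form is covered by the dictionary paradigms.
            return known, "rule"
        # Out-of-vocabulary form: defer to the learned model.
        return fallback(word), "model"

    return analyze


# Placeholder entries (assumption): illustrative tags, not real data.
TOY_LEXICON: Dict[str, List[str]] = {
    "form_a": ["lemma_a<n><sg><nom>"],
    "form_b": ["lemma_a<n><pl><nom>"],
}


def toy_fallback(word: str) -> List[str]:
    # Assumption: a real system would call a trained model here.
    return [word + "<unk>"]


analyze = make_hybrid_analyzer(TOY_LEXICON, toy_fallback)
```

In this sketch, adding an entry to the lexicon moves that word form from the model path to the rule path without retraining anything, which mirrors the abstract's point that the approach extends as words are added to the dictionary.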

DOI

10.3390/app15105682

Publication Date

5-1-2025

