Evaluating multiple training configurations for automated POS tagging and parsing of Palenquero creole
Description
Abstract
Palenquero is an endangered creole language spoken in Colombia. Traditionally, scholarly research on Palenquero creole has mainly focused on describing its linguistic features, its lexicon, and its origin. This project explores the possibilities that NLP applications offer for low-resource languages like Palenquero. The goal of this project is to test four training configurations for POS tagging and dependency parsing: non-sequential, word lists, batches of sentences, and sequential batches of sentences and word lists. These configurations were fed to the Arborator-Grew online tool to determine which one achieved the highest accuracy at automatically tagging and parsing Palenquero text. To train the parsing tool, word lists from authoritative dictionaries and sentences from a recent survey of the language were adapted to the Universal Dependencies (UD) framework and stored in CoNLL-U files. The parsing results were measured using Labelled Attachment Scores (LAS) and epochs. The parser's ability to extrapolate its learning to unseen Palenquero text was measured using a modified version of the Unlabelled Attachment Score (UAS), in which only the main head verbs of sentences from a cross-validation test set were considered. The results present the best LAS and epoch for each training attempt in each training configuration. A general upward trend in LAS was observed as the number of training attempts increased. Non-sequential training offered the best results, and sequential training with batches of sentences and word lists ranked second. The cross-validation test confirmed these results. The results demonstrate that a combination of rule-based and context-based approaches offers the best results when training the Arborator-Grew tool. These results indicate the potential of ready-made digital linguistic tools, combined with diverse training data, to significantly improve the automated parsing of lesser-studied languages like Palenquero.
Description
Master's thesis in Theoretical and Applied Linguistics. Supervisor: Geraint Paul Rees