As more language service providers (LSPs) include post-editing (PE) of machine translation (MT) in their workflows, studies on quality evaluation of MT output are becoming increasingly important. We report findings from a user study that evaluates three MT engines (two phrase-based and one neural) from French into Spanish and Italian. We describe results for two text types, product description and blog post, both taken from a motorcycling website that was actually translated by Datawords Datasia. We use task-based evaluation (PE is the task), automatic evaluation metrics (BLEU, edit distance, and HTER), and human evaluation through ranking to establish which system requires the least PE effort, and we lay the groundwork for a method to decide when an LSP could use MT and how to evaluate the output. Unfortunately, large parallel corpora are unavailable for some language pairs and domains. The motorcycling domain and French as a source language are low-resourced, which is the main limitation of this user study; it especially affects the performance of the neural model.
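To illustrate the kind of edit-based measures mentioned above, the following is a minimal sketch (not the authors' actual evaluation pipeline) of a character-level edit distance and an HTER-style score for a single segment. The example sentences are invented placeholders, and true HTER also counts phrase shifts, which this plain word-level edit distance omits.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def hter(mt_output, post_edited):
    """HTER approximation: word-level edits needed to turn the MT output into
    its post-edited version, normalized by the post-edited segment length."""
    hyp, ref = mt_output.split(), post_edited.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)


if __name__ == "__main__":
    mt = "la moto es equipado con un motor potente"    # hypothetical raw MT output
    pe = "la moto está equipada con un motor potente"  # hypothetical post-edited segment
    print(f"character edit distance: {edit_distance(mt, pe)}")
    print(f"HTER (word level): {hter(mt, pe):.3f}")
```

Lower scores indicate that the post-editor changed less of the MT output, which is how such metrics serve as a proxy for PE effort when comparing engines.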