Abstract
This work evaluates the effectiveness of large language models for automatic reference-less translation assessment. We present experiments on mimicking human direct assessment to evaluate the quality of translations in English and Indian languages. We constructed a translation evaluation task in which we applied zero-shot learning, in-context example-driven learning, and fine-tuning of large language models to produce a score out of 100, where 100 represents a perfect translation and 1 represents a poor translation. We compared the performance of our trained systems with existing methods such as COMET, BERT-Scorer, and LABSE, and found that the LLM-based evaluator (LLaMA-2-13B) achieves a comparable or higher overall correlation with human judgments for the considered Indian language pairs.