Automated Essay Scoring (AES) systems present an efficient solution for assessing writing proficiency in high-volume educational settings, though concerns about their accuracy and fairness persist. This study investigates the use of ChatGPT 4.0, a powerful large language model (LLM), as an AES tool for English as a Second Language (ESL) essays. Utilizing the English Language Learner Insight, Proficiency, and Skills Evaluation (ELLIPSE) corpus, which includes 6,482 essays across 44 prompts, we analyzed a subset of 1,154 essays to ensure robust statistical analysis.
We developed a custom Python application to interface with the ChatGPT API, varying the temperature parameter (0.5 and 0.7) to assess its impact on scoring consistency and accuracy. Each essay was scored twice by ChatGPT, and these scores were compared to human ratings using Spearman's rank correlation and the Wilcoxon signed-rank test.
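To illustrate the kind of interaction involved, a minimal sketch of scoring one essay through the OpenAI chat API at a given temperature is shown below. The prompt wording, the "gpt-4" model identifier, the 1-5 scale, and the score_essay helper are illustrative assumptions, not the authors' actual application.

```python
# Minimal sketch of scoring one essay via the OpenAI Chat Completions API.
# Prompt wording, model name, and the 1-5 scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def score_essay(essay_text: str, temperature: float) -> str:
    """Ask the model for a holistic proficiency score at the given temperature."""
    response = client.chat.completions.create(
        model="gpt-4",                 # assumed model identifier
        temperature=temperature,       # e.g. 0.5 or 0.7, as in the study
        messages=[
            {"role": "system",
             "content": ("You are an experienced ESL writing rater. "
                         "Score the essay holistically on a 1-5 scale "
                         "and return only the number.")},
            {"role": "user", "content": essay_text},
        ],
    )
    return response.choices[0].message.content.strip()

# Scoring the same essay at the two temperature settings examined in the study.
scores = {t: score_essay("Sample ESL essay text...", t) for t in (0.5, 0.7)}
```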
Results showed a positive correlation between ChatGPT scores and human ratings, suggesting the model captures some aspects of essay quality; however, a consistent underestimation bias was noted. Correlation coefficients ranged from 0.509 to 0.656, highlighting limitations in the model's ability to reflect human judgment. Further research is needed to mitigate this bias and enhance the accuracy of LLM-based AES systems.
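The statistical comparison itself can be reproduced with standard SciPy routines; the sketch below uses placeholder score arrays rather than the study's data, and the variable names are assumptions.

```python
# Sketch of comparing model scores with human ratings using SciPy.
# The arrays below are placeholder values, not the ELLIPSE data.
import numpy as np
from scipy.stats import spearmanr, wilcoxon

human_scores = np.array([3.5, 4.0, 2.5, 3.0, 4.5])   # human ratings (placeholder)
gpt_scores   = np.array([3.0, 3.5, 2.5, 2.5, 4.0])   # ChatGPT scores (placeholder)

# Spearman's rank correlation: monotonic agreement between the two sets of scores.
rho, rho_p = spearmanr(human_scores, gpt_scores)

# Wilcoxon signed-rank test on the paired differences: a significant result with
# mostly negative differences would point to a systematic underestimation bias.
stat, w_p = wilcoxon(gpt_scores, human_scores)

print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3g})")
print(f"Wilcoxon W = {stat:.1f} (p = {w_p:.3g})")
```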
-
Howard Hao-Jan Chen is a distinguished professor in the English Department at National Taiwan Normal University, Taipei, Taiwan. Professor Chen has published numerous papers in the CALL Journal, the ReCALL Journal, and other related language learning journals. His research interests include computer-assisted language learning, corpus research, and second language acquisition.