While recent studies have explored how learner corpora can help teachers develop materials that meet learners’ needs (Brezina et al., 2022), there remains a need for targeted learner corpora that provide insights into specific groups of learners (Götz & Granger, 2024). However, cleaning a corpus for analysis is time-consuming (Brezina et al., 2019). This presentation demonstrates a corpus-cleaning workflow that addresses this problem by combining Large Language Models (LLMs) with Natural Language Processing (NLP) tools such as spaCy and Stanza to streamline the process.
Our approach integrates LLMs with NLP libraries to identify spelling errors, classify words (e.g., proper nouns, technical terms, foreign words), and apply structured markup. Using API-based access to modern LLMs such as Claude or ChatGPT, the workflow enables these models to assist with the systematic analysis and cleaning of a corpus.
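The pipeline described above can be sketched in miniature. This is an illustrative sketch only, not the authors' implementation: tokenisation is reduced to a regex (the actual workflow uses spaCy or Stanza), a small word list stands in for a full lexicon, and `stub_llm_classify` is a placeholder for a real API call to an LLM such as Claude.

```python
import re

# Illustrative word list; a real workflow would use a full dictionary
# or spaCy's vocabulary to find out-of-lexicon tokens.
LEXICON = {"the", "students", "wrote", "essays", "about", "their", "hometown"}

def stub_llm_classify(token: str) -> str:
    """Stand-in for an LLM API call that labels an out-of-lexicon token."""
    if token[0].isupper():
        return "proper_noun"      # e.g., place names should not be flagged as errors
    return "spelling_error"       # default label for unknown lowercase tokens

def annotate(text: str) -> str:
    """Wrap out-of-lexicon tokens in structured markup."""
    out = []
    # Split into alternating word / non-word runs so whitespace is preserved.
    for token in re.findall(r"\w+|\W+", text):
        if token.strip() and token.isalpha() and token.lower() not in LEXICON:
            label = stub_llm_classify(token)
            out.append(f'<tok type="{label}">{token}</tok>')
        else:
            out.append(token)
    return "".join(out)

print(annotate("The students writed essays about Osaka"))
# → The students <tok type="spelling_error">writed</tok> essays about <tok type="proper_noun">Osaka</tok>
```

In practice, the classification step would send each flagged token with its sentence context to the LLM, which returns a label that the script converts into markup.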
This presentation showcases the workflow in action. Using a subset of texts from our existing learner corpus, along with a cleaned and annotated gold-standard version of those texts, we will illustrate how LLMs facilitate the preprocessing and structuring of learner corpora. The results suggest that this method enhances efficiency and consistency, allowing researchers to focus on linguistic analysis rather than data cleaning.
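One simple way to compare LLM-cleaned output against a gold-standard version, as the evaluation above implies, is token-level agreement. This is a hedged sketch, not the measure used in the study; the function name and scoring choice are illustrative assumptions.

```python
def token_agreement(cleaned: str, gold: str) -> float:
    """Fraction of aligned whitespace tokens that match the gold standard.

    A deliberately simple proxy metric: real evaluations would also align
    markup spans and handle insertions/deletions rather than zipping tokens.
    """
    c, g = cleaned.split(), gold.split()
    matches = sum(a == b for a, b in zip(c, g))
    return matches / max(len(c), len(g))

gold = 'The students <tok type="spelling_error">writed</tok> essays'
print(token_agreement(gold, gold))  # → 1.0 for an exact match
```

Reporting such a score per text makes it easy to see where the automatic cleaning diverges from the hand-annotated version.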