Large language model applications for evaluation: Opportunities and ethical implications

Head, Cari Beth / Jasper, Paul / McConnachie, Matthew / Raftree, Linda / Higdon, Grace

Large language models (LLMs) are a type of generative artificial intelligence (AI) designed to produce text‐based content. LLMs use deep learning techniques and massive data sets to understand, summarize, generate, and predict new text. LLMs caught the public eye in early 2023, shortly after ChatGPT (the first consumer‐facing LLM) was released in late November 2022. LLM technologies are driven by recent advances in deep‐learning AI techniques, where language models are trained on extremely large volumes of text from the internet and then re‐used for downstream tasks with limited fine‐tuning required. They offer exciting opportunities for evaluators to automate and accelerate time‐consuming tasks involving text analytics and text generation. We estimate that over two‐thirds of evaluation tasks will be affected by LLMs in the next 5 years. Use‐case examples include summarizing text data, extracting key information from text, analyzing and classifying text content, writing text, and translation. Despite these advances, the technologies pose significant challenges and risks. Because LLM technologies are generally trained on text from the internet, they tend to perpetuate biases (racism, sexism, ethnocentrism, and more) and the exclusion of non‐majority languages. Current tools like ChatGPT have not been specifically developed for monitoring, evaluation, research, and learning (MERL) purposes, which may limit their accuracy and usefulness for evaluation. In addition, technical limitations and challenges with bias can lead to real‐world harm. To overcome these technical challenges and ethical risks, the evaluation community will need to work collaboratively with the data science community to co‐develop tools and processes and to ensure the application of quality and ethical standards.