Can we trust large language models to summarize food policy research papers and generate research briefs?
Abstract
Generative large language models (LLMs), while widely accessible and capable of simulating policy recommendations, pose challenges in assessing the accuracy of their output. Users, including policy analysts and decision-makers, bear the responsibility of evaluating the outputs of these models. A significant limitation of LLMs is their potential to overlook critical, context-specific factors. For example, in formulating food policies, it is vital to consider regional climate and environmental variables that influence water and resource availability. Yet because LLMs rely on word-sequence probabilities learned from their training data, they might propose similar policies for distinct regions. Despite these limitations, LLMs offer considerable advantages for rapid policy analysis, particularly when resources are constrained: they serve as quick, accessible, and cost-effective tools for policy research and development, requiring minimal training and infrastructure. In our study, we assessed the efficacy of LLMs in generating policy briefs by feeding an IFPRI discussion paper into three different LLM-based approaches: a standard chatbot without extra data, a Retrieval Augmented Generation (RAG) model integrating semantic search with an LLM, and a custom-developed Brief Generator designed to create policy summaries from AI-analyzed paper structures. Our findings revealed that none of the LLM-generated briefs fully captured the original paper's intent, underscoring the need for further research. Future investigations should gather more empirical data across diverse text types and volumes to better understand these outcomes.
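To make the second approach concrete, the following is a minimal sketch of a RAG-style brief-generation pipeline: the discussion paper is split into chunks, the chunks most relevant to a brief-writing query are retrieved by semantic similarity, and only those chunks are placed into the LLM prompt. The chunk size, the all-MiniLM-L6-v2 embedding model, and the call_llm helper are illustrative assumptions, not the implementation evaluated in the paper.

```python
# Illustrative RAG sketch: chunk the paper, retrieve relevant chunks by
# semantic similarity, and prompt an LLM with only the retrieved context.
import numpy as np
from sentence_transformers import SentenceTransformer

EMBEDDER = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def chunk_text(text: str, words_per_chunk: int = 200) -> list[str]:
    """Split the discussion paper into fixed-size word chunks (assumed size)."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]


def retrieve(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Return the chunks most semantically similar to the query."""
    chunk_vecs = EMBEDDER.encode(chunks, normalize_embeddings=True)
    query_vec = EMBEDDER.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec  # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]


def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a chat-completion call to any LLM provider."""
    raise NotImplementedError("Wire this to the LLM API of your choice.")


def generate_brief(paper_text: str) -> str:
    """Assemble a brief-writing prompt from retrieved context and query the LLM."""
    query = "key findings and policy recommendations of this paper"
    context = "\n\n".join(retrieve(query, chunk_text(paper_text)))
    prompt = ("Using only the excerpts below, write a one-page policy brief "
              "summarizing the paper's findings and recommendations.\n\n" + context)
    return call_llm(prompt)
```

By contrast, the standard chatbot condition corresponds to sending the full paper (or a prompt about it) with no retrieval step, and the Brief Generator condition adds a structure-aware layer that maps sections of the paper to sections of the brief.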