
Publication

Benchmarking Generative AI Performance Requires a Holistic Approach

Ajay Dholakia; David Ellison; Miro Hodak; Debojyoti Dutta; Carsten Binnig
In: Raghunath Nambiar; Meikel Poess (Eds.). Performance Evaluation and Benchmarking: 15th TPC Technology Conference, TPCTC 2023, Vancouver, BC, Canada, August 28 - September 1, 2023, Revised Selected Papers. Technology Conference on Performance Evaluation and Benchmarking (TPCTC), pages 34-43, Lecture Notes in Computer Science, Vol. 14247, Springer, 2023.

Abstract

The recent focus in AI on Large Language Models (LLMs) has brought the topic of trustworthy AI to the forefront. Along with the excitement of human-level performance, the Generative AI systems enabled by LLMs have raised many concerns about factual accuracy, bias along various dimensions, authenticity and quality of generated output. Ultimately, these concerns directly affect the user's trust in the AI systems that they interact with. The AI research community has come up with a variety of metrics for perplexity, similarity, bias, and accuracy that attempt to provide an objective comparison between different AI systems. However, these are difficult concepts to encapsulate in metrics that are easy to compute. Furthermore, AI systems are advancing to multimodal foundation models that further make creating simple metrics a challenging task. This paper describes the recent trends in measuring the performance of foundation models like LLMs and multimodal models. The need for creating metrics and ultimately benchmarks that enable meaningful comparisons between different Generative AI system designs and implementations is getting stronger. The paper concludes with a discussion of future trends aimed at increasing trust in Generative AI systems.
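As a point of reference for one of the metrics named in the abstract, perplexity is conventionally computed as the exponential of the negative mean log-likelihood of a token sequence under the model. The minimal Python sketch below illustrates that standard definition; the function name and example inputs are illustrative and not taken from the paper.

    import math
    from typing import Sequence

    def perplexity(token_log_probs: Sequence[float]) -> float:
        """Perplexity from per-token natural-log probabilities:
        exp of the negative mean log-likelihood over the sequence."""
        if not token_log_probs:
            raise ValueError("need at least one token log-probability")
        avg_nll = -sum(token_log_probs) / len(token_log_probs)
        return math.exp(avg_nll)

    # Example: three tokens with log-probabilities from a hypothetical LLM
    print(perplexity([-0.5, -1.2, -0.3]))  # ~1.95 (lower means the model is less "surprised")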
