
How to Perform LLM Evals

Evaluating improvements to LLM pipelines and workflows is key to improving our AI systems. With limited time and resources, the blocker is often overthinking them. In this article, I'll walk through a couple of simple evals for benchmarking your improvements, based on work I've done previously.

The Curse of Overthinking

I've learnt that evals really can be as simple as an assert statement. The goal is to run quick "smoke tests" that confirm your pipeline is working as expected, whilst accounting for stochasticity. From there, complexity is earned incrementally by structuring your evals around your most common and important failure modes.

If these failure modes aren't immediately apparent to you yet, then hunt for them first.
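As a concrete illustration, here's a minimal sketch of that kind of assert-style smoke test. The `transcribe_order` pipeline, the fixture path, and the expected field are all hypothetical stand-ins for your own setup:

```python
# A minimal sketch of an assert-style smoke test, assuming a hypothetical
# `transcribe_order(audio_path)` pipeline that returns a dict of extracted fields.
# Swap in your own entry point and a short fixture recording.

def test_order_smoke(transcribe_order, n_runs: int = 3) -> None:
    # Run the pipeline a few times to account for stochasticity:
    # require a majority of runs to pass rather than all of them.
    results = [transcribe_order("fixtures/latte_order.wav") for _ in range(n_runs)]
    passes = sum("latte" in r.get("items", []) for r in results)
    assert passes >= 2, f"only {passes}/{n_runs} runs extracted 'latte'"
```

The majority-vote threshold is one simple way to tolerate stochastic outputs without letting genuine regressions slide past the test.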

Working with Voice AI is Easy, Try This

Getting started with Voice AI can be easy. It's important to start simple so we progressively build an understanding of what happens under the hood, laying a solid foundation for more complex applications. Starting simple and adding complexity slowly also helps us compare and appreciate the delta between demo-land and production deployments.

Let's explore the following:

  1. Simple Speech-to-Text (STT)
  2. STT + Structured Outputs (e.g. JSON, Pydantic)
  3. Evals for Audio

For starters, let's keep it simple with a basic speech-to-text provider.
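As a sketch, this is roughly what a first transcription call might look like with OpenAI's hosted Whisper model (any STT provider works; the audio file name here is just a placeholder):

```python
# Minimal speech-to-text sketch using OpenAI's Whisper endpoint.
# Assumes OPENAI_API_KEY is set and "sample.wav" exists locally.
from openai import OpenAI

client = OpenAI()

with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcription.text)
```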