Ask an Analyst: Christina McAllister on the Challenges of Getting Gen AI Summaries to Agents
When it comes to Gen AI after-call summaries, there’s a trade-off between latency and accuracy that must be understood and managed.
July 24, 2024
Editor’s Note: No Jitter asked Christina McAllister, a senior analyst at Forrester, for her thoughts on some of the challenges associated with delivering post-call, generative AI-powered summaries to contact center agents. Her comments also provided insight on the recent Amazon Connect Contact Lens news.
So, according to McAllister, here are some of the challenges. The biggest dependencies contributing to latency (and therefore to the availability of the post-call summary) are:
Transcript availability / accuracy.
Large language model (LLM) size/scope.
Latency versus Accuracy
Whenever you’re looking at AI in a voice/speech use-case, it’s critical to remember that before you can do anything with an LLM (or even apply traditional NLP or predictive models), you must first convert the audio into text. Accuracy is always a challenge for audio transcription, but this is an especially important call-out because transcribing in real-time is tricky – there is constant tension between accuracy and latency. Speech-to-text (STT) models perform best when they have as much of the conversational context as possible to interpret and transcribe accurately.
For true real-time use-cases where the goal is to trigger certain actions off the transcript as the call is progressing – for example, providing scripting or a suggestion to the agent – best-in-class solutions support ultra-low latency (in the hundreds of milliseconds) with highly optimized transcription models that continuously process audio segments in real-time as the audio is streamed. But this technique requires a trade-off: real-time transcription, while enabling the system to deliver transcribed text with minimal delay, limits the amount of contextual information the system can work with at any given time. The shorter the buffer and the more you push for low latency, the more you sacrifice accuracy.
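To make that buffer-versus-latency tension concrete, here is a minimal Python sketch of the buffering loop a streaming transcriber runs. The transcribe_segment stub is a hypothetical stand-in for a real STT model, and the frame and buffer sizes are illustrative assumptions, not vendor specifications.

```python
# Hypothetical stand-in for a real streaming STT model; a production
# system would call a vendor's streaming recognition API here.
def transcribe_segment(audio_buffer: bytes) -> str:
    return f"<text for {len(audio_buffer)} bytes of audio>"

def stream_transcripts(audio_chunks, buffer_ms: int = 300, chunk_ms: int = 20):
    """Buffer streamed audio frames and emit partial transcripts.

    buffer_ms is the knob described above: a shorter buffer lowers
    latency but gives the model less context (hurting accuracy);
    a longer buffer does the reverse.
    """
    chunks_per_buffer = max(1, buffer_ms // chunk_ms)
    buffer = b""
    for i, chunk in enumerate(audio_chunks, start=1):
        buffer += chunk
        if i % chunks_per_buffer == 0:
            yield transcribe_segment(buffer)
            buffer = b""  # drop accumulated context to stay low-latency
    if buffer:
        yield transcribe_segment(buffer)  # flush the trailing partial buffer

# Simulated 20 ms audio frames from a streamed call (1 second total).
frames = [b"\x00" * 320 for _ in range(50)]
for partial in stream_transcripts(frames, buffer_ms=300):
    print(partial)
```

Shrinking buffer_ms makes partial transcripts arrive sooner but hands the model smaller, more fragmented audio segments – exactly the accuracy sacrifice described above.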
Beyond the transcript, any kind of workflow or processing step (e.g., hitting any kind of API, maybe fetching data about the specific customer to tailor a real-time recommendation) adds to the turnaround time of whatever output you are expecting. Additionally, bringing Gen AI into the mix introduces another latency-contributing step – LLMs take time to process the text and generate the content.
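As a rough illustration of how those steps stack up, the arithmetic below sums assumed per-stage latencies into a turnaround budget. Every number here is invented for illustration, not a vendor benchmark.

```python
# Rough end-to-end latency budget for one real-time recommendation.
# All figures are illustrative assumptions.
stage_latency_ms = {
    "streaming_transcription": 300,  # buffered STT segment
    "customer_data_fetch": 150,      # back-end API call
    "llm_generation": 900,           # Gen AI output
}

total_ms = sum(stage_latency_ms.values())
print(f"Total turnaround: {total_ms} ms")  # 1350 ms in this example
for stage, ms in stage_latency_ms.items():
    print(f"  {stage}: {ms} ms ({ms / total_ms:.0%} of budget)")
```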
So, to make things move faster, it’s important to look at each step and decide which trade-offs we are comfortable making (illustrated in the sketch after this list):
Push transcription speed: This sacrifices transcript accuracy, which in turn sacrifices downstream utility (garbage in, garbage out).
Limit the workflow steps: Maybe the solution doesn’t need to access any back-end data in real-time, which will limit personalization, relevance, etc.
Use a smaller LLM with fewer parameters, or streamline the desired output: Some models are much more suitable for real-time processing, and a shorter summary takes less time to generate.
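Here is a minimal sketch of that last lever: two hypothetical request configurations, one tuned for speed (smaller model, shorter summary) and one for fidelity. The model names, payload fields, and build_request helper are assumptions for illustration, not any specific vendor’s API.

```python
# Two hypothetical summary-request configurations illustrating the
# model-size and output-length trade-offs.
LOW_LATENCY_CONFIG = {
    "model": "summarizer-small",  # fewer parameters, faster inference
    "max_output_tokens": 120,     # a shorter summary finishes sooner
    "prompt": "Summarize this call in three short bullet points:\n{transcript}",
}

HIGH_FIDELITY_CONFIG = {
    "model": "summarizer-large",  # better quality, slower inference
    "max_output_tokens": 400,     # room for a richer narrative summary
    "prompt": "Write a detailed summary of this call, including intent, "
              "resolution, and follow-up actions:\n{transcript}",
}

def build_request(config: dict, transcript: str) -> dict:
    """Fill the prompt template and return a payload for an LLM API call."""
    return {
        "model": config["model"],
        "max_tokens": config["max_output_tokens"],
        "input": config["prompt"].format(transcript=transcript),
    }

print(build_request(LOW_LATENCY_CONFIG, "Agent: Hello ..."))
```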
All of this is background context that is worth understanding because it colors how vendors have approached deploying Gen AI summaries.
Acceptable Latency Varies by Use Case
Gen AI summaries can be deployed as standalone solutions – they are often the lowest hanging fruit or “first step” project that enterprises pursue with Gen AI. But they are often paired alongside other related solutions like real-time agent assist/copilot, real-time conversation analytics, etc., which is where the real-time STT comes into play.
If the enterprise wants their agent to be able to review and/or edit the summary, it must be available to the agent immediately at the close of the call. A couple of seconds of latency is expected for the full transcript to be ready and subsequently processed by the LLM, but faster is always better in the contact center.
If the vendor has not deployed STT for streamed, real-time audio – which is common if the vendor previously supported only post-call analytics or similar – then there will be a greater delay. This often challenges solution adoption, as every precious second eats into the projected time/cost savings of the Gen AI summary solution.
Some contact center as a service (CCaaS) vendors have a real-time streaming API (e.g., Five9 announced theirs in 2020) which makes integration relatively easy. This is harder – but not impossible – when integrating with environments that do not have this kind of streaming API (some vendors, like Cresta, have developed integration techniques that support more challenging environments).
Acceptable latency really is dependent on the situation: for true real-time use-cases where recommendations are made to agents as the call happens, even a few seconds is too long (which is why real-time agent assist/copilot is so hard to get right at scale). But just looking at the call summary on its own, I advise clients that a few seconds is acceptable if it results in a more accurate summary.
Some companies choose to process summaries post-call and submit to CRM (or elsewhere), fully lifting that burden from agents (versus having them review/edit the summary). As with anything, there are pros and cons to this decision.
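For that no-review path, a minimal sketch of the post-call flow might look like the following. The CRM endpoint, payload fields, and generate_summary stub are all hypothetical placeholders standing in for a real summarization service and CRM integration.

```python
import requests  # third-party HTTP client: pip install requests

# Placeholder URL; a real deployment would target the CRM's notes API.
CRM_NOTES_ENDPOINT = "https://crm.example.com/api/calls/{call_id}/notes"

def generate_summary(transcript: str) -> str:
    # Placeholder for the Gen AI summarization step described above.
    return "Customer called about billing; issue resolved; no follow-up."

def submit_summary_post_call(call_id: str, transcript: str) -> None:
    """Generate the summary after the call ends and write it straight
    to the CRM, with no agent review step in between."""
    summary = generate_summary(transcript)
    response = requests.post(
        CRM_NOTES_ENDPOINT.format(call_id=call_id),
        json={"summary": summary, "source": "gen-ai", "reviewed_by_agent": False},
        timeout=10,
    )
    response.raise_for_status()  # surface CRM write failures for retry

# Typically invoked by a background worker once the full transcript is ready.
submit_summary_post_call("call-123", "Agent: Hello ...")
```

Because this runs after the call, latency pressure largely disappears – the trade-off shifts to summary accuracy, since no agent verifies the output before it lands in the CRM.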
Final Thoughts: It’s All About Trade-offs
After-call work (ACW) time savings are maximized if the agent review step is skipped, but there may be errors in the summary. For many companies, the agent-completed call notes are not particularly accurate or consistent, so this may be an acceptable trade.
However, if agent-led call notes include things that are not part of the transcript (e.g., steps an agent takes in other systems, or other details that were not verbalized), then those will absolutely be missing from the summary unless agents are able to submit additional context at the close of the call. So, in short, there are lots of decisions and trade-offs to be made.