Evaluating Chatbot Systems


A chatbot system is a computer program that supports spoken or text-based interactions with humans. Such systems can broadly be divided into task-oriented and non-task-oriented chatbots. In a task-oriented dialogue, the human user and the system interact to accomplish a specific task. In a non-task-oriented dialogue, the user and the system engage in general conversation. This article focuses mainly on task-oriented chatbot systems, as they are closest to the scenario we have at my workplace.

Importance of Evaluation

This article tries to define some guidelines for carrying out a basic evaluation of the chatbot under test.

The resulting checklist is important for several reasons:
· To determine whether the system performs as expected.
· To determine whether the system meets the users' needs: whether it understands their utterances, helps them resolve their queries efficiently and accurately, and gives them a satisfactory experience.
· To establish whether the aims of the business have been met, for example minimizing cost, and to understand whether the system improves on the various evaluation metrics over time.

One of the main goals of this approach is to have an evaluation procedure that is repeatable and that strongly informs the human judgement involved in approving a vendor system for onboarding.

Main Points to Check during the evaluation phase

Let’s look at a few capture points that will help us gather insights from bot conversations and thereby evaluate the bot's performance and ease of integration.

Data Capture Points to Gather Insights

We should make sure that we capture every data point on what our customers said, when they said it, and where they said it. This data mainly includes the session ID, timestamp, sentiment score, language, keywords (topic mining), intents (the matched intent, its confidence, and whether the turn was a bot break or a fallback), and platform (web, mobile, or whichever channel was used). In a nutshell, we want to know whether this information is already captured in the vendor portal and whether it can be pulled out of the portal for use by other systems in our ecosystem. We should keep all chat information in our data warehouse so that we can later derive insights from it or build an omni-channel experience. For this to happen, we might need to integrate with the vendor system using either a pub/sub channel or a routine pull mechanism.
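As a rough illustration of the capture points above, here is a minimal sketch of how one chat turn could be represented before it is loaded into a warehouse. The `ChatEvent` class, its field names, and `to_warehouse_row` are assumptions made for this example; a real vendor export will have its own schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import List
import json


@dataclass
class ChatEvent:
    """One user turn, using a hypothetical schema (not a vendor format)."""
    session_id: str
    timestamp: datetime
    platform: str             # channel, e.g. "web" or "mobile"
    language: str
    utterance: str
    intent: str               # matched intent name
    intent_confidence: float  # NLU confidence for the matched intent
    is_fallback: bool         # True if the bot fell back or "broke"
    sentiment_score: float
    keywords: List[str]       # output of topic mining


def to_warehouse_row(event: ChatEvent) -> str:
    """Serialize the event as a JSON string with an ISO 8601 timestamp."""
    row = asdict(event)
    row["timestamp"] = event.timestamp.isoformat()
    return json.dumps(row, sort_keys=True)


example = ChatEvent(
    session_id="s1",
    timestamp=datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc),
    platform="web",
    language="en",
    utterance="I want to upgrade my package",
    intent="upgrade_package",
    intent_confidence=0.91,
    is_fallback=False,
    sentiment_score=0.2,
    keywords=["upgrade", "package"],
)
print(to_warehouse_row(example))
```

The same row could be published on a pub/sub channel or written by a scheduled pull job, depending on which integration the vendor supports.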

Collecting some form of Customer Rating Metrics

We should know how our customers feel about the chatbot they are interacting with. This can be achieved with a dashboard that gives a live view of how well the chatbot is handling conversations and how customers feel about them. We have to keep in mind that these scores can sometimes be a little extreme, but we still need something like this at the command center to understand our customers. Typical examples are NPS (Net Promoter Score, how loyal the customer is), CSAT (Customer Satisfaction Score, how happy the customer is with a service), and CES (Customer Effort Score, how easy the service is to use). Some customers may have their own scores for the same purpose, which is also acceptable, as these can be equally good metrics to watch live to learn how users experience the bot. So, we need to know whether the vendor provides this kind of interface and view for the channels where the bot is deployed, and how quickly the scores are updated.
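To make the three scores concrete, here is a minimal sketch of how they are commonly computed from survey answers. It assumes the usual conventions (NPS on a 0 to 10 scale, CSAT as the share of 4s and 5s on a 1 to 5 scale, CES as a plain average); the function names are made up for this example and a vendor dashboard may use different scales.

```python
def nps(scores):
    """Net Promoter Score from 0-10 answers:
    percentage of promoters (9-10) minus percentage of detractors (0-6)."""
    if not scores:
        raise ValueError("no survey answers")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def csat(scores, satisfied_from=4):
    """Percentage of 1-5 ratings that count as satisfied (4 or 5 by default)."""
    if not scores:
        raise ValueError("no survey answers")
    satisfied = sum(1 for s in scores if s >= satisfied_from)
    return 100.0 * satisfied / len(scores)


def ces(scores):
    """Customer Effort Score as the average of the effort ratings."""
    if not scores:
        raise ValueError("no survey answers")
    return sum(scores) / len(scores)


print(nps([10, 9, 8, 7, 6, 3, 10, 9, 2, 10]))
print(csat([5, 4, 3, 2, 5]))
print(ces([1, 2, 3]))
```

Because NPS subtracts detractors from promoters, a handful of very low answers can swing it sharply, which is one reason these scores can look extreme on a live view.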

Monitoring the Chat Sessions and Funnel Metrics

We should know how to measure the chatbot's performance on a specific task, using measures such as task completion, dialogue duration, and user satisfaction. In this context, a funnel is the sequence of steps the user has to go through before reaching the end of the conversion. For example, if a customer starts a chat session to upgrade a package via the bot, the chat funnel consists of all the steps they have to take to upgrade their subscription package, ending with a confirmation of the upgrade or of its initiation. The chat funnel is important because it helps us find bottlenecks in the conversation and resolve them quickly. We also need to know how each session ended, and this should be added to the session-related metrics. We would also like to see the most common customer journeys through the chatbot, so that improving those journeys helps many users at once. So, we need to make sure that the vendor provides some framework for this monitoring.
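The funnel idea can be sketched in a few lines, assuming we can export, for each session, the list of steps the user reached. The step names in `FUNNEL` and both function names are hypothetical and only mirror the package upgrade example above.

```python
from collections import Counter

# Hypothetical steps of the "upgrade package" journey, in order.
FUNNEL = ["start", "select_package", "confirm_details", "upgrade_confirmed"]


def funnel_report(sessions, steps=FUNNEL):
    """sessions maps session id -> list of steps the user reached.
    Returns (step, sessions_reached, conversion_from_previous_step) tuples."""
    reached = Counter()
    for visited in sessions.values():
        seen = set(visited)
        for step in steps:
            if step not in seen:
                break  # the user dropped out before this step
            reached[step] += 1
    report = []
    previous = None
    for step in steps:
        count = reached[step]
        if previous is None:
            rate = None  # the first step has no previous step
        else:
            rate = count / previous if previous else 0.0
        report.append((step, count, rate))
        previous = count
    return report


def bottleneck(report):
    """Return the step with the lowest conversion from its previous step."""
    candidates = [(rate, step) for step, _, rate in report if rate is not None]
    return min(candidates)[1]


sessions = {
    "a": ["start", "select_package", "confirm_details", "upgrade_confirmed"],
    "b": ["start", "select_package"],
    "c": ["start", "select_package", "confirm_details"],
    "d": ["start"],
}
report = funnel_report(sessions)
print(report)
print(bottleneck(report))
```

The step with the lowest conversion is a reasonable first place to look for a bottleneck, although a low number can also simply reflect users who changed their mind.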

Metrics which can help us Monitor the Chat Sessions and Funnel

These are some of the metrics we can collect to monitor and improve the chatbot experience.
Total usage of the chatbot channel, with the following metrics at task level:
— Number of Successful Tasks
— Total Duration spent on each of the Tasks
— Number of System turns
— Number of User turns
— Correction Rate, the number of utterances repaired or corrected by the system
Percentage of users whose utterances match one of the intents we have modelled.
Percentage of users whose utterances did not match any intent; these can be things the chatbot is not trained to answer.
Completion Rate: how many customers completed the full journey.
Drop Off Rate or Bounce Rate: how many users dropped out without completing the full journey.
Drop Off Place: where in the journey users stopped interacting with the chatbot. This can also be a point where the user asked to chat with a real agent midway through the journey.
Reuse Rate: how many customers come back and use the chatbot again. This can be used later for targeted campaigns.
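A minimal sketch of how the completion, drop-off, and reuse figures above could be derived from exported session records. The record keys (`user_id`, `completed`, `last_step`) and the function name are assumptions for this example only.

```python
from collections import Counter


def session_metrics(sessions):
    """sessions is a list of dicts with the hypothetical keys
    'user_id', 'completed' (bool) and 'last_step' (str)."""
    total = len(sessions)
    if total == 0:
        raise ValueError("no sessions")
    completed = sum(1 for s in sessions if s["completed"])
    # Where did users stop when they did not finish the journey?
    drop_places = Counter(s["last_step"] for s in sessions if not s["completed"])
    # A user counts as returning if they have more than one session.
    sessions_per_user = Counter(s["user_id"] for s in sessions)
    returning = sum(1 for n in sessions_per_user.values() if n > 1)
    return {
        "completion_rate": completed / total,
        "drop_off_rate": (total - completed) / total,
        "drop_off_places": drop_places.most_common(),
        "reuse_rate": returning / len(sessions_per_user),
    }


records = [
    {"user_id": "u1", "completed": True, "last_step": "upgrade_confirmed"},
    {"user_id": "u1", "completed": False, "last_step": "confirm_details"},
    {"user_id": "u2", "completed": False, "last_step": "select_package"},
    {"user_id": "u3", "completed": False, "last_step": "confirm_details"},
]
print(session_metrics(records))
```

Note that completion and drop-off are counted per session here, while reuse is counted per user; whether that split suits a given dashboard is a design choice.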

Some other metrics that help us understand the chatbot system itself are shown below:
Endpoint Health: make sure the bot does not crash or behave erratically even if one of the integrated services is down or cannot respond.
Latency of interactions with the bot, such as the time to respond to a message and to retrieve information and forms. How much latency does the customer experience?
Number of timeouts that occurred in the tasks.
Discovery: how the customer found the virtual agent, and how easy it is to find and start using it.
Errors: all errors in the platform need to be captured in a logging system.
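For the latency and timeout items above, a small summary like the following may be enough to start with. This is a sketch under assumptions: response times are available in milliseconds, a timeout is any response slower than a chosen threshold (5000 ms here is an arbitrary example value), and the 95th percentile uses the simple nearest-rank method.

```python
import statistics


def latency_summary(latencies_ms, timeout_ms=5000):
    """Summarize bot response times given in milliseconds."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest-rank 95th percentile: ceil(0.95 * n), in integer arithmetic.
    rank = (95 * n + 99) // 100
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
        "timeouts": sum(1 for x in ordered if x > timeout_ms),
    }


samples = [100] * 18 + [6000, 7000]
print(latency_summary(samples))
```

Looking at a high percentile next to the median helps because a few very slow responses are easy to miss in an average but are exactly what customers notice.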

Some other metrics for checking how well the chatbot minimizes cost and ensures user satisfaction are:
Time-to-task is the measure of the amount of time that it takes to start engaging in a task after any instructions and other messages provided by the system.
Correct Transfer Rate measures whether the customers are correctly redirected to the appropriate human chat agent in case of a fallback or on a request by the user.
Containment Rate measures the percentage of chats not transferred to human agents and that are handled by the system. This metric is useful in determining how successfully the system has been able to reduce the costs of customer care through automation.
Abandonment Rate is often seen as the opposite of the containment rate. It measures the percentage of users who hang up or disconnect the chat session before completing a task with the automated system. This is the same as the Drop Off Rate mentioned above.
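The rates above can be sketched as follows, assuming each chat record carries an outcome and, for transferred chats, the queue it was routed to and the queue it should have gone to. The keys `outcome`, `routed_queue`, and `expected_queue` are hypothetical. The sketch follows the definition given above and counts every chat that was not transferred as contained; some teams prefer to exclude abandoned chats from containment.

```python
def cost_metrics(chats):
    """chats is a list of dicts with a hypothetical 'outcome' key whose value is
    'completed', 'transferred' or 'abandoned'. Transferred chats also carry
    'routed_queue' and 'expected_queue'."""
    total = len(chats)
    if total == 0:
        raise ValueError("no chats")
    transferred = [c for c in chats if c["outcome"] == "transferred"]
    abandoned = sum(1 for c in chats if c["outcome"] == "abandoned")
    correct = sum(
        1 for c in transferred if c["routed_queue"] == c["expected_queue"]
    )
    return {
        # Share of chats that never reached a human agent.
        "containment_rate": (total - len(transferred)) / total,
        "abandonment_rate": abandoned / total,
        # None when there were no transfers to judge.
        "correct_transfer_rate": correct / len(transferred) if transferred else None,
    }


chats = (
    [{"outcome": "completed"}] * 6
    + [{"outcome": "abandoned"}] * 2
    + [
        {"outcome": "transferred", "routed_queue": "billing", "expected_queue": "billing"},
        {"outcome": "transferred", "routed_queue": "sales", "expected_queue": "billing"},
    ]
)
print(cost_metrics(chats))
```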

Metrics which can help us Monitor the Bot Health and NLU Quality

These metrics focus on the chatbot building process and aim to improve the quality of the underlying NLU models. Normally this is done by the vendor, but we should use a test/validation data set drawn from our previous live data for basic testing, and for regression testing every time we deploy a new model to production. We mainly want to focus on the intent matching ability of the deployed chatbot. Intent classification (matching) and entity extraction can be evaluated using a confusion matrix, which shows how many items were correctly identified and how many were mistaken for other items. From the confusion matrix, the Precision, Recall, and F1 measures can be calculated. Some of the common metrics we can use for this evaluation are shown in the confusion matrix diagram below:

· True Positive Rate (recall) is the share of requests that truly belong to an intent and that the bot matched to that intent. A low value suggests the intent is too narrowly defined, so requests are being missed.

· False Positive Rate is the false alarm ratio: of the requests that do not belong to an intent, the share that the bot misunderstood as that intent and answered wrongly.
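As a worked example of these measures, the sketch below builds a confusion matrix from true and predicted intent labels and derives precision, recall (true positive rate), F1, and false positive rate per intent. It uses only the standard library; the intent names and the function name are made up for illustration, and a real test set would come from previous live data.

```python
from collections import Counter


def per_intent_scores(y_true, y_pred):
    """Per-intent precision, recall (TPR), F1 and false positive rate."""
    # Confusion matrix stored as (true_label, predicted_label) -> count.
    pairs = Counter(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    total = len(y_true)
    scores = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        tn = total - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        fpr = fp / (fp + tn) if fp + tn else 0.0
        scores[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "false_positive_rate": fpr,
        }
    return scores


y_true = ["upgrade", "upgrade", "upgrade", "billing", "billing", "cancel"]
y_pred = ["upgrade", "upgrade", "billing", "billing", "upgrade", "cancel"]
for intent, s in per_intent_scores(y_true, y_pred).items():
    print(intent, s)
```

Running the same labelled set against every new model version gives a simple regression check: if recall for an intent drops after a deployment, that intent deserves a closer look.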

So, we need to make sure the vendor portal has provisions for this kind of automated evaluation, which can also be used for canary testing if required.


Evaluating dialogue systems is an important and complex task with many different issues to consider, such as which metrics are required for the chatbot system. These metrics are important for evaluating the quality of the conversation and the satisfaction of the users. Some of these metrics and interfaces are also important for understanding the ease of integration and the level of support required in our ecosystem.


Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots by Michael McTear

The Definitive Guide to Conversational AI with Dialogflow and Google Cloud by Lee Boonstra




Software Engineer | Java, Python, Linux, Unix | AI, DVB | 💻 | Azure | PyTorch | Hackathons | Innovations | Highly Inquisitive and Curious

Navaneeth Sen
