Automate the voice channel with natural language understanding

Photo by nappy from Pexels



Most businesses handle customer or supplier interactions via phone and text every minute of every day. Many use Interactive Voice Response (IVR) technology to route the caller to the right employee for the conversation, or even to fully automate the interaction.

If you’ve used a phone, you have almost certainly experienced a bad IVR. In large organisations, the amount of information that must be collected to find the right team, and the team member with the right skills, can be extensive. For a payment transaction, the client number together with card numbers might easily total more than thirty digits. And if you’d like to schedule an appointment with your medical practitioner while driving, it can feel like an impossible task.

Yet arranging appointments, and far more complicated matters, is straightforward if you can use language. Beyond what are called “closed grammars”, though, that is not something most IVRs support. A closed grammar is one where the system expects the caller to say one of a limited set of options. “Yes, yeah, no, nope” would be a simple example of a closed grammar for a question with a yes / no response. Although the game of “twenty questions” shows that you can learn a lot with just yes / no answers, it still doesn’t match the full power of natural language.
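To make the idea of a closed grammar concrete, here is a minimal Python sketch. The names `YES_NO_GRAMMAR` and `match_closed_grammar` are hypothetical, chosen for illustration; a real IVR platform would define its grammars in its own format.

```python
# Hypothetical closed grammar for a yes/no question: every accepted
# utterance is listed up front and mapped to a canonical answer.
YES_NO_GRAMMAR = {
    "yes": "yes",
    "yeah": "yes",
    "no": "no",
    "nope": "no",
}


def match_closed_grammar(utterance, grammar=YES_NO_GRAMMAR):
    """Return the canonical answer, or None if the utterance is out of grammar."""
    # Normalize case, surrounding whitespace and trailing punctuation.
    normalized = utterance.strip().lower().rstrip(".!?")
    return grammar.get(normalized)


print(match_closed_grammar("Yeah"))
print(match_closed_grammar("I'd like to book an appointment"))
```

The first call matches the grammar; the second falls outside it and returns `None`, which is exactly the kind of open request a closed grammar cannot handle.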

Advances in computer processing power and deep learning mean that systems that can transcribe speech in a speaker-independent way, and parse the resulting text to extract intent and content, are now practical. This technology can be used to revolutionize the much-maligned IVR and create intelligent virtual agents that allow enterprises to build better conversations with customers.

Leveraging natural language to solve real business problems has typically required an intensive process with a high up-front cost. When the technology first became commercially available, an implementation would typically follow these phases:

  1. Identify an interaction type, for instance, inbound call steering, or appointment setting.

  2. Define a goal, e.g. IVR containment rate > 95%

  3. Code an IVR to collect and process user input (and allow a live agent to listen and choose outcome “wizard of oz” style).

  4. Gather 10,000 to 100,000 customer interactions using this “wizard of oz” approach. Customer input is solicited in the same way as in the intended interaction, for instance, for a call steering application, “Can you tell me in a few words what your call is about today?” The response is recorded, and the live agent chooses an outcome (also recorded).

  5. Analysts transcribe and further annotate the recordings to create a training set.

  6. A statistical language model is trained to count the words typically heard from callers, as well as the sequences of words (N-grams).

  7. A statistical semantic model is trained on the transcribed words to determine intent and content.

  8. The system can be tested against a portion of the data withheld from training. If performance is adequate, the system can be deployed. If it is not, then more data must be gathered over some weeks or months of operation, or the project abandoned.

  9. Incremental training might be undertaken.
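The counting described in step 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transcripts are made up, the function names are hypothetical, and a production language model would add smoothing and far more data.

```python
from collections import Counter


def count_ngrams(transcripts, n=2):
    """Count word n-grams across transcripts, padding each with start/end markers."""
    counts = Counter()
    for text in transcripts:
        words = ["<s>"] * (n - 1) + text.lower().split() + ["</s>"]
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Made-up caller transcripts, for illustration only.
transcripts = [
    "I want to pay my bill",
    "I want to book an appointment",
    "pay my bill please",
]

unigrams = count_ngrams(transcripts, n=1)  # how often each word is heard
bigrams = count_ngrams(transcripts, n=2)   # how often each word pair is heard


def bigram_probability(prev, word):
    """Unsmoothed maximum-likelihood estimate of P(word | prev)."""
    context_total = sum(c for (p, _), c in bigrams.items() if p == prev)
    return bigrams[(prev, word)] / context_total if context_total else 0.0


print(unigrams.most_common(3))
print(bigram_probability("to", "pay"))
```

With only three transcripts the estimates are crude, which is why the traditional approach needed tens of thousands of recorded interactions before the model became useful.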

The advent of the major cloud vendors’ speech recognition solutions, trained on the massive data sets they have acquired, provides another way. When coupled with the same vendors’ text analysis tools, such as Dialogflow, functional natural language interfaces can be built swiftly with minimal up-front cost. Better conversations are now within reach of all enterprises, not only the Fortune 500. These same solutions can also unlock the use of natural language in more complex interactions, such as booking that medical appointment.

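The new model of building natural language interfaces looks like this. Steps 1 and 2 of the traditional process (identifying an interaction type and defining a goal) are unchanged, and are followed by these steps: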

  1. Define a virtual agent task that collects open speech, links to Dialogflow and acts on the outcome.

  2. Build a Dialogflow virtual agent by defining intents and content.

  3. Distribute some calls to the prototype.

  4. Train with tens, not tens of thousands, of inputs.

  5. Repeat steps 3 - 4, monitoring progress towards the goal, and either “go live” once it is reached or terminate when the rate of improvement is not promising.
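The decision in the final step can be sketched as a simple check run after each round of calls and retraining. The function names and the `min_gain` threshold below are assumptions made for illustration; the 95% goal echoes the example containment target mentioned earlier, and real thresholds would depend on the business case.

```python
def containment_rate(outcomes):
    """Fraction of calls fully handled by the virtual agent (True = contained)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def next_action(rates, goal=0.95, min_gain=0.01):
    """Decide what to do after the latest iteration.

    rates: containment rate measured after each iteration, oldest first.
    goal and min_gain are illustrative thresholds, not recommendations.
    """
    if not rates:
        return "continue"
    if rates[-1] > goal:
        return "go-live"
    # Too little improvement since the previous iteration: stop investing.
    if len(rates) >= 2 and rates[-1] - rates[-2] < min_gain:
        return "terminate"
    return "continue"


print(containment_rate([True, True, False, True]))
print(next_action([0.60, 0.75]))
print(next_action([0.60, 0.75, 0.96]))
```

Because each iteration needs only a handful of new training inputs, a check like this can be run after days rather than months of operation.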

To understand how Inference Solutions integrates the latest Natural Language technology, view our webinars below.