Getting Started

This page provides an overview of the features and functionality in AutoTranscribe.

After AutoTranscribe is integrated into your applications, you can use all of the configured features.

Transcription Outputs

AutoTranscribe returns transcriptions as a sequence of utterances with start and end timestamps in response to an audio stream from a single speaker.

As the agent and customer speak, ASAPP’s automated speech recognition (ASR) model transcribes their audio streams and returns completed utterances based on the natural pauses from each speaker. The expected latency between when ASAPP receives audio for a completed utterance and provides a transcription of that same utterance is 200-600ms.

Perceived latency will also be influenced by any network delay sending audio to ASAPP and receiving transcription messages in return.

Smart Formatting is enabled by default, producing utterances with punctuation and capitalization already applied. Spoken forms are also automatically converted to written forms (e.g. ‘twenty two’ becomes ‘22’).
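
The exact message schema depends on your integration and is defined in the API reference; the sketch below is illustrative only, and the field names (utterance, speaker, start_ms, end_ms, text) are assumptions rather than the documented format. It shows how a consumer might handle a completed, formatted utterance:

    import json

    # Illustrative sketch only: the field names below are assumptions, not the
    # documented AutoTranscribe message schema.
    def handle_message(raw_message: str) -> None:
        message = json.loads(raw_message)
        utterance = message.get("utterance", {})
        speaker = utterance.get("speaker")    # e.g. "agent" or "customer"
        start_ms = utterance.get("start_ms")  # utterance start timestamp (ms)
        end_ms = utterance.get("end_ms")      # utterance end timestamp (ms)
        text = utterance.get("text")          # Smart Formatting already applied
        print(f"[{speaker}] {start_ms}-{end_ms} ms: {text}")

    handle_message('{"utterance": {"speaker": "customer", '
                   '"start_ms": 1200, "end_ms": 3400, '
                   '"text": "My order number is 22."}}')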

Redaction

AutoTranscribe can redact sensitive information from the audio in real time, returning utterances with the sensitive content masked by hash marks.
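
For example, a spoken card number might be returned with its digits masked (the exact masking format can vary by configuration):

    Sure, my card number is #### #### #### ####.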

ASAPP applies default redaction policies to prevent exposure of sensitive combinations of numerical digits. To configure redaction rules for your implementation, consult your ASAPP account contact.

Visit the Data Redaction section to learn more.

Customization

Transcriptions

ASAPP customizes transcription models for each implementation of AutoTranscribe to ensure domain-specific context and terminology are well incorporated prior to launch.

Consult your ASAPP account contact if the required historical call audio files are not available ahead of implementing AutoTranscribe.

Option: Baseline
Description: ASAPP’s general-purpose transcription capability, trained with no audio from relevant historical calls.
Requirements: None.

Option: Customized
Description: A custom-trained transcription model that incorporates domain-specific terminology likely to be encountered during implementation.
Requirements: For English custom models, a minimum of 100 hours of representative historical call audio between customers and agents; for Spanish custom models, a minimum of 200 hours.

When supplying recorded audio to ASAPP for AutoTranscribe model training prior to implementation, send uncompressed .WAV media files with speaker-separated channels.

Recordings for training and real-time streams should use the same sample rate (8000 samples/sec) and audio encoding (16-bit PCM).
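
You can verify these properties before sending files. The following is a minimal sketch using Python’s standard wave module, not an ASAPP-provided tool, and the file path is a placeholder:

    import wave

    # Sketch of a pre-submission check (not an ASAPP tool): confirm a recording
    # has two speaker-separated channels, an 8000 samples/sec sample rate, and
    # 16-bit PCM samples. The wave module only reads uncompressed PCM WAV files
    # and raises an error for other encodings.
    def check_training_audio(path: str) -> bool:
        with wave.open(path, "rb") as recording:
            return (
                recording.getnchannels() == 2         # speaker-separated channels
                and recording.getframerate() == 8000  # 8000 samples/sec
                and recording.getsampwidth() == 2     # 2 bytes = 16-bit PCM
            )

    # Example usage (placeholder path):
    # check_training_audio("example_call.wav")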

Visit Transmitting Data to SFTP for instructions on how to send historical call audio files to ASAPP.

Vocabulary

In addition to training on historical transcripts, AutoTranscribe accepts explicitly defined custom vocabulary for terms that are specific to your implementation.

AutoTranscribe also boosts detection of these terms by accepting approximations of how each term ordinarily sounds, so that it can be recognized and output with the correct spelling.

Common examples of custom vocabulary include:

  • Branded products, services and offers
  • Commonly used acronyms or abbreviations
  • Important corporate addresses

Custom vocabulary is sent to ASAPP for each audio transcription session, and can be consistent across all transcription requests or adjusted for different use cases (different brands, skills/queues, geographies, etc.).

Session-specific custom vocabulary is only available for AutoTranscribe implementations via WebSocket API.
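
The exact request format is defined by the WebSocket API documentation; the sketch below is illustrative only, and the payload shape and field names (for example, customVocabulary) are assumptions. It shows the idea of supplying per-session vocabulary when a transcription session is opened:

    import json

    # Illustrative sketch: field names and payload shape are assumptions, not
    # the documented WebSocket API. Each session can carry vocabulary suited to
    # its brand, skill/queue, or geography.
    session_start = {
        "audioFormat": {"encoding": "pcm_s16le", "sampleRateHz": 8000},
        "customVocabulary": [
            "StreamMax Platinum",   # hypothetical branded product or offer
            "ACW",                  # commonly used acronym
            "1 Example Plaza",      # important corporate address
        ],
    }

    # This payload would be sent when opening the WebSocket connection
    # for the audio transcription session.
    print(json.dumps(session_start))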

For Media Gateway implementations, transcription models can also be trained with custom vocabulary through an alternative mechanism. Reach out to your ASAPP account team for more information.

Use Cases

For Live Agent Assistance

Challenge

Organizations are exploring technologies to assist agents in real-time by surfacing customer-specific offers, troubleshooting process flows, topical knowledge articles, relevant customer profile attributes and more. Agents have access to most (if not all) of this content already, but a great assistive technology makes content actionable by finding the right time to bring the right item to the forefront. To do this well, these technologies need to know both what’s been said and what is being said in the moment with very low latency.

Many of these technologies face agent adoption and click-through challenges for two reported reasons:

  1. Recommended content often doesn’t fit the conversation, which may mean the underlying transcription isn’t an accurate representation of the real conversation
  2. Recommended content doesn’t arrive soon enough for them to use it, which may mean the latency between the audio and outputted transcript is too high

Using AutoTranscribe

AutoTranscribe is built to be the call transcript input data source for models that power assistive technologies for customer interactions.

Because AutoTranscribe is specifically designed for customer service interactions and trained on implementation-specific historical data, the word error rate (WER) for domain- and company-specific language is substantially reduced, rather than that language becoming a source of incorrect transcriptions that lead models astray.

To illustrate this point, consider a sample of 10,000 hours of transcribed audio from a typical contact center. A speech-to-text service only needs to recognize 241 of the most frequently used words to get 80% accuracy; those are largely words like “the”, “you”, “to”, “what”, and so on.

To get to 90% accuracy, the system needs to correctly transcribe the next 324 most frequently used words, and even more for every additional percent. These are often words that are unique to your business: the words that really matter.
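
The same coverage calculation can be reproduced on any transcript corpus. The sketch below, with a toy word list standing in for real transcripts, counts how many of the most frequent distinct words are needed to cover a target share of all spoken tokens:

    from collections import Counter

    # Sketch: given word counts from a transcript corpus, find how many of the
    # most frequent distinct words are needed to cover a target share of tokens.
    def words_needed_for_coverage(counts: Counter, target: float) -> int:
        total = sum(counts.values())
        covered = 0
        for rank, (_, count) in enumerate(counts.most_common(), start=1):
            covered += count
            if covered / total >= target:
                return rank
        return len(counts)

    toy_corpus = "the you to what the the you to the what account upgrade".split()
    print(words_needed_for_coverage(Counter(toy_corpus), 0.80))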

To ensure these high-accuracy transcript inputs reach models quickly enough to make timely recommendations, the expected time from audio received to transcription of that same utterance is 200-600 ms (excluding effects of network delay, as noted in Transcription Outputs).

For Insights and Compliance

Challenge

For many organizations, lack of accuracy and coverage of speech-to-text technologies prevent them from effectively employing transcripts for insights, quality management and compliance use cases. Transcripts that fall short of accurately representing conversations compromise the usability of insights and leave too much room for ambiguity for quality managers and compliance teams.

Transcription technologies that aren’t accurate enough for many use cases also tend to be employed only for a minority share of total call volume, because the outputs aren’t useful enough to justify paying for full coverage. As a result, quality and compliance teams must rely on audio recordings since most calls don’t get transcribed.

Using AutoTranscribe

AutoTranscribe is specifically designed to maximize domain-specific accuracy for call center conversations. It is trained on past conversations before being deployed and continues to improve early in the implementation as it encounters conversations at scale.

For non-real-time use cases, AutoTranscribe also supports processing batches of call audio at an interval that suits the use case.

Teams can query AutoTranscribe outputs in time-stamped utterance tables for data science and targeted compliance use cases, or load customer and agent utterances into quality management systems so managers can review them in messaging-style user interfaces.
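
Output schemas vary by implementation; as a sketch only (the table layout and column names such as call_id, speaker, start_ms, and text are assumptions), a targeted compliance check over time-stamped utterances might look like this:

    import pandas as pd

    # Sketch: column names are assumptions, not a documented schema. Find calls
    # where the agent read a required recording disclosure.
    utterances = pd.DataFrame(
        [
            {"call_id": "c1", "speaker": "agent", "start_ms": 500,
             "text": "This call may be recorded for quality purposes."},
            {"call_id": "c2", "speaker": "agent", "start_ms": 700,
             "text": "How can I help you today?"},
        ]
    )

    disclosures = utterances[
        (utterances["speaker"] == "agent")
        & utterances["text"].str.contains("call may be recorded", case=False)
    ]
    print(disclosures[["call_id", "start_ms", "text"]])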

AI Services That Enhance AutoTranscribe

Once accurate call transcripts are generated, automatic summarization of those customer interactions becomes possible.

ASAPP AutoSummary is a recommended pairing with AutoTranscribe: it generates analytics-ready structured summaries and readable paragraph summaries that spare agents the distraction of writing and submitting disposition notes on every call.