• Chitra Sharathchandra

Audio data with deep learning - the next frontier

Figure: Audio waveform


Artificial intelligence (AI) has become an integral part of our lives, and machine learning is being applied to new problems every day. With the recent advancements in deep learning, most AI practitioners agree that AI's impact is expanding rapidly year after year. This growth has been fueled by big data and unstructured data. The two terms are not synonymous. Big data can be structured or unstructured and has three characteristics: volume, velocity (the speed at which it arrives) and variety (different types such as pictures and videos). Unstructured data, on the other hand, is data that does not follow an organized, repetitive structure. It often arrives in large volume and at high velocity, which is why it is frequently treated as big data.

Two well-known examples of unstructured data are images and text. Image data is used to solve complex computer vision problems such as facial recognition and autonomous driving. Text data, on the other hand, is used to solve Natural Language Processing (NLP) problems such as understanding human language or translating from one language to another (see our NLP blog). Because of these applications, image and text data have received a lot of attention.

Along with images and text, there's a third type of unstructured data: audio data. Audio data is less well known, and we'll be diving into it in this post. This type of data comes in the form of audio files (.wav, .mp3 etc.) or streaming audio. Most applications of audio data have been in the music domain, in the form of music cataloging or lyric generation. The complexity of audio data has kept it from finding mainstream applications, but this has changed with the rapid development of deep learning.

Audio data applications

Audio data is used to build AI models that perform automatic speech recognition (ASR). ASR solves problems such as understanding what is said to a voice assistant such as Alexa or Siri, or converting speech to text for applications such as voice bots and automatic medical transcription. In addition to ASR, audio data is also used to solve problems such as speaker detection, speaker verification and speaker diarization. Applications of speaker detection include Alexa’s ability to change responses based on who is speaking, and identifying speakers in live-streamed audio or video. An application of speaker verification is biometric security. Speaker diarization refers to segmenting audio to identify who said what, and when. A common application of speaker diarization is transcribing meeting recordings or phone conversations by speaker. As this technology matures, many more applications based on conversations between people become possible, for example automatic scoring of oral exams in schools or mental health diagnosis based on conversations.

Features of audio data

Unlike text and image data, audio data has hidden characteristics in its signal which tend to be more difficult to mine. Most audio data available today is digitized. The digitization process stores audio signals by sampling them. The sampling rate varies by the type of media. For example, CD quality audio uses a sampling rate of 44,100 Hz (44.1 kHz). This means that the audio signal is sampled 44,100 times per second and stored in a digital format. Each sample value represents the intensity or amplitude of the sound signal at that instant. This sampled data can be processed further to extract features, depending on what kind of analysis is required. Spectral features, which are computed in the frequency domain, are often the most useful. Examples of commonly used features and their applications are as follows (there are many more):

1. Mel Frequency Cepstral Coefficients (MFCC) - represent the envelope of the short-time power spectrum, which characterizes the sounds produced by the human vocal tract

2. Zero crossing rate – used to detect percussive sounds and a good feature for classifying musical genres

3. Average Energy – measures the loudness of a frame; how that energy is distributed across frequency bands can point to formants, which help characterize a human voice

4. Spectral entropy – used to detect silence

A speech model will extract the above features depending on the application and use them in a supervised or unsupervised machine learning model or in a deep learning model.
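To make sampling and a few of these features concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not a production pipeline: the signals are synthetic (a 440 Hz tone and random noise), the function names are hypothetical, and the spectrum comes from a naive DFT that is only practical for short frames. In practice one would typically reach for a numerical or audio library. MFCC is left out because it additionally needs a mel filterbank and a cosine transform.

```python
import cmath
import math
import random

SAMPLE_RATE = 44_100  # CD-quality sampling rate in Hz


def sine_wave(freq_hz, duration_s, sample_rate=SAMPLE_RATE, amplitude=0.5):
    """Sample a pure tone: one amplitude value per sampling instant."""
    n_samples = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def average_energy(frame):
    """Mean of the squared sample amplitudes."""
    return sum(x * x for x in frame) / len(frame)


def spectral_entropy(frame):
    """Shannon entropy of the power spectrum, normalized to the range 0..1."""
    n = len(frame)
    n_bins = n // 2 + 1
    power = []
    for k in range(n_bins):  # naive DFT, acceptable for one short frame
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0  # an all-zero frame has no spectrum to measure
    probs = [p / total for p in power]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return entropy / math.log2(n_bins)


# One second of audio at 44,100 Hz is 44,100 samples.
one_second = sine_wave(440, 1.0)
print("samples in one second:", len(one_second))

# Features are usually computed on short frames rather than whole files.
rng = random.Random(0)
tone = one_second[:512]
noise = [rng.uniform(-0.5, 0.5) for _ in range(512)]

features = {}
for name, frame in (("tone", tone), ("noise", noise)):
    features[name] = {
        "zcr": zero_crossing_rate(frame),
        "energy": average_energy(frame),
        "entropy": spectral_entropy(frame),
    }
    print(name, {k: round(v, 3) for k, v in features[name].items()})
```

The tone frame should show a much lower zero crossing rate and spectral entropy than the noise frame, because its energy sits in a narrow band of frequencies while the noise spreads energy across the whole spectrum.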

Models for speaker detection, verification and speaker diarization

Speaker detection and speaker verification are classification problems. For speaker detection, audio features must be extracted for each speaker. These features can then be fed to a neural network, which is trained to recognize who is speaking.
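The classification framing can be sketched with a toy example. Everything here is an assumption made for illustration: the two-dimensional feature vectors are synthetic stand-ins for real audio features, the speaker centers are made up, and a single logistic neuron trained with gradient descent stands in for a full neural network.

```python
import math
import random

rng = random.Random(42)


def synthetic_features(center, n):
    """Hypothetical 2-D feature vectors for one speaker (stand-ins for real features)."""
    return [[rng.gauss(center[0], 0.3), rng.gauss(center[1], 0.3)]
            for _ in range(n)]


def make_dataset(n_per_speaker):
    """Label speaker A as 0 and speaker B as 1, then shuffle."""
    data = [(x, 0) for x in synthetic_features((-1.0, -1.0), n_per_speaker)]
    data += [(x, 1) for x in synthetic_features((1.0, 1.0), n_per_speaker)]
    rng.shuffle(data)
    return data


def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to speaker B."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=50, lr=0.1):
    """Stochastic gradient descent on the log loss of a single neuron."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            error = predict_proba(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def accuracy(weights, bias, data):
    correct = sum((predict_proba(weights, bias, x) >= 0.5) == (y == 1)
                  for x, y in data)
    return correct / len(data)


train_set = make_dataset(100)
test_set = make_dataset(50)
weights, bias = train(train_set)
held_out_accuracy = accuracy(weights, bias, test_set)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

Real speaker features overlap far more than these synthetic clusters do, which is one reason deeper networks and much more training data are needed in practice.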

Figure: Models for speaker detection and verification

Speaker diarization has historically been treated as an unsupervised clustering problem, but newer models are based on neural networks.
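The older clustering approach can be illustrated with a small sketch. It assumes the recording has already been cut into one-second segments and that each segment has been reduced to a feature vector; here those vectors are synthetic, and the speaker pattern and feature centers are invented for the example. A plain two-cluster k-means groups the segments without any labels, and consecutive segments in the same cluster are merged into speaker turns.

```python
import math
import random

rng = random.Random(7)

# Hypothetical conversation: the true speaker of each 1-second segment.
TRUE_SPEAKERS = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
CENTERS = {0: (-1.0, 0.5), 1: (1.0, -0.5)}  # made-up per-speaker feature means

# One synthetic feature vector per segment; the clustering never sees the labels.
segments = [[rng.gauss(CENTERS[s][0], 0.2), rng.gauss(CENTERS[s][1], 0.2)]
            for s in TRUE_SPEAKERS]


def kmeans_two_clusters(points, iterations=20):
    """Plain k-means with k=2 and a deterministic initialization."""
    first = points[0]
    farthest = max(points, key=lambda p: math.dist(p, first))
    centroids = [list(first), list(farthest)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each segment joins its nearest centroid.
        labels = [min((0, 1), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def to_turns(labels, segment_seconds=1.0):
    """Merge consecutive segments with the same label into (speaker, start, end)."""
    turns = []
    for i, lab in enumerate(labels):
        start = i * segment_seconds
        if turns and turns[-1][0] == lab:
            turns[-1] = (lab, turns[-1][1], start + segment_seconds)
        else:
            turns.append((lab, start, start + segment_seconds))
    return turns


labels = kmeans_two_clusters(segments)
turns = to_turns(labels)
for speaker, start, end in turns:
    print(f"speaker {speaker}: {start:.0f}s to {end:.0f}s")
```

The cluster ids are arbitrary: clustering can say that two segments share a speaker, but not who that speaker is. Putting names to the clusters is the job of speaker detection or verification.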

Speech model performance

Speech models have yet to overcome the following challenges: (1) poor accuracy on recordings of people of the same gender or of people with different accents, (2) poor speech-to-text accuracy due to language complexities and (3) an inability to deal with background noise. The first challenge can be overcome with more training data. New methodologies that bring together acoustic data and text data are addressing the second challenge. The third challenge, speech denoising (removal of background noise), requires large numbers of noise samples and clean speech samples.
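One common way to prepare data for denoising is to mix clean speech with noise at a chosen signal-to-noise ratio (SNR), producing noisy and clean pairs. The sketch below shows only that mixing step and rests on assumptions: a sine wave stands in for clean speech, Gaussian noise stands in for recorded background noise, and the function names are hypothetical.

```python
import math
import random


def power(signal):
    """Average power: the mean of the squared samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10 * math.log10(power(signal) / power(noise))


def mix_at_snr(clean, noise, target_snr_db):
    """Scale the noise so the mix has the requested SNR, then add it to the clean signal."""
    target_noise_power = power(clean) / (10 ** (target_snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    scaled_noise = [scale * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled_noise)]
    return noisy, scaled_noise


rng = random.Random(1)
sample_rate = 16_000  # assumed rate for this example

# Stand-ins: a 200 Hz tone for clean speech, Gaussian noise for the background.
clean = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate)
         for n in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

noisy, scaled_noise = mix_at_snr(clean, noise, target_snr_db=10)
print(f"achieved SNR: {snr_db(clean, scaled_noise):.2f} dB")
```

Repeating the mix over many noise recordings and several SNR levels is a simple way to multiply a limited set of clean recordings into a larger training set.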

Overall, one can expect speech models to perform better with more varied data. This is the case for deep learning models in other areas as well. As building complex deep learning models becomes easier with various frameworks, most of the work lies in understanding and preparing the data.


Audio data is coming into prime time along with its cousins – image and text data. The main driver for it has been deep learning. Applications such as voice assistants and voice bots have entered the mainstream due to this technology development. With a broad spectrum of models in the area of ASR, speaker detection, speaker verification and speaker diarization, we can expect a larger array of conversation-based applications. Crossing this frontier will require integrating multiple types of data and preparing them well so that they are ingested by advanced models to produce good predictions.

About Cedrus: Cedrus Digital is involved in studying audio and conversation data and provides strategies on how information can be harvested from them. Cedrus Digital provides analytics and data science services to gain visibility into high volume call center inquiries – creating opportunities for process efficiencies and high value Conversational AI call center solutions and supplementation.

Chitra Sharathchandra is a Data Scientist who enjoys working on implementable AI solutions related to multiple types of data.