In speech processing, one of the most common tasks is removing noise from a given audio signal. We often find ourselves wanting to take a call (phone/Skype/WebEx) in surroundings that are far from conducive: a phone call in traffic, say, or a WebEx call in a cafeteria or restaurant. One obvious solution is to perform signal processing on the input to remove or reduce the noise in the signal. This blog post summarizes a deep-learning based technique for de-noising audio; the signals under study consist of human speech mixed with various background noises. This was a team submission to the Infinity 2k19 finals. To make it a fun experience, we connected an audio sentiment analyzer at the end to predict the emotional state of the speaker. We thought this could, in future, be used to detect the emotions of our clients during a call :P.


The aspiration: to have a phone or conference call anywhere, without worrying whether you are in a quiet room or a flea market, and still enjoy a noise-free, content-rich conversation.

Mathematical Formulation

Given an audio signal x containing speech s corrupted by an additive background signal n, such that x = s + n, our goal is to find a de-noising operator g such that g(x) ≈ s.

Before we get into the details of the experiments, let us brush up, at a high level, on some important concepts we will need.

Key Concepts

1. Convolution

Convolution is a mathematical operation on two functions (f and g) to produce a third function that expresses how the shape of one is modified by the other.

In the figure, the known function g(x) (red) operates on the known function f(x) (blue) to produce the resultant function f*g (black). Now flip the problem: assuming f and the result f*g are known beforehand, can we find the g(x) that produces f*g? This is the underlying principle of convolutional neural networks: nothing but curve fitting. Finding g(x) is nothing but learning the kernel values which, when convolved over the data, produce the desired output.

To connect this to a real-world example: we apply a Gaussian filter g(x) to an image f(x) to produce a smoothed image (f*g).
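As a minimal sketch of the operation itself, here is a 1-D convolution in pure NumPy with an illustrative smoothing kernel (the values are hypothetical, chosen only to show the mechanics):

```python
import numpy as np

# A known signal f and a known smoothing kernel g.
f = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
g = np.array([0.25, 0.5, 0.25])

# mode="same" keeps the output the same length as f.
smoothed = np.convolve(f, g, mode="same")
print(smoothed)  # [0.25 1.   1.5  1.   0.25]
```

Each output sample is a weighted sum of its neighbourhood in f; learning the kernel weights instead of fixing them is exactly what a CNN does.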

Standard Convolution (l=1)

Here we see a kernel of size 3x3 (gray) convolved over a matrix of size 7x7, moving one index at a time from top-left to bottom-right; the step size is known as the stride. In deep learning, the task is to learn the weights of the kernel via back-propagation. To fit more complex curves, i.e. to achieve more complex operations such as de-noising, we apply a large number of filters to a single input. But this comes with the drawback of increased computational complexity. Is there a way out?
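The 3x3-over-7x7 walk described above can be sketched directly. This is a toy NumPy implementation (not a library call), just to make the sliding-window mechanics concrete:

```python
import numpy as np

def conv2d(x, k, stride=1):
    """Slide kernel k over x (no padding), stepping `stride` indices at a time."""
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride : i * stride + kh, j * stride : j * stride + kw]
            out[i, j] = np.sum(patch * k)
    return out

x = np.arange(49, dtype=float).reshape(7, 7)  # the 7x7 input
k = np.ones((3, 3)) / 9.0                     # a 3x3 averaging kernel
y = conv2d(x, k)
print(y.shape)  # (5, 5): a 3x3 kernel at stride 1 shrinks 7x7 to 5x5
```

With no padding, each dimension shrinks by (kernel size - 1); SAME padding, used later in our model, avoids this shrinkage.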

Dilated Convolution

Dilated Convolution (l=2)

We saw in the previous section how a kernel moves over a given matrix to compute new values. Dilated convolution simply skips a fixed set of points during the convolution. This increases the receptive field compared to standard convolution, and similar performance can be obtained with fewer filters.
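A toy 1-D sketch makes the receptive-field gain visible: a kernel of width 3 with dilation 2 covers 5 input samples instead of 3, with no extra weights (the helper below is illustrative, not a library function):

```python
import numpy as np

def dilated_conv1d(x, k, dilation=1):
    """1-D convolution that skips (dilation - 1) samples between kernel taps."""
    span = (len(k) - 1) * dilation + 1      # receptive field of one layer
    out = np.zeros(len(x) - span + 1)
    for i in range(len(out)):
        taps = x[i : i + span : dilation]   # every `dilation`-th sample
        out[i] = np.sum(taps * k)
    return out

x = np.arange(10, dtype=float)
k = np.array([1.0, 1.0, 1.0])

y1 = dilated_conv1d(x, k, dilation=1)  # kernel covers 3 samples
y2 = dilated_conv1d(x, k, dilation=2)  # the same 3 taps now span 5 samples
print(len(y1), len(y2))  # 8 6
```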

For a more in-depth review, check this (article)[]

Let's Get Started on Building Shabda

1. Data Preparation

For the de-noising part, a mixture of two datasets is used in the experiments; details are given in the tables below. Noises such as traffic, cafeteria/babble, and white noise, at various strengths (0 dB to 15 dB SNR, in steps of 5 dB), were added to the clean speech signals to generate the noisy versions. The signals were then re-sampled at 16 kHz and converted to 32-bit depth, giving us 11,678 mono audio files in total.
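The noise mixing can be sketched as scaling the noise to a target signal-to-noise ratio before adding it. The helper below is a hypothetical illustration (the actual corpus tooling may differ), using a synthetic tone and white noise:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Required scale: p_clean / (scale^2 * p_noise) = 10^(snr_db / 10)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz at 16 kHz
noise = rng.standard_normal(16000)                          # stand-in for babble/traffic

for snr in (0, 5, 10, 15):  # the strengths used in the dataset
    noisy = mix_at_snr(clean, noise, snr)
```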

Denoising Dataset    | Speakers            | # of files
---------------------|---------------------|-----------
Edinburgh DataShare  | 28 foreign speakers | 11,572
CMU Speaker Dataset  | 7 Indian speakers   | 106
Total                |                     | 11,678

Sentiment Analysis Dataset | Speakers  | # of files
---------------------------|-----------|-----------
RAVDESS                    | 24 actors | 1,500
SAVEE                      | 4 actors  | 500
Total                      |           | 2,000

For the sentiment analysis experiment, since the two datasets did not have the same set of classes, we ended up using five: Calm, Sad, Fearful, Happy, and Angry. The audio signals from both datasets are re-sampled to 44 kHz to bring them to the same scale, and are converted to Mel-Frequency Cepstral Coefficients (MFCC). The MFCC representations are then fed into the neural network as features.
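For readers unfamiliar with MFCCs, here is a bare-bones NumPy sketch of the pipeline (frame, power spectrum, mel filterbank, log, DCT-II). In practice one would use a library such as librosa; the parameters below are illustrative, not the ones used in our experiments:

```python
import numpy as np

def mfcc(signal, sr=44100, n_fft=1024, hop=512, n_mels=40, n_mfcc=13):
    """Minimal MFCC: frame -> power spectrum -> mel filterbank -> log -> DCT-II."""
    # 1. Frame the signal with a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])

    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # 3. Triangular mel filterbank.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    mel_energy = np.log(power @ fbank.T + 1e-10)

    # 4. DCT-II to decorrelate; keep the first n_mfcc coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2.0 * n_mels)))
    return mel_energy @ dct.T  # shape: (n_frames, n_mfcc)

sig = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)  # 1 s tone at 44.1 kHz
feats = mfcc(sig)
print(feats.shape)  # (85, 13)
```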

2. Training

We train the two steps (de-noising and sentiment analysis) as separate models.

Overall Architecture

This diagram describes our full system architecture. We divide the system into two parts. The section above the dotted line is the de-noiser. We use an iterator which loops through the dataset, comprising noisy and clean samples, and feeds it to the model. The implementation is in TensorFlow and uses a tf.Session to drive the full experiment. The model consists of 13 fully convolutional (CNN) layers. It was trained for 320 epochs, running for a total of 80 hours.

The section below the dotted line is the Emotion Analyser. The setup is similar to the one described above. This model is a 5-layer convolutional setup; it took 1.6 hours to complete 200 epochs.

Now let's try to understand how all the parts work together.


The model consists of 13 layers {0, 1, ..., 12}. Layers 0 through 11 have dilation 2^i, where i ranges over [0..11]. We use a kernel of size 1x3 for all layers except the last, which recreates the original signal. The input audio signal is expanded to 32 channels, which are carried through to the second-last layer; the last layer maps them back to a 1-dimensional signal. SAME padding is used to keep the dimensions constant.
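One way to see why the exponentially increasing dilations matter: the receptive field of the stack grows exponentially with depth. A quick back-of-the-envelope computation for an effective kernel width of 3 and dilations 2^0 through 2^11, as described above:

```python
# Receptive field of stacked dilated convolutions:
# each layer adds (kernel_width - 1) * dilation input samples.
kernel_width = 3
dilations = [2 ** i for i in range(12)]  # dilation layers 0..11

receptive_field = 1 + sum((kernel_width - 1) * d for d in dilations)
print(receptive_field)  # 8191: samples of context seen by one output sample
```

So a single output sample depends on roughly half a second of 16 kHz audio, which 12 standard (dilation 1) layers could never achieve.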

This setup is called fully convolutional because it uses neither max-pooling nor dense layers, intermittently or regularly. It is purely convolution layers stacked on top of each other, with normalization as the regularization mechanism.

The variation of normalization used is called Adaptive Batch Normalization. It is called adaptive because it uses two variables, w0 and w1, which are learnt from the batches on every epoch.

adaptive_batch_norm(x) = w0 * x + w1 * batch_norm(x)
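A minimal NumPy sketch of this formula (the plain batch_norm here omits the usual learned scale/shift for clarity; w0 and w1 would be trained scalars in the real model):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Plain batch normalization over the batch axis (no learned scale/shift)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def adaptive_batch_norm(x, w0, w1):
    """w0 = 1, w1 = 0 recovers the identity; w0 = 0, w1 = 1 recovers batch norm."""
    return w0 * x + w1 * batch_norm(x)

x = np.random.default_rng(0).standard_normal((8, 4))  # batch of 8, 4 features
out = adaptive_batch_norm(x, w0=1.0, w1=0.0)          # identity in this setting
```

Because the network can learn to interpolate between "no normalization" and "full normalization", it adapts the amount of normalization per layer.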

The activation used is leaky ReLU with alpha = 0.2.

Another interesting trick used in the setup is Deep Feature Loss.

Consider a neural network already trained on a speech classification task, say classifying various animal sounds by their sources. Since the initial layers learn low-level feature representations of the input, these representations tend to be less task-dependent.

Let's denote this pre-trained network as N. Assume N has 32 layers. From these 32 layers, we take the first few, say 14, and call them N[:14]. These layers are frozen: they will neither be trained further nor fine-tuned in any way whatsoever.

Let's pass a noisy signal through this network N[:14]. The result is an activation at each node of every layer in the set of 14. Denote the output at each layer as feat_current[i], where i ranges over [0..13].

Let's repeat the same experiment with a clean audio sample and denote the outputs as feat_target[i].

Let's bring our de-noiser back, and denote the loss weight associated with each layer by loss_weights[i].

loss_vec = 0
for id in range(14):
    loss_vec += l1_loss(feat_current[id], feat_target[id]) / loss_weights[id]

We zip the two sets of outputs (feat_current, feat_target) from the noisy and clean audio samples and compute an l1_loss between the corresponding tensors of the ith layer, then divide by loss_weights[i] to incorporate feedback from the de-noising network trained so far.

Plugging all this together, we use the computed loss_vec as the loss for the speech-enhancement network. loss_weights are updated every epoch and used in the calculation. This is a more powerful way of calculating the loss because it incorporates feedback from an external entity (the pre-trained network); this is called a semantic loss, as opposed to a plain l1_loss on the waveforms.
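The whole deep-feature-loss computation can be sketched end to end with stand-in activations (the shapes and values below are hypothetical; in the real system they come from the frozen classifier N[:14]):

```python
import numpy as np

def l1_loss(a, b):
    """Mean absolute difference between two activation tensors."""
    return np.mean(np.abs(a - b))

def deep_feature_loss(feat_current, feat_target, loss_weights):
    """Sum of per-layer L1 distances between activations, scaled by loss_weights."""
    return sum(l1_loss(c, t) / w
               for c, t, w in zip(feat_current, feat_target, loss_weights))

# Hypothetical activations from 14 frozen layers of the classifier N[:14].
rng = np.random.default_rng(0)
feat_target = [rng.standard_normal((16, 32)) for _ in range(14)]   # clean audio
feat_current = [f + 0.1 * rng.standard_normal(f.shape)             # noisy audio
                for f in feat_target]
loss_weights = [1.0] * 14

loss = deep_feature_loss(feat_current, feat_target, loss_weights)
```

A perfect de-noiser would make feat_current match feat_target layer for layer, driving this loss to zero.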

Sentiment analysis is performed on a 1-dimensional feature vector (216 values) obtained from the MFCCs. These features pass through two convolutional layers with 256 and 128 filters (kernel size 5), followed by dropout and max-pooling of size 8. Before the flatten, dense, and softmax layers, we have two more convolutions with 128 filters each.
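The layer stack can be sanity-checked by tracing the feature length through it. This assumes 'same' padding on the convolutions; the hyper-parameters are our reading of the description above, not an exact replica of the code:

```python
def conv1d_same_len(n, kernel):  # 'same' padding keeps the length unchanged
    return n

def maxpool1d_len(n, pool):      # pooling divides the length by the pool size
    return n // pool

n = 216                          # MFCC-derived feature vector
n = conv1d_same_len(n, 5)        # Conv1D, 256 filters, kernel 5
n = conv1d_same_len(n, 5)        # Conv1D, 128 filters, kernel 5
n = maxpool1d_len(n, 8)          # dropout + MaxPooling1D(8)
n = conv1d_same_len(n, 5)        # Conv1D, 128 filters
n = conv1d_same_len(n, 5)        # Conv1D, 128 filters
flattened = n * 128              # Flatten before the dense/softmax head
print(n, flattened)              # 27 3456
```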

3. Serving

The application works in an offline fashion: the full wave file has to be passed into the network for training or prediction. Processing takes around 5 seconds for a 10-second clip. In the prediction phase, the audio file runs through pre-processing operations and enters the de-noiser. The de-noiser clears the noise out of the signal and passes the result to the Emotion Analyzer. For videos, we strip the audio from the video, process the audio, and re-attach it to the video.


Below are a few samples that were passed through the full network. The first two are video-based samples, whereas the last one is audio only.

As per Shabda, PM Modi’s sentiment: Fearful
As per Shabda, Anil's sentiment: Fearful
Cafeteria Experiment


Key Takeaways

  • Dilated convolution increases the receptive field without increasing the number of filters.
  • Raw audio data can be fed directly to a neural net, without spectrogram-based or statistical signal-processing front-ends.
  • Pre-trained networks can be utilized as semantic loss instruments.

Future Work

Based on our experience with the above architecture, we see the following directions to proceed in this area.

  • Further improve the quality of the de-noised output
  • Migrate from batch to streaming mode
  • Integrate with communication systems like Skype/Whats App/Hangouts
  • Expand the existing model to work for multiple speakers



Research paper