### Introduction

In speech processing, one of the most common tasks is removing noise from a given audio signal. We often want to take a call (phone/Skype/WebEx) in surroundings that are far from conducive, for example a phone call in traffic or a WebEx call in a cafeteria or restaurant. One obvious solution is to perform signal processing on the input to remove or reduce the noise in the signal. This blog post summarizes a deep-learning based technique for de-noising audio, where the signal in question consists of human speech mixed with various background noises. This was a team submission to the Infinity 2k19 finals. To make it a fun experience, we connected an audio sentiment analyzer at the end to predict the emotional state of the speaker; we thought this could, in future, be used to detect the emotions of our clients during a call :P.

### Motivation

To be able to have a phone call or conference call anywhere, whether in a quiet room or a flea market, and still get a noise-free, content-rich conversation.

### Mathematical Formulation

Given an audio signal x consisting of speech ß corrupted by an additive background signal n, such that x = ß + n, our goal is to find a de-noising operator g such that g(x) ≈ ß.

Before we get into the details of the experiments, let's brush up, at a high level, on some important concepts we will need.

## Key Concepts

### 1. Convolution

**Convolution** is a mathematical operation on two functions (f and g) to produce a third function that expresses how the shape of one is modified by the other.

We see that a known function g(x) (**Red**) operates on another known function f(x) (**Blue**) to produce the resultant function f\*g (**Black**). Now flip the problem: assuming the function **f** and the resultant **f\*g** are known beforehand, can we find the g(x) that produces **f\*g**? This is the underlying principle of convolutional neural networks, which is *nothing but curve fitting*: finding g(x) means learning the kernel values which, when convolved over the data, produce the desired output.

To connect this to a real-world example, we can apply a Gaussian filter **g(x)** to a blurred image **f(x)** to produce a smooth image **(f\*g)**.

Here we see a kernel of size 3x3 (*gray*) convolved over a matrix of size 7x7, moving one index at a time (the stride) in a top-left --> bottom-right fashion. In deep learning, the task is to learn the weights of the kernel with the help of back-propagation. In order to fit more complex curves, i.e. to achieve more complex operations such as de-noising, we apply a large number of filters to a single input. But this comes with the drawback of increased computational complexity. Is there a way out?
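To make the sliding-kernel idea concrete, here is a minimal NumPy sketch of a 1-D convolution with stride 1 and no padding (signal and kernel values are made up for illustration; like most deep-learning frameworks, it slides the kernel without flipping it):

```python
import numpy as np

def conv1d(signal, kernel, stride=1):
    """Slide the kernel over the signal, `stride` indices at a time."""
    k = len(kernel)
    out_len = (len(signal) - k) // stride + 1
    return np.array([
        np.dot(signal[i * stride : i * stride + k], kernel)
        for i in range(out_len)
    ])

# A 3-tap averaging kernel acts as a simple smoothing filter.
signal = np.array([1.0, 2.0, 6.0, 2.0, 1.0, 3.0, 5.0])
kernel = np.array([1 / 3, 1 / 3, 1 / 3])
print(conv1d(signal, kernel))  # 5 values, each a local average
```

The learning problem flips this around: the outputs and inputs are known, and back-propagation searches for the kernel values.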

### Dilated Convolution

We saw in the previous section how a kernel moves over a given matrix to compute new values. Dilated convolution is nothing but convolution that skips a fixed number of points between kernel taps. This increases the receptive field compared to standard convolution, and similar performance can be obtained with fewer filters.
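The skipping can be sketched in a few lines of NumPy. With dilation d, each kernel tap reads a sample d indices apart, so a 3-tap kernel spans 2d + 1 samples while still holding only 3 weights:

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """Standard convolution, except kernel taps skip `dilation - 1`
    points between reads, widening the receptive field for free."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # samples seen by one output value
    out_len = len(signal) - span + 1
    return np.array([
        sum(kernel[j] * signal[i + j * dilation] for j in range(k))
        for i in range(out_len)
    ])

signal = np.arange(10, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])

print(dilated_conv1d(signal, kernel, dilation=1))  # standard convolution
print(dilated_conv1d(signal, kernel, dilation=4))  # spans 9 samples: [12. 15.]
```

With dilation=1 this reduces exactly to the standard convolution from the previous section.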

For a more in-depth review, check this [article](https://towardsdatascience.com/review-dilated-convolution-semantic-segmentation-9d5a5bd768f5).

### Let's Get Started on Building Shabda

### 1. Data Preparation

For the de-noising part, a mixture of two data-sets is used in the experiments; details are provided in the tables below. Noises such as **traffic, cafeteria/babble, and white noise**, at various strengths (0 dB to 15 dB in steps of 5 dB), were added to the clean speech signals to generate the noisy versions. The signals were then re-sampled at 16 kHz and converted to 32-bit depth, giving us 11,678 mono audio files in total.

| Denoising Dataset | Speaker | # of files |
|---|---|---|
| Edinburgh DataShare | 28 Foreign Speakers | 11,572 |
| CMU Speaker Dataset | 7 Indian speakers | 106 |
| **Total** | | 11,678 |

| Sentiment Analysis Dataset | Speaker | # of files |
|---|---|---|
| RAVDESS | 24 Actors | 1,500 |
| SAVEE | 4 Actors | 500 |
| **Total** | | 2,000 |

For the sentiment analysis experiment, since the two data-sets do not have the same set of classes, we ended up using **Calm, Sad, Fearful, Happy, and Angry**. The audio signals from both data-sets are re-sampled to 44 kHz to put them on the same scale, and are converted to Mel-Frequency Cepstral Coefficients (MFCC). The MFCC representations are then fed into the neural network as features.

### 2. Training

We train the two stages (de-noising and sentiment analysis) as separate models.

This diagram describes our full system architecture. We divide the system into two parts. The section above the dotted line is the de-noiser. We use an **iterator** which loops through the *dataset*, comprising noisy and clean samples, and feeds it to the model. The implementation is in TensorFlow and uses a tf.Session to drive the full experiment. The model consists of 13 layers and is fully convolutional (CNN). It was trained for 320 epochs, running for a total of 80 hours.

The section below the de-noiser is the Emotion Analyser. The setup is similar to the one described above. This model is a 5-layer convolutional setup; it took 1.6 hours to train for 200 epochs.

Now let's try to understand how all the parts work together.

The model consists of 13 layers {0, 1, ..., 12}, with the 1st through the second-last layer having dilation rates that increase in powers of two (2^i, where i ranges over [0..11]). We use a kernel of size 1x3 for all layers except the last, which is used to recreate the original signal. The given audio signal is expanded to 32 channels, which are carried from the first layer through the second-last layer; the last layer then maps them back to a 1-dimensional signal. **SAME** padding is used to keep the dimensions constant.
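A quick back-of-the-envelope check shows why dilations in powers of two matter: with a 1x3 kernel, each layer with dilation d extends the receptive field by 2·d samples. A sketch, assuming dilations 2^0 through 2^11 across the twelve dilated layers:

```python
# Receptive field of a stack of 1x3 convolutions with dilations 2^0 .. 2^11.
# Each layer with dilation d extends the receptive field by (kernel_size - 1) * d.
kernel_size = 3
receptive_field = 1
for i in range(12):
    receptive_field += (kernel_size - 1) * (2 ** i)

print(receptive_field)  # 8191 samples, i.e. ~0.5 s of audio at 16 kHz
```

The same span with undilated 1x3 convolutions would need over four thousand layers, which is the whole appeal of dilation here.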

This setup is called fully convolutional, as it uses neither max-pooling nor dense layers; it is purely convolutional layers stacked on top of each other, with normalization as the regularization mechanism.

The variant of normalization used is called Adaptive Batch Normalization. It is called adaptive because it uses two variables, w0 and w1, which are learnt from the batches on every epoch.

Adaptive Batch Normalization = w0*x+w1*batch_norm(x)
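A minimal NumPy sketch of that formula (the batch values and the w0/w1 settings are illustrative stand-ins, not the learnt values):

```python
import numpy as np

def adaptive_batch_norm(x, w0, w1, eps=1e-5):
    """w0 * x + w1 * batch_norm(x); w0 and w1 are learnt per layer."""
    mean = x.mean(axis=0)          # per-feature batch statistics
    var = x.var(axis=0)
    normed = (x - mean) / np.sqrt(var + eps)
    return w0 * x + w1 * normed

batch = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# w0=1, w1=0 is the identity; w0=0, w1=1 is plain batch norm;
# anything in between blends the two, and training picks the blend.
print(adaptive_batch_norm(batch, w0=0.3, w1=0.7))
```

The interesting property is that the network can learn to fall back to the identity where normalization hurts, instead of being forced to normalize everywhere.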

The activation used is leaky ReLU with alpha = 0.2.

Another interesting gimmick used in the setup is *Deep Feature Loss.*

Consider a neural network already trained on a speech classification task. Let's say classifying various voices of animals to their respective sources. Since the initial layers happen to learn low-level feature representations of given input, these representations tend to be less task-dependent.

Let's denote this pre-trained network as N, and assume N has 32 layers. From these 32 layers we take the first few, say 14, and call it N[:14]. These layers are frozen and will neither be trained any further nor fine-tuned in any way.

Let's pass a noisy signal through this network N[:14]. When the signal is passed through, the result is an activation at each node of every one of the 14 layers. Let us denote the output at each layer as **feat_current[i]**, where i ranges over [0..13].

Let's repeat the same experiment with a clean audio sample and denote the outputs as **feat_target[i]**.

Let's bring our de-noiser back and denote the weight associated with each layer by **loss_weights[i]**.

```
loss_vec = 0.0
for id in range(14):
    loss_vec += l1_loss(feat_current[id], feat_target[id]) / loss_weights[id]
```

We zip the two outputs (**feat_current**, **feat_target**) of the noisy and clean audio samples and compute the l1_loss between the corresponding tensors of the i-th layer, then divide the value by loss_weights in order to incorporate feedback from the de-noising network trained so far.

Plugging all this together, we use the computed value of **loss_vec** as the loss for the speech-enhancement network. **loss_weights** are updated on every epoch and used in the calculation. This is a more powerful way of calculating the loss, as it incorporates feedback from an external entity (the pre-trained network); this is called a semantic loss, as opposed to relying on a simple l1_loss.
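Putting the loss together, here is a runnable NumPy sketch; the feature maps and weights are random stand-ins for the real activations of N[:14] and the per-layer weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the activations of the 14 frozen layers of N, for the
# de-noised output (feat_current) and the clean target (feat_target).
feat_current = [rng.normal(size=(8, 16)) for _ in range(14)]
feat_target = [rng.normal(size=(8, 16)) for _ in range(14)]
loss_weights = np.ones(14)  # updated every epoch in the real setup

def l1_loss(a, b):
    """Mean absolute difference between two feature maps."""
    return np.abs(a - b).mean()

loss_vec = sum(
    l1_loss(feat_current[i], feat_target[i]) / loss_weights[i]
    for i in range(14)
)
print(loss_vec)  # scalar loss driving the speech-enhancement network
```

In the real training loop the only thing that changes is where feat_current comes from: it is the de-noiser's output pushed through the frozen classifier.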

Sentiment analysis is performed on the 1-dimensional features (216 of them) obtained from the MFCC. These features are passed through two convolutional layers with 256 and 128 filters, each with a kernel of size 5, followed by dropout and max-pooling of size 8. Before the flatten, dense, and softmax layers, we have two more convolutions of 128 filters each.
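To sanity-check the dimensions, a small sketch that walks the 216 MFCC features through that layer sequence (SAME padding on the convolutions is an assumption carried over from the de-noiser):

```python
def conv1d_same(length, kernel_size=5):
    # SAME padding keeps the temporal length unchanged
    return length

def maxpool(length, size=8):
    return length // size

length = 216
length = conv1d_same(length)  # Conv1D, 256 filters, kernel 5
length = conv1d_same(length)  # Conv1D, 128 filters, kernel 5 (+ dropout)
length = maxpool(length)      # MaxPool1D(8): 216 -> 27
length = conv1d_same(length)  # Conv1D, 128 filters
length = conv1d_same(length)  # Conv1D, 128 filters
flat = length * 128           # flatten before the dense + softmax layers
print(length, flat)  # -> 27 3456
```

So the dense layer at the end sees a 3,456-dimensional vector per utterance.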

### 3. Serving

The application works in an offline fashion; that is, the full wave file has to be passed to the network for it to train/predict. It takes around 5 seconds to process a 10-second video. In the prediction phase, the audio file runs through the pre-processing operations and enters the de-noiser. The de-noiser clears out the noise in the signal and passes it to the Emotion Analyzer. For videos, we strip the audio from the video, process the audio, and reattach it to the video.

## Demonstration

Below are a few samples that were passed through the full network. The first two are video-based samples, whereas the last one is an audio-only example.

## Learning

- Dilated Convolution increases the receptive field without increasing the number of filters.
- Raw data can be directly fed to Neural Net without the use of spectrogram based or statistical signal processing methods.
- Pre-trained networks can be utilized as semantic loss instruments.

## Future Work

Based on the experience of working with the above architecture, we see the following directions for future work in this area.

- Further **improve the quality** of the de-noised output
- Migrate from batch to **streaming mode**
- **Integrate** with communication systems like **Skype/WhatsApp/Hangouts**
- Expand the existing model to work for multiple speakers

## References

### Repositories

- https://github.com/chaodengusc/DeWave
- https://github.com/zhr1201/deep-clustering
- https://github.com/Gurupradeep/Multi-Scale-Context-Aggregation-by-Dilated-Convolutions
- https://github.com/francoisgermain/SpeechDenoisingWithDeepFeatureLosses
- https://people.xiph.org/~jm/demo/rnnoise/
- https://datashare.is.ed.ac.uk/handle/10283/2791
- http://andrewowens.com/multisensory/
- https://www.endpoint.com/blog/2019/01/08/speech-recognition-with-tensorflow
- https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html

### Research papers

- https://arxiv.org/pdf/1807.05520.pdf
- https://arxiv.org/pdf/1806.10522.pdf
- https://arxiv.org/abs/1508.04306
- https://arxiv.org/pdf/1804.03641.pdf
- https://arxiv.org/abs/1804.03619