Given a video (frames + audio track), the objective is to isolate and enhance the speech of a desired speaker from a mixture of sounds such as other speakers and background noise.
- A multi-stream (audio + video) neural-network architecture is used, as shown above.
- The visual streams take as input thumbnails of detected faces in each frame of the video, and the audio stream takes as input the video's soundtrack, which contains a mixture of speech and background noise.
- The visual streams extract a face embedding for each thumbnail using a pretrained face recognition model, then learn visual features using a dilated convolutional NN.
- The audio stream first computes the STFT of the input signal to obtain a spectrogram, and then learns an audio representation using a similar dilated convolutional NN.
- A joint audio-visual representation is then created by concatenating the learned visual and audio features, and is subsequently processed by a bidirectional LSTM and three fully connected layers.
- The network outputs a complex spectrogram mask for each speaker, which is multiplied by the spectrogram of the noisy input and converted back to the time domain, yielding an isolated speech waveform for each speaker.
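The masking-and-inversion step in the last bullet can be sketched with NumPy/SciPy. This is a toy illustration, not the paper's network: an oracle complex mask computed from a known clean signal stands in for the mask the model would predict, and the sine/noise signals are hypothetical stand-ins for speech.

```python
import numpy as np
from scipy.signal import stft, istft

# Toy signals standing in for a target speaker and background noise.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)        # "target speaker"
noise = 0.5 * np.random.randn(sr)          # background interference
mix = clean + noise

# STFT of the noisy mixture, as in the audio stream.
_, _, S_mix = stft(mix, fs=sr, nperseg=512)
_, _, S_clean = stft(clean, fs=sr, nperseg=512)

# The paper's network predicts a complex mask per speaker; here we
# use the oracle complex ratio mask purely for illustration.
mask = S_clean / (S_mix + 1e-8)

# Multiply the mask by the noisy spectrogram, then invert to a waveform.
S_est = mask * S_mix
_, est = istft(S_est, fs=sr, nperseg=512)
est = est[: len(mix)]
```

With the oracle mask the recovered waveform is essentially the clean signal; in the actual system the quality depends on how well the learned mask approximates it.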
Data set details
- AVSpeech dataset - A synthetic data set is generated as follows: videos consisting of a single speaker with a clear face and voice are extracted from YouTube, and this clean data is used to generate "synthetic cocktail parties" -- mixtures of face videos and their corresponding speech from separate video sources, along with non-speech background noise.
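A minimal sketch of how such a synthetic mixture might be assembled. The function name, noise gain, and toy waveforms below are all illustrative assumptions; the actual AVSpeech pipeline also involves face detection, filtering for clean speech, and more careful level handling.

```python
import numpy as np

def mix_cocktail(speech_a, speech_b, noise, noise_gain=0.3):
    """Combine two clean single-speaker tracks with background noise.

    Illustrative stand-in for the paper's "synthetic cocktail party"
    generation; returns the mixture plus each clean track as targets.
    """
    n = min(len(speech_a), len(speech_b), len(noise))
    mixture = speech_a[:n] + speech_b[:n] + noise_gain * noise[:n]
    return mixture, speech_a[:n], speech_b[:n]

# Toy waveforms standing in for audio extracted from YouTube videos.
rng = np.random.default_rng(0)
n = 8000
a = np.sin(2 * np.pi * 220 * np.arange(n) / n)
b = np.sin(2 * np.pi * 330 * np.arange(n) / n)
nz = rng.standard_normal(n)
mix, tgt_a, tgt_b = mix_cocktail(a, b, nz)
```

Because the clean tracks are known, they serve directly as ground-truth training targets for the separation network.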
A unique aspect of the technique is that it combines both the auditory and visual signals of an input video to separate the speech. Intuitively, the movements of a person's mouth, for example, should correlate with the sounds produced as that person speaks, which in turn helps identify which parts of the audio correspond to that person.