Problem Definition

Given a video (frames + audio track), the objective is to isolate and enhance the speech of a desired speaker from a mixture of sounds such as other speakers and background noise.
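The separation objective can be stated as a simple mixture model; this is a standard formulation, not an equation quoted from the original paper:

```latex
% The observed soundtrack x(t) is the sum of N individual speech
% signals s_i(t) plus non-speech background noise n(t); the goal is
% to estimate each s_i(t) given x(t) and the video frames.
x(t) = \sum_{i=1}^{N} s_i(t) + n(t)
```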



  • A multi-stream (audio + video) neural-network-based architecture is used, as shown above.
  • The visual streams take as input thumbnails of detected faces in each frame of the video, and the audio stream takes as input the video’s soundtrack, containing a mixture of speech and background noise.
  • The visual streams extract a face embedding for each thumbnail using a pretrained face recognition model, then learn visual features with a dilated convolutional NN.
  • The audio stream first computes the STFT of the input signal to obtain a spectrogram, and then learns an audio representation using a similar dilated convolutional NN.
  • A joint, audio-visual representation is then created by concatenating the learned visual and audio features, and is subsequently further processed using a bidirectional LSTM and three fully connected layers.
  • The network outputs a complex spectrogram mask for each speaker, which is multiplied by the noisy input, and converted back to waveforms to obtain an isolated speech signal for each speaker.
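The final masking step described above can be sketched with NumPy/SciPy. This is a minimal illustration, not the paper's implementation: `predicted_mask` stands in for the network's output, and an identity mask is used so the round trip simply reconstructs the input.

```python
import numpy as np
from scipy.signal import stft, istft

sr = 16000
t = np.arange(sr) / sr
# Toy noisy mixture: a tone plus Gaussian noise, standing in for the soundtrack.
mixture = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(sr)

# Forward STFT: complex spectrogram of the noisy input.
_, _, mix_spec = stft(mixture, fs=sr, nperseg=512)

# Stand-in for the network's predicted complex spectrogram mask for one
# speaker (here the identity mask, so the output equals the mixture).
predicted_mask = np.ones_like(mix_spec)

# Element-wise multiplication of the mask with the noisy spectrogram.
separated_spec = predicted_mask * mix_spec

# Inverse STFT converts the masked spectrogram back to a waveform.
_, separated = istft(separated_spec, fs=sr, nperseg=512)
```

With a real mask, `separated` would contain only the target speaker's speech; one such mask (and waveform) is produced per visible speaker.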


Dataset details

  • AVSpeech dataset - A synthetic training set is generated as follows: videos consisting of a single speaker with a clearly visible face and clean voice are extracted from YouTube, and this clean data is used to generate “synthetic cocktail parties” -- mixtures of face videos and their corresponding speech from separate video sources, along with non-speech background noise.
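The mixing step can be sketched as follows. All signals here are synthetic placeholders (sine tones and Gaussian noise), not actual AVSpeech clips; the point is only that the mixture is the training input while the clean sources remain available as targets.

```python
import numpy as np

sr = 16000                      # sample rate in Hz
rng = np.random.default_rng(0)
t = np.arange(3 * sr) / sr      # 3 seconds of audio

speaker_a = 0.5 * np.sin(2 * np.pi * 220 * t)  # stand-in for clean speech A
speaker_b = 0.5 * np.sin(2 * np.pi * 330 * t)  # stand-in for clean speech B
noise = 0.1 * rng.standard_normal(t.size)      # non-speech background noise

# Training input: the synthetic "cocktail party" mixture.
# Training targets: the clean per-speaker signals speaker_a and speaker_b.
mixture = speaker_a + speaker_b + noise
```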


A unique aspect of the technique is that it combines both the auditory and visual signals of an input video to separate the speech. Intuitively, the movements of a person’s mouth, for example, should correlate with the sounds that person produces while speaking, which in turn helps identify which parts of the audio correspond to that person.