nbss - Xiaofei Li Lab

Narrow-band Deep Speech Separation

Abstract

[1] Changsheng Quan, Xiaofei Li. Multi-channel Narrow-band Deep Speech Separation with Full-band Permutation Invariant Training. In ICASSP 2022. [Code], [Pdf], [Examples]

This paper addresses the problem of multi-channel multi-speech separation based on deep learning techniques. In the short-time Fourier transform (STFT) domain, we propose an end-to-end narrow-band network that directly takes as input the multi-channel mixture signals of one frequency, and outputs the separated signals of this frequency. Within a narrow band, the spatial information (or inter-channel difference) can well discriminate between speakers at different positions. This information is intensively used in many narrow-band speech separation methods, such as beamforming and clustering of spatial vectors. The proposed network is trained to learn a rule to automatically exploit this information and perform speech separation. Such a rule should be valid for any frequency, hence the network is shared by all frequencies. In addition, a full-band permutation invariant training criterion is proposed to solve the frequency permutation problem encountered by most narrow-band methods. Experiments show that, by focusing on deeply learning the narrow-band information, the proposed method outperforms the oracle beamforming method and the state-of-the-art deep learning based method.
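The full-band permutation invariant training idea can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function name `full_band_pit_loss`, the `[S, F, T]` array layout and the mean-squared-error loss on STFT coefficients are assumptions made for illustration only. What it shows is that a single speaker permutation is selected for the whole utterance, so the output order is tied across all frequencies instead of being chosen per frequency.

```python
import itertools
import numpy as np

def full_band_pit_loss(est, ref):
    """Toy full-band PIT loss (illustrative assumption, not the paper's exact loss).

    est, ref: complex arrays of shape [S, F, T] (speakers, frequencies, frames).
    Returns (loss, best_perm), where best_perm[i] is the index of the estimate
    matched to reference speaker i for ALL frequencies at once.
    """
    num_spk = est.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_spk)):
        # Error accumulated over every speaker, frequency and frame, so the
        # permutation decision is made once for the full band.
        loss = np.mean(np.abs(est[list(perm)] - ref) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Usage: the estimates are the references in swapped speaker order.
rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 4, 10)) + 1j * rng.standard_normal((2, 4, 10))
est = ref[[1, 0]]
loss, perm = full_band_pit_loss(est, ref)
```

Because the permutation is shared by all frequencies, the separated narrow-band outputs of one speaker always land in the same output slot, which is what avoids the frequency permutation problem mentioned above.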

Abstract

[2] Changsheng Quan, Xiaofei Li. Multichannel Speech Separation with Narrow-band Conformer. In Interspeech 2022. [Code], [Pdf]

This work proposes a multichannel speech separation method with narrow-band Conformer (named NBC). The network is trained to learn to automatically exploit narrow-band speech separation information, such as spatial vector clustering of multiple speakers. Specifically, in the short-time Fourier transform (STFT) domain, the network processes each frequency independently, and is shared by all frequencies. For one frequency, the network inputs the STFT coefficients of multichannel mixture signals, and predicts the STFT coefficients of separated speech signals. Clustering of spatial vectors shares a similar principle with the self-attention mechanism in the sense of computing the similarity of vectors and then aggregating similar vectors. Therefore, Conformer would be especially suitable for the present problem. Experiments show that the proposed narrow-band Conformer achieves better speech separation performance than other state-of-the-art methods by a large margin.
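As a rough sketch of the narrow-band input/output convention described above, the snippet below treats the frequency axis as a batch axis and applies the same parameters to every frequency. The single linear layer is only a stand-in for the Conformer, and the function name `narrow_band_forward`, the `[F, T, C]` layout and the real/imaginary stacking are assumptions for illustration.

```python
import numpy as np

def narrow_band_forward(mix_stft, weight, bias):
    """Apply one shared (toy) layer to each frequency independently.

    mix_stft: complex [F, T, C] STFT of the C-channel mixture.
    weight:   real [2C, 2S], bias: real [2S] (hypothetical shared parameters).
    Returns complex [F, T, S]: STFT coefficients for S speakers.
    """
    # Each frequency becomes a sequence of T real vectors of size 2C.
    feats = np.concatenate([mix_stft.real, mix_stft.imag], axis=-1)  # [F, T, 2C]
    # The same weights are used for every frequency (frequency = batch axis).
    out = feats @ weight + bias                                      # [F, T, 2S]
    num_spk = out.shape[-1] // 2
    # First half is the real part, second half the imaginary part.
    return out[..., :num_spk] + 1j * out[..., num_spk:]

# Usage with random data: 5 frequencies, 7 frames, 8 channels, 2 speakers.
rng = np.random.default_rng(0)
F, T, C, S = 5, 7, 8, 2
X = rng.standard_normal((F, T, C)) + 1j * rng.standard_normal((F, T, C))
W = rng.standard_normal((2 * C, 2 * S))
b = rng.standard_normal(2 * S)
Y = narrow_band_forward(X, W, b)
```

Processing one frequency on its own gives the same result as the corresponding slice of the full output, which is the sense in which frequencies are handled independently by a shared network.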

Abstract

[3] Changsheng Quan, Xiaofei Li. NBC2: Multichannel Speech Separation with Revised Narrow-band Conformer. arXiv preprint arXiv:2212.02076. [Code], [Pdf], [Examples]

This work proposes a multichannel narrow-band speech separation network. In the short-time Fourier transform (STFT) domain, the proposed network processes each frequency independently, and all frequencies use a shared network. For each frequency, the network performs end-to-end speech separation, namely taking as input the STFT coefficients of microphone signals, and predicting the separated STFT coefficients of multiple speakers. The proposed network learns to cluster the frame-wise spatial/steering vectors that belong to different speakers. It is mainly composed of three components. First, a self-attention network. Clustering of spatial vectors shares a similar principle with the self-attention mechanism in the sense of computing the similarity of vectors and then aggregating similar vectors. Second, a convolutional feed-forward network. The convolutional layers are employed for signal smoothing and reverberation processing. Third, a novel hidden-layer normalization method, i.e. group batch normalization (GBN), is especially designed for the proposed narrow-band network to maintain the distribution of hidden units over frequencies. Overall, the proposed network is named NBC2, as it is a revised version of our previous NBC (narrow-band conformer) network. Experiments show that 1) the proposed network outperforms other state-of-the-art methods by a large margin, 2) the proposed GBN improves the signal-to-distortion ratio by 3 dB, relative to other normalization methods, such as batch/layer/group normalization, 3) the proposed narrow-band network is spectrum-agnostic, as it does not learn spectral patterns, and 4) the proposed network is indeed performing frame clustering (demonstrated by the attention maps).
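The analogy between self-attention and clustering can be sketched in a few lines of NumPy. This is a simplified illustration, not the NBC2 layer: learned query/key/value projections and multiple heads are omitted, and the function name `self_attention`, the toy cluster centers and the frame labels are all hypothetical. Each frame computes its similarity to every other frame, then aggregates the frames it is most similar to.

```python
import numpy as np

def self_attention(frames):
    """Scaled dot-product self-attention without learned projections.

    frames: real [T, D] frame-wise vectors of one frequency.
    Returns (output [T, D], attn [T, T]).
    """
    dim = frames.shape[-1]
    scores = frames @ frames.T / np.sqrt(dim)      # pairwise similarity
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over frames
    return attn @ frames, attn                     # aggregate similar frames

# Toy example: frames drawn around two well-separated "speaker" vectors.
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0], [0.0, 4.0]])
labels = np.array([0, 1, 0, 1, 1, 0])
frames = centers[labels] + 0.1 * rng.standard_normal((6, 2))
output, attn = self_attention(frames)

# Attention mass each frame puts on frames with the same label.
same = labels[:, None] == labels[None, :]
same_mass = (attn * same).sum(axis=1)
```

In this toy setting almost all of the attention mass of a frame falls on frames from the same cluster, so the attention output behaves like a soft cluster average.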

Examples

Please open this page with Edge or Chrome rather than Firefox, as audio playback is problematic in Firefox.

mix spk1 spk2

Full Overlapped

Id	Mix	Channels	FaSNet-TAC [4]	Beam-Guided TasNet [5] (iter=2)	SepFormer [6]	NB-BLSTM [1] (prop.)	NBC2 [3] (prop.), small	NBC2 [3] (prop.), large
1		1
		2
		4
		8
2		1
		2
		4
		8
3		1
		2
		4
		8
4		1
		2
		4
		8
5		1
		2
		4
		8
6		1
		2
		4
		8

Head-tail Overlapped

Id	Mix	Channels	FaSNet-TAC [4]	Beam-Guided TasNet [5] (iter=2)	SepFormer [6]	NB-BLSTM [1] (prop.)	NBC2 [3] (prop.), small	NBC2 [3] (prop.), large
1		1
		2
		4
		8
2		1
		2
		4
		8
3		1
		2
		4
		8
4		1
		2
		4
		8
5		1
		2
		4
		8
6		1
		2
		4
		8

Mid Overlapped

Id	Mix	Channels	FaSNet-TAC [4]	Beam-Guided TasNet [5] (iter=2)	SepFormer [6]	NB-BLSTM [1] (prop.)	NBC2 [3] (prop.), small	NBC2 [3] (prop.), large
1		1
		2
		4
		8
2		1
		2
		4
		8
3		1
		2
		4
		8
4		1
		2
		4
		8
5		1
		2
		4
		8
6		1
		2
		4
		8

Start-or-end Overlapped

Id	Mix	Channels	FaSNet-TAC [4]	Beam-Guided TasNet [5] (iter=2)	SepFormer [6]	NB-BLSTM [1] (prop.)	NBC2 [3] (prop.), small	NBC2 [3] (prop.), large
1		1
		2
		4
		8
2		1
		2
		4
		8
3		1
		2
		4
		8
4		1
		2
		4
		8
5		1
		2
		4
		8
6		1
		2
		4
		8

Source Code

These works are open-sourced on GitHub; see [Code]. If you find them useful and would like to cite us, please use:

@inproceedings{quan_multi-channel_2022,
    title = {Multi-channel {Narrow}-band {Deep} {Speech} {Separation} with {Full}-band {Permutation} {Invariant} {Training}},
    booktitle = {{ICASSP}},
    author = {Quan, Changsheng and Li, Xiaofei},
    year = {2022},
}

@inproceedings{quan_NBC_2022,
    title = {Multichannel {Speech} {Separation} with {Narrow}-band {Conformer}},
    booktitle = {Interspeech},
    author = {Quan, Changsheng and Li, Xiaofei},
    year = {2022},
}

and

@article{quan_NBC2_2022,
    title = {NBC2: Multichannel Speech Separation with Revised Narrow-band Conformer},
    journal = {arXiv preprint arXiv:2212.02076},
    author = {Quan, Changsheng and Li, Xiaofei},
    year = {2022},
}

References

[1] Changsheng Quan, Xiaofei Li. Multi-channel Narrow-band Deep Speech Separation with Full-band Permutation Invariant Training. In ICASSP 2022.
[2] Changsheng Quan, Xiaofei Li. Multichannel Speech Separation with Narrow-band Conformer. In Interspeech 2022.
[3] Changsheng Quan, Xiaofei Li. NBC2: Multichannel Speech Separation with Revised Narrow-band Conformer. arXiv preprint arXiv:2212.02076.
[4] Yi Luo, Zhuo Chen, Nima Mesgarani, and Takuya Yoshioka. End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation. In ICASSP 2020.
[5] H. Chen, Y. Yang, F. Dang, and P. Zhang. Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output. In Interspeech 2022.
[6] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong. Attention Is All You Need In Speech Separation. In ICASSP 2021.