Publications - Xiaofei Li Lab

Github Page

https://github.com/Audio-WestlakeU

Preprints

VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification [pdf] [research page] [code]
Pengyu Wang, Ying Fang, Xiaofei Li
LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction [pdf] [code]
Di Liang, Xiaofei Li
Mel-FullSubNet: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR [pdf] [research page]
Rui Zhou, Xian Li, Ying Fang, Xiaofei Li
Narrow-band Deep Filtering for Multichannel Speech Enhancement [pdf] [research page] [code]
Xiaofei Li, Radu Horaud

2025

Mamba for Streaming ASR Combined with Unimodal Aggregation [pdf] [code]
Ying Fang, Xiaofei Li
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025.

2024

RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization [pdf] [code]
Bing Yang, Changsheng Quan, Yabo Wang, Pengyu Wang, Yujie Yang, Ying Fang, Nian Shao, Hui Bu, Xin Xu, Xiaofei Li
NeurIPS 2024.
Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer [pdf] [code]
Bing Yang, Xiaofei Li
IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers [pdf] [code]
Changsheng Quan, Xiaofei Li
IEEE Signal Processing Letters.
SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation [pdf] [code]
Changsheng Quan, Xiaofei Li
IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks [pdf] [code]
Xian Li, Nian Shao, Xiaofei Li
IEEE/ACM Transactions on Audio, Speech, and Language Processing.
RVAE-EM: Generative speech dereverberation based on recurrent variational auto-encoder and convolutive transfer function [pdf] [code]
Pengyu Wang, Xiaofei Li
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024.
Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors [pdf] [code]
Di Liang, Nian Shao, Xiaofei Li
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024.
Fine-tune the pretrained ATST model for sound event detection [pdf] [code]
Nian Shao, Xian Li, Xiaofei Li
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024.
Unimodal Aggregation for CTC-based Speech Recognition [pdf] [code]
Ying Fang, Xiaofei Li
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024.
IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization [pdf] [code]
Yabo Wang, Bing Yang, Xiaofei Li
IEEE/ACM Transactions on Audio, Speech, and Language Processing.

2023

McNet: Fuse Multiple Cues for Multichannel Speech Enhancement [pdf] [research page] [code]
Yujie Yang, Changsheng Quan, Xiaofei Li
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023.
Speech Dereverberation with a Reverberation Time Shortening Target [pdf] [code]
Rui Zhou, Wenye Zhu, Xiaofei Li
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023.
DVQVC: An Unsupervised Zero-Shot Voice Conversion Framework [pdf] [demo]
Dayong Li, Xian Li, Xiaofei Li
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023.
FN-SSL: Full-Band and Narrow-Band Fusion for Sound Source Localization [pdf] [code]
Yabo Wang, Bing Yang, Xiaofei Li
Interspeech 2023.

2022

Multichannel Speech Separation with Narrow-band Conformer [pdf] [research page] [code]
Changsheng Quan, Xiaofei Li
Interspeech, 2022.
ATST: Audio Representation Learning with Teacher-Student Transformer [pdf] [code]
Xian Li, Xiaofei Li
Interspeech, 2022.
RCT: Random Consistency Training for Semi-supervised Sound Event Detection [pdf] [code]
Nian Shao, Erfan Loweimi, Xiaofei Li
Interspeech, 2022.
Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation
Feifei Xiong, Weiguang Chen, Pengyu Wang, Xiaofei Li and Jinwei Feng
Interspeech, 2022.
Multi-channel Narrow-band Deep Speech Separation with Full-band Permutation Invariant Training [pdf] [research page] [code]
Changsheng Quan, Xiaofei Li
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022.
SRP-DNN: Learning Direct-path Phase Difference For Multiple Moving Sound Source Localization [pdf] [code]
Bing Yang, Hong Liu, Xiaofei Li
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022.
Connecting the Dots in Self-Supervised Learning: A Brief Survey for Beginners [pdf]
Pengfei Fang, Xian Li, Yang Yan, Shuai Zhang, Qiyue Kang, Xiaofei Li, Zhenzhong Lan
Journal of Computer Science and Technology, 37 (3), pp. 507–526, May 2022.

2021

Learning Deep Direct-Path Relative Transfer Function for Binaural Sound Source Localization [pdf] [code]
Bing Yang, Hong Liu, Xiaofei Li
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, pp. 3491–3503, 2021.
Microphone Array Generalization for Multichannel Narrowband Deep Speech Enhancement [pdf] [code]
Siyuan Zhang and Xiaofei Li
Interspeech, 2021.
AcousticFusion: Fusing Sound Source Localization to Visual SLAM in dynamic environments [pdf]
Tianwei Zhang, Huayan Zhang, Xiaofei Li, Junfeng Chen, Tin Lun Lam, Sethu Vijayakumar
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021.
FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement [pdf] [research page] [code]
Xiang Hao, Xiangdong Su, Radu Horaud, Xiaofei Li
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021.
Supervised Direct-Path Relative Transfer Function Learning for Binaural Sound Source Localization [pdf]
Bing Yang, Xiaofei Li, Hong Liu
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021.
Enhancing Direct-path Relative Transfer Function using Deep Neural Network for Robust Sound Source Localization [pdf]
Bing Yang, Runwei Ding, Yutong Ban, Xiaofei Li, Hong Liu
CAAI Transactions on Intelligence Technology, 2021.

2020

A Covert Ultrasonic Phone-to-Phone Communication Scheme
Liming Shi, Limin Yu, Kaizhu Huang, Xu Zhu, Zhi Wang, Xiaofei Li, Wenwu Wang, Xinheng Wang
International Conference on Collaborative Computing: Networking, Applications and Worksharing, pp. 36-48, 2020.
Online Monaural Speech Enhancement Using Delayed Subband LSTM [pdf] [research page] [audio examples]
Xiaofei Li and Radu Horaud
Interspeech 2020.
Sub-band Knowledge Distillation Framework for Speech Enhancement [pdf]
Xiang Hao, Shixue Wen, Xiangdong Su, Yun Liu, Guanglai Gao, Xiaofei Li
Interspeech 2020.

2019

Multichannel Online Dereverberation based on Spectral Magnitude Inverse Filtering [pdf] [audio examples] [matlab code]
Xiaofei Li, Laurent Girin, Sharon Gannot, Radu Horaud
IEEE/ACM Transactions on Audio, Speech and Language Processing, 27 (9), pp. 1365 – 1377, 2019.
Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments [pdf] [research page] [matlab code]
Xiaofei Li, Yutong Ban, Laurent Girin, Xavier Alameda-Pineda, Radu Horaud
IEEE Journal of Selected Topics in Signal Processing, 13 (1), pp. 88 – 103, 2019.
Multichannel Speech Separation and Enhancement Using the Convolutive Transfer Function [pdf] [matlab code]
Xiaofei Li, Laurent Girin, Sharon Gannot, Radu Horaud
IEEE/ACM Transactions on Audio, Speech and Language Processing, 27 (3), pp. 645 – 659, 2019.
Audio-noise Power Spectral Density Estimation Using Long Short-term Memory [pdf] [test python code and data]
Xiaofei Li, Simon Leglaive, Laurent Girin, Radu Horaud
IEEE Signal Processing Letters, 26 (6), pp. 918 – 922, 2019.
Expectation-Maximization for Speech Source Separation using Convolutive Transfer Function [pdf] [matlab code]
Xiaofei Li, Laurent Girin, Radu Horaud
CAAI Transactions on Intelligence Technology, 4 (1), pp. 47 – 53, 2019.
Multiple Sound Source Counting and Localization Based on TF-Wise Spatial Spectrum Clustering
Bing Yang, Hong Liu, Cheng Pang, Xiaofei Li
IEEE/ACM Transactions on Audio, Speech and Language Processing, 27 (8), pp. 1241 – 1255, 2019.
Multitask Learning of Time-Frequency CNN for Sound Source Localization [pdf]
Cheng Pang, Hong Liu, Xiaofei Li
IEEE Access, vol. 7, pp. 40725 – 40737, 2019.
Multichannel Speech Enhancement Based on Time-frequency Masking Using Subband Long Short-Term Memory [pdf] [audio examples] [code]
Xiaofei Li and Radu Horaud
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct 2019, New Paltz, NY, United States.
Audio-Visual Variational Fusion for Multi-Person Tracking with Robots [pdf]
Xavier Alameda-Pineda, Soraya Arias, Yutong Ban, Guillaume Delorme, Laurent Girin, Radu Horaud, Xiaofei Li, Bastien Mourgue, Guillaume Sarrazin
ACMMM 2019 – 27th ACM International Conference on Multimedia, Oct 2019, Nice, France, pp. 1059-1061.

2018

Audio source separation into the wild [pdf]
Laurent Girin, Sharon Gannot, Xiaofei Li
Multimodal Behavior Analysis in the Wild, Academic Press (Elsevier), Computer Vision and Pattern Recognition, DOI: 10.1016/B978-0-12-814601-9.00022-5, pp. 53-78, 2018.
Multichannel Identification and Nonnegative Equalization for Dereverberation and Noise Reduction based on Convolutive Transfer Function [pdf] [audio examples] [matlab code]
Xiaofei Li, Sharon Gannot, Laurent Girin, Radu Horaud
IEEE/ACM Transactions on Audio, Speech and Language Processing, 26 (10), pp. 1755 – 1768, 2018.
Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion [research page]
Israel D. Gebru, Silèye Ba, Xiaofei Li, Radu Horaud
IEEE Transactions on Pattern Analysis and Machine Intelligence, 40 (5), pp. 1086 – 1099, 2018.
Online Localization of Multiple Moving Speakers in Reverberant Environments [pdf] [matlab code]
Xiaofei Li, Bastien Mourgue, Laurent Girin, Sharon Gannot and Radu Horaud
IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM), July 2018, Sheffield, UK.
Multisource MINT Using the Convolutive Transfer Function [pdf] [matlab code]
Xiaofei Li, Sharon Gannot, Laurent Girin, Radu Horaud
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2018, Calgary, Canada.
A Cascaded Multiple-Speaker Localization and Tracking System [pdf] [research page] [matlab code]
Xiaofei Li, Yutong Ban, Laurent Girin, Xavier Alameda-Pineda, Radu Horaud
Proceedings of the LOCATA Challenge Workshop – a satellite event of IWAENC 2018, Sep 2018, Tokyo, Japan. pp.1-5.
Accounting for Room Acoustics in Audio-Visual Multi-Speaker Tracking
Yutong Ban, Xiaofei Li, Xavier Alameda-Pineda, Laurent Girin, Radu Horaud
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2018, Calgary, Canada.

2017

Multiple-Speaker Localization Based on Direct-Path Features and Likelihood Maximization with Spatial Sparsity Regularization [pdf] [research page]
Xiaofei Li, Laurent Girin, Radu Horaud and Sharon Gannot
IEEE/ACM Transactions on Audio, Speech and Language Processing, 25 (10), pp. 1997 – 2012, 2017.
Binaural Sound Localization Based on Reverberation Weighting and Generalized Parametric Mapping
Cheng Pang, Hong Liu, Jie Zhang and Xiaofei Li
IEEE/ACM Transactions on Audio, Speech and Language Processing, 25 (8), pp. 1618 – 1632, 2017.
An EM algorithm for audio source separation based on the convolutive transfer function [pdf] [matlab code]
Xiaofei Li, Laurent Girin, Radu Horaud
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct 2017, New Paltz, NY, United States.
Audio Source Separation based on Convolutive Transfer Function and Frequency-Domain Lasso Optimization [pdf] [matlab code]
Xiaofei Li, Laurent Girin, Radu Horaud
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 2017, New Orleans, United States.

2016

Estimation of the Direct-Path Relative Transfer Function for Supervised Sound-Source Localization [pdf] [matlab code] [research page]
Xiaofei Li, Laurent Girin, Radu Horaud, Sharon Gannot
IEEE/ACM Transactions on Audio, Speech and Language Processing, 2016, 24 (11), pp. 2171 – 2186.
A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision Fusion [pdf]
Pingping Wu, Hong Liu, Xiaofei Li, Ting Fan, Xuewu Zhang
IEEE Transactions on Multimedia 18(3), pp. 326-338, 2016.
Reverberant Sound Localization with a Robot Head Based on Direct-Path Relative Transfer Function [pdf]
Xiaofei Li, Laurent Girin, Fabien Badeig, Radu Horaud
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2016, Daejeon, South Korea.
Voice Activity Detection Based on Statistical Likelihood Ratio With Adaptive Thresholding [pdf]
Xiaofei Li, Radu Horaud, Laurent Girin, Sharon Gannot
International Workshop on Acoustic Signal Enhancement (IWAENC), Sep 2016, Xi’an, China.
Non-Stationary Noise Power Spectral Density Estimation Based on Regional Statistics [pdf] [matlab code]
Xiaofei Li, Laurent Girin, Sharon Gannot, Radu Horaud
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 2016, Shanghai, China.

2015

A Distributed Architecture for Interacting with NAO [pdf]
Fabien Badeig, Quentin Pelorson, Soraya Arias, Vincent Drouard, Israel Dejene Gebru, Xiaofei Li, Georgios Evangelidis, Radu Horaud
International Conference on Multimodal Interaction (ICMI), Nov 2015, Seattle, WA, United States.
Local Relative Transfer Function for Sound Source Localization [pdf]
Xiaofei Li, Radu Horaud, Laurent Girin, Sharon Gannot
The European Signal Processing Conference (Eusipco), Aug 2015, Nice, France.
Estimation of Relative Transfer Function in the Presence of Stationary Noise Based on Segmental Power Spectral Density Matrix Subtraction [pdf] [matlab code]
Xiaofei Li, Laurent Girin, Radu Horaud, Sharon Gannot
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2015, Brisbane, Australia.

Before 2015

Sound Source Localization for HRI Using FOC-based Time Difference Feature and Spatial Grid Matching [pdf]
Xiaofei Li and Hong Liu
IEEE Transactions on Cybernetics, 43 (4), pp. 1199-1212, 2013.
Real-time Sound Source Localization for Mobile Robot Based on Guided Spectral-Temporal Position Method [pdf]
Xiaofei Li, Miao Shen, Wenmin Wang and Hong Liu
International Journal of Advanced Robotic Systems, 2012, vol.9, 78:2012.
A survey of sound source localization for robot audition
Xiaofei Li and Hong Liu
CAAI Transactions on Intelligent Systems, 7 (1), pp. 9-20, 2012. (in Chinese)
A Two-Layer Probabilistic Model Based on Time-Delay Compensation for Binaural Sound Localization [pdf]
Hong Liu, Zhuo Fu and Xiaofei Li
IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, May 6-10, 2013.
Time Delay Estimation for Speech Signal Based on FOC-Spectrum [pdf]
Hong Liu and Xiaofei Li
Interspeech, Portland, Oregon, USA, pp. 1732-1735, 2012.
Sound Source Localization for Human-Robot Interaction Based on Spatial Distribution of Time Difference Feature and Grid Matching [pdf]
Xiaofei Li, Hong Liu and Xuesong Yang
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011.
A Selection Method of Speech Vocabulary for Human-Robot Speech Interaction [pdf]
Hong Liu and Xiaofei Li
IEEE International Conference on Systems, Man and Cybernetics (SMC), Istanbul, Turkey, pp. 2243-2248, 2010.