Zexu Pan

  • Biography

    Zexu joined MERL after completing his Ph.D. at the National University of Singapore in 2023. His research interests are artificial intelligence, deep learning, and their applications in speech processing and beyond, including multi-modal speech enhancement, speaker extraction, speaker diarization, robust automatic speech recognition, multi-modal representation learning, and auditory attention detection.

  • Recent News & Events

    •  EVENT    MERL Contributes to ICASSP 2024
      Date: Sunday, April 14, 2024 - Friday, April 19, 2024
      Location: Seoul, South Korea
      MERL Contacts: Petros T. Boufounos; François Germain; Chiori Hori; Sameer Khurana; Toshiaki Koike-Akino; Jonathan Le Roux; Hassan Mansour; Zexu Pan; Kieran Parsons; Joshua Rapp; Anthony Vetro; Pu (Perry) Wang; Gordon Wichern; Ryoma Yataka
      Research Areas: Artificial Intelligence, Computational Sensing, Machine Learning, Robotics, Signal Processing, Speech & Audio
      Brief
      • MERL has made numerous contributions to both the organization and technical program of ICASSP 2024, which is being held in Seoul, South Korea, from April 14-19, 2024.

        Sponsorship and Awards

        MERL is proud to be a Bronze Patron of the conference and will participate in the student job fair on Thursday, April 18. Please join this session to learn more about employment opportunities at MERL, including openings for research scientists, post-docs, and interns.

        MERL is pleased to be the sponsor of two IEEE Awards that will be presented at the conference. We congratulate Prof. Stéphane G. Mallat, the recipient of the 2024 IEEE Fourier Award for Signal Processing, and Prof. Keiichi Tokuda, the recipient of the 2024 IEEE James L. Flanagan Speech and Audio Processing Award.

        Jonathan Le Roux, MERL Speech and Audio Senior Team Leader, will also be recognized during the Awards Ceremony for his recent elevation to IEEE Fellow.

        Technical Program

        MERL will present 13 papers in the main conference on a wide range of topics including automated audio captioning, speech separation, audio generative models, speech and sound synthesis, spatial audio reproduction, multimodal indoor monitoring, radar imaging, depth estimation, physics-informed machine learning, and integrated sensing and communications (ISAC). Three workshop papers have also been accepted for presentation on audio-visual speaker diarization, music source separation, and music generative models.

        Perry Wang is the co-organizer of the Workshop on Signal Processing and Machine Learning Advances in Automotive Radars (SPLAR), held on Sunday, April 14. It features keynote talks from leaders in both academia and industry, peer-reviewed workshop papers, and lightning talks from ICASSP regular tracks on signal processing and machine learning for automotive radar and, more generally, radar perception.

        Gordon Wichern will present an invited keynote talk on analyzing and interpreting audio deep learning models at the Workshop on Explainable Machine Learning for Speech and Audio (XAI-SA), held on Monday, April 15. He will also appear in a panel discussion on interpretable audio AI at the workshop.

        Perry Wang also co-organizes a two-part special session on Next-Generation Wi-Fi Sensing (SS-L9 and SS-L13), which will be held on Thursday afternoon, April 18. The special session includes papers on PHY-layer oriented signal processing and data-driven deep learning advances, and supports the upcoming IEEE 802.11bf WLAN Sensing standardization activities.

        Petros Boufounos is participating as a mentor in ICASSP’s Micro-Mentoring Experience Program (MiME).

        About ICASSP

        ICASSP is the flagship conference of the IEEE Signal Processing Society and the world's largest and most comprehensive technical conference focused on research advances and the latest technological developments in signal and information processing. The event attracts more than 3,000 participants.
  • Awards

    •  AWARD    MERL team wins the Audio-Visual Speech Enhancement (AVSE) 2023 Challenge
      Date: December 16, 2023
      Awarded to: Zexu Pan, Gordon Wichern, Yoshiki Masuyama, François Germain, Sameer Khurana, Chiori Hori, and Jonathan Le Roux
      MERL Contacts: François Germain; Chiori Hori; Sameer Khurana; Jonathan Le Roux; Zexu Pan; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL's Speech & Audio team ranked 1st out of 12 teams in the 2nd COG-MHEAR Audio-Visual Speech Enhancement Challenge (AVSE). The team was led by Zexu Pan, and also included Gordon Wichern, Yoshiki Masuyama, François Germain, Sameer Khurana, Chiori Hori, and Jonathan Le Roux.

        The AVSE challenge aims to design better speech enhancement systems by harnessing the visual aspects of speech (such as lip movements and gestures) in a manner similar to the brain’s multi-modal integration strategies. MERL’s system was a scenario-aware audio-visual TF-GridNet that incorporates the face recording of a target speaker as a conditioning factor and also recognizes whether the predominant interference signal is speech or background noise. In addition to outperforming all competing systems in terms of objective metrics by a wide margin, in a listening test, MERL’s model achieved the best overall word intelligibility score of 84.54%, compared to 57.56% for the baseline and 80.41% for the next best team. The Fisher’s least significant difference (LSD) was 2.14%, indicating that our model offered statistically significant speech intelligibility improvements compared to all other systems.
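        As a back-of-the-envelope illustration of the LSD criterion above (not part of the official challenge evaluation; the score labels are ours), a pairwise difference in mean intelligibility scores is declared significant when it exceeds the LSD threshold:

        ```python
        # Fisher's LSD check on the listening-test scores reported above.
        # Values are word intelligibility percentages from the AVSE 2023 results.
        LSD = 2.14  # least significant difference, in percentage points

        scores = {
            "MERL": 84.54,
            "next best team": 80.41,
            "baseline": 57.56,
        }

        def significantly_better(a: float, b: float, lsd: float = LSD) -> bool:
            """True if score a exceeds score b by more than the LSD threshold."""
            return (a - b) > lsd

        for name in ("next best team", "baseline"):
            diff = scores["MERL"] - scores[name]
            print(f"vs {name}: +{diff:.2f} pts, significant: {significantly_better(scores['MERL'], scores[name])}")
        ```

        Both gaps (4.13 and 26.98 points) exceed the 2.14-point threshold, which is what the brief means by "statistically significant improvements compared to all other systems."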
  • Research Highlights

  • MERL Publications

    •  Pan, Z., Wichern, G., Germain, F.G., Subramanian, A., Le Roux, J., "Late Audio-Visual Fusion for In-The-Wild Speaker Diarization", Hands-free Speech Communication and Microphone Arrays (HSCMA), April 2024.
      BibTeX TR2024-029 PDF
      @inproceedings{Pan2024apr,
        author = {Pan, Zexu and Wichern, Gordon and Germain, François G and Subramanian, Aswin and Le Roux, Jonathan},
        title = {Late Audio-Visual Fusion for In-The-Wild Speaker Diarization},
        booktitle = {Hands-free Speech Communication and Microphone Arrays (HSCMA)},
        year = 2024,
        month = apr,
        url = {https://www.merl.com/publications/TR2024-029}
      }
    •  Pan, Z., Wichern, G., Germain, F.G., Khurana, S., Le Roux, J., "NeuroHeed+: Improving Neuro-steered Speaker Extraction with Joint Auditory Attention Detection", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2024.
      BibTeX TR2024-025 PDF
      @inproceedings{Pan2024mar,
        author = {Pan, Zexu and Wichern, Gordon and Germain, François G and Khurana, Sameer and Le Roux, Jonathan},
        title = {NeuroHeed+: Improving Neuro-steered Speaker Extraction with Joint Auditory Attention Detection},
        booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
        year = 2024,
        month = mar,
        url = {https://www.merl.com/publications/TR2024-025}
      }
    •  Bralios, D., Wichern, G., Germain, F.G., Pan, Z., Khurana, S., Hori, C., Le Roux, J., "Generation or Replication: Auscultating Audio Latent Diffusion Models", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2024.
      BibTeX TR2024-027 PDF
      @inproceedings{Bralios2024mar,
        author = {Bralios, Dimitrios and Wichern, Gordon and Germain, François G and Pan, Zexu and Khurana, Sameer and Hori, Chiori and Le Roux, Jonathan},
        title = {Generation or Replication: Auscultating Audio Latent Diffusion Models},
        booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
        year = 2024,
        month = mar,
        url = {https://www.merl.com/publications/TR2024-027}
      }
    •  Masuyama, Y., Wichern, G., Germain, F.G., Pan, Z., Khurana, S., Hori, C., Le Roux, J., "NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2024.
      BibTeX TR2024-026 PDF
      @inproceedings{Masuyama2024mar,
        author = {Masuyama, Yoshiki and Wichern, Gordon and Germain, François G and Pan, Zexu and Khurana, Sameer and Hori, Chiori and Le Roux, Jonathan},
        title = {NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization},
        booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
        year = 2024,
        month = mar,
        url = {https://www.merl.com/publications/TR2024-026}
      }
    •  Pan, Z., Wichern, G., Masuyama, Y., Germain, F.G., Khurana, S., Hori, C., Le Roux, J., "Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction", IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), DOI: 10.1109/ASRU57964.2023.10389618, December 2023.
      BibTeX TR2023-152 PDF
      @inproceedings{Pan2023dec2,
        author = {Pan, Zexu and Wichern, Gordon and Masuyama, Yoshiki and Germain, François G and Khurana, Sameer and Hori, Chiori and Le Roux, Jonathan},
        title = {Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction},
        booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
        year = 2023,
        month = dec,
        doi = {10.1109/ASRU57964.2023.10389618},
        isbn = {979-8-3503-0689-7},
        url = {https://www.merl.com/publications/TR2023-152}
      }
  • Other Publications

    •  Yidi Jiang, Ruijie Tao, Zexu Pan and Haizhou Li, "Target Active Speaker Detection with Audio-visual Cues", Proc. INTERSPEECH, 2023.
      BibTeX
      @inproceedings{jiang2023target,
        author = {Jiang, Yidi and Tao, Ruijie and Pan, Zexu and Li, Haizhou},
        title = {Target Active Speaker Detection with Audio-visual Cues},
        booktitle = {Proc. INTERSPEECH},
        year = 2023
      }
    •  Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang and Shiliang Zhang, "Rethinking the Visual Cues in Audio-visual Speaker Extraction", Proc. INTERSPEECH, 2023.
      BibTeX
      @inproceedings{li2023rethinking,
        author = {Li, Junjie and Ge, Meng and Pan, Zexu and Cao, Rui and Wang, Longbiao and Dang, Jianwu and Zhang, Shiliang},
        title = {Rethinking the Visual Cues in Audio-visual Speaker Extraction},
        booktitle = {Proc. INTERSPEECH},
        year = 2023
      }
    •  Zexu Pan, Wupeng Wang, Marvin Borsdorf and Haizhou Li, "ImagineNet: Target Speaker Extraction with Intermittent Visual Cue Through Embedding Inpainting", Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2023.
      BibTeX
      @inproceedings{pan2023imaginenet,
        author = {Pan, Zexu and Wang, Wupeng and Borsdorf, Marvin and Li, Haizhou},
        title = {ImagineNet: Target Speaker Extraction with Intermittent Visual Cue Through Embedding Inpainting},
        booktitle = {Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.},
        year = 2023
      }
    •  Tingting Wang, Zexu Pan, Meng Ge, Zhen Yang and Haizhou Li, "Time-Domain Speech Separation Networks With Graph Encoding Auxiliary", IEEE Signal Processing Letters, Vol. 30, pp. 110-114, 2023.
      BibTeX
      @article{wang2023graph,
        author = {Wang, Tingting and Pan, Zexu and Ge, Meng and Yang, Zhen and Li, Haizhou},
        title = {Time-Domain Speech Separation Networks With Graph Encoding Auxiliary},
        journal = {IEEE Signal Processing Letters},
        year = 2023,
        volume = 30,
        pages = {110--114}
      }
    •  Zexu Pan, Ruijie Tao, Chenglin Xu and Haizhou Li, "Selective Listening by Synchronizing Speech with Lips", IEEE/ACM Trans. Audio, Speech, Lang. Process., Vol. 30, pp. 1650-1664, 2022.
      BibTeX
      @article{pan2021reentry,
        author = {Pan, Zexu and Tao, Ruijie and Xu, Chenglin and Li, Haizhou},
        title = {Selective Listening by Synchronizing Speech with Lips},
        journal = {IEEE/ACM Trans. Audio, Speech, Lang. Process.},
        year = 2022,
        volume = 30,
        pages = {1650--1664}
      }
    •  Zexu Pan, Meng Ge and Haizhou Li, "A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction", Proc. INTERSPEECH, 2022, pp. 1786-1790.
      BibTeX
      @inproceedings{pan2022hybrid,
        author = {Pan, Zexu and Ge, Meng and Li, Haizhou},
        title = {A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction},
        booktitle = {Proc. INTERSPEECH},
        year = 2022,
        pages = {1786--1790}
      }
    •  Zexu Pan, Xinyuan Qian and Haizhou Li, "Speaker Extraction with Co-Speech Gestures Cue", IEEE Signal Processing Letters, Vol. 29, pp. 1467-1471, 2022.
      BibTeX
      @article{pan2022seg,
        author = {Pan, Zexu and Qian, Xinyuan and Li, Haizhou},
        title = {Speaker Extraction with Co-Speech Gestures Cue},
        journal = {IEEE Signal Processing Letters},
        year = 2022,
        volume = 29,
        pages = {1467--1471}
      }
    •  Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang and Jianwu Dang, "VCSE: Time-Domain Visual-Contextual Speaker Extraction Network", Proc. INTERSPEECH, 2022, pp. 906-910.
      BibTeX
      @inproceedings{tavcse2022,
        author = {Li, Junjie and Ge, Meng and Pan, Zexu and Wang, Longbiao and Dang, Jianwu},
        title = {VCSE: Time-Domain Visual-Contextual Speaker Extraction Network},
        booktitle = {Proc. INTERSPEECH},
        year = 2022,
        pages = {906--910}
      }
    •  Zexu Pan, Meng Ge and Haizhou Li, "USEV: Universal Speaker Extraction With Visual Cue", IEEE/ACM Trans. Audio, Speech, Lang. Process., Vol. 30, pp. 3032-3045, 2022.
      BibTeX
      @article{usev21,
        author = {Pan, Zexu and Ge, Meng and Li, Haizhou},
        title = {USEV: Universal Speaker Extraction With Visual Cue},
        journal = {IEEE/ACM Trans. Audio, Speech, Lang. Process.},
        year = 2022,
        volume = 30,
        pages = {3032--3045}
      }
    •  Zexu Pan, Ruijie Tao, Chenglin Xu and Haizhou Li, "Muse: Multi-Modal Target Speaker Extraction with Visual Cues", Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 6678-6682.
      BibTeX
      @inproceedings{pan2020muse,
        author = {Pan, Zexu and Tao, Ruijie and Xu, Chenglin and Li, Haizhou},
        title = {Muse: Multi-Modal Target Speaker Extraction with Visual Cues},
        booktitle = {Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.},
        year = 2021,
        pages = {6678--6682}
      }
    •  Xinyuan Qian, Maulik Madhavi, Zexu Pan, Jiadong Wang and Haizhou Li, "Multi-target DoA Estimation with an Audio-visual Fusion Mechanism", Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 4280-4284.
      BibTeX
      @inproceedings{qian2021multi,
        author = {Qian, Xinyuan and Madhavi, Maulik and Pan, Zexu and Wang, Jiadong and Li, Haizhou},
        title = {Multi-target DoA Estimation with an Audio-visual Fusion Mechanism},
        booktitle = {Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.},
        year = 2021,
        pages = {4280--4284}
      }
    •  Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou and Haizhou Li, "Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection", Proc. of the 29th ACM Int. Conf. on Multimedia, 2021, pp. 3927-3935.
      BibTeX
      @inproceedings{tao2021someone,
        author = {Tao, Ruijie and Pan, Zexu and Das, Rohan Kumar and Qian, Xinyuan and Shou, Mike Zheng and Li, Haizhou},
        title = {Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection},
        booktitle = {Proc. of the 29th ACM Int. Conf. on Multimedia},
        year = 2021,
        pages = {3927--3935}
      }
    •  Zexu Pan, Zhaojie Luo, Jichen Yang and Haizhou Li, "Multi-Modal Attention for Speech Emotion Recognition", Proc. INTERSPEECH, 2020, pp. 364-368.
      BibTeX
      @inproceedings{pan2020multi,
        author = {Pan, Zexu and Luo, Zhaojie and Yang, Jichen and Li, Haizhou},
        title = {Multi-Modal Attention for Speech Emotion Recognition},
        booktitle = {Proc. INTERSPEECH},
        year = 2020,
        pages = {364--368}
      }
  • Software & Data Downloads