Spatial audio formats such as Ambisonics are agnostic to the playback device layout and well-suited for applications such as teleconferencing and virtual reality. Conventional Ambisonic encoding methods often rely on spherical microphone arrays for efficient sound field capture, which limits their flexibility in practical scenarios. We propose a deep learning (DL)-based approach that leverages a two-stage network architecture to encode circular microphone array signals into second-order Ambisonics (SOA) in multi-speaker environments. In addition, we introduce (i) a novel loss function based on spatial power maps to regularize inter-channel correlations of the Ambisonic signals, and (ii) a channel permutation technique to resolve the ambiguity of encoding vertical information using a horizontal circular array. Evaluation on simulated speech and noise datasets shows that our approach consistently outperforms traditional signal processing (SP) and DL-based methods, providing significantly better timbral and spatial quality and higher source localization accuracy.
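The channel permutation technique is defined precisely in the paper; as a rough illustration of the underlying idea only, below is a minimal sketch that assumes the vertical ambiguity amounts to an up/down mirror of the sound field. Under the reflection z → -z, a real spherical harmonic of degree l and order m is scaled by (-1)^(l+m), so for ACN-ordered SOA only channels 2, 5, and 7 flip sign, and a permutation-invariant loss can take the minimum over the identity and mirrored targets. The function name, the MSE criterion, and the ACN/N3D assumption are illustrative, not the paper's exact formulation.

```python
import torch

# Sign pattern for mirroring the sound field about the horizontal plane:
# a real SH of degree l and order m gains (-1)^(l+m) under z -> -z, which
# for ACN-ordered SOA negates channels 2 (Z), 5 (T), and 7 (S).
MIRROR_SIGNS = torch.tensor([1., 1., -1., 1., 1., -1., 1., -1., 1.])

def permutation_invariant_loss(est, ref):
    """est, ref: (batch, 9, time) SOA waveforms (ACN ordering assumed)."""
    signs = MIRROR_SIGNS.to(ref).view(1, 9, 1)
    loss_identity = torch.mean((est - ref) ** 2, dim=(1, 2))
    loss_mirrored = torch.mean((est - signs * ref) ** 2, dim=(1, 2))
    # Score the estimate against whichever target orientation it is closer to,
    # so the network is not penalized for an unresolvable up/down flip.
    return torch.minimum(loss_identity, loss_mirrored).mean()
```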
The proposed neural network architecture consists of two stages. In the first stage, complex ratio filters (CRFs) are estimated to generate L virtual loudspeaker signals, conceptually similar to a plane-wave decomposition of the captured sound field. In the second stage, a second set of CRFs is estimated to spatially transform the virtual loudspeaker signals into the Ambisonics domain, analogous to the Ambisonics synthesis process.
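As a concrete but simplified illustration of the two stages, the sketch below applies single-tap CRFs in the STFT domain; the actual network may estimate multi-tap filters over a time-frequency neighborhood, and all shapes and names here are assumptions rather than the paper's implementation.

```python
import torch

def apply_two_stage_crfs(mic_stft, crf1, crf2):
    """
    mic_stft: (B, M, F, T) complex STFT of the M circular-array channels.
    crf1:     (B, L, M, F, T) complex filters, stage 1 (mics -> L virtual LS).
    crf2:     (B, 9, L, F, T) complex filters, stage 2 (virtual LS -> SOA).
    Returns:  (B, 9, F, T) second-order Ambisonics STFT.
    """
    # Stage 1: filter-and-sum across microphones, akin to a learned
    # plane-wave decomposition into L virtual loudspeaker signals.
    virtual_ls = torch.einsum('blmft,bmft->blft', crf1, mic_stft)
    # Stage 2: spatially transform the virtual loudspeaker signals to the
    # 9 SOA channels, akin to Ambisonics synthesis/panning.
    return torch.einsum('bnlft,blft->bnft', crf2, virtual_ls)
```

Both stages are plain filter-and-sum operations; the learning lies entirely in how the CRFs are estimated from the microphone signals.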
All binaural audio demos are decoded from the ground-truth and estimated second-order Ambisonics produced by the different encoding methods (see the paper for details), using the KEMAR HRTF dataset[1]. The spatial power map in each video is generated by processing the Ambisonics signals with the SPARTA PowerMap plugin[2] in "PWD" (plane-wave decomposition) mode. The power map is visualized as a heatmap, where color encodes the sound intensity in each direction. Azimuth angles are defined in the range [-180°, 180°], where 0°, 90°, and -90° are the front, left, and right directions, respectively. Elevation angles are defined in the range [-90°, 90°], where 90°, 0°, and -90° are the top, horizontal, and bottom directions, respectively. Headphones are recommended for the intended spatial audio experience.
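To make the PWD power map concrete, here is a minimal NumPy sketch that steers plane-wave beams from SOA signals over an azimuth/elevation grid matching the conventions above, assuming ACN channel ordering with SN3D normalization (reweighted to N3D for beamforming). SPARTA's implementation differs in its grid, time averaging, and display details, and the function names here are hypothetical.

```python
import numpy as np

def sh_soa_sn3d(azi, ele):
    """Real SH up to order 2, ACN/SN3D (AmbiX), at azimuth/elevation [rad]."""
    sa, ca, se, ce = np.sin(azi), np.cos(azi), np.sin(ele), np.cos(ele)
    return np.stack([
        np.ones_like(azi),                          # W (ACN 0)
        sa * ce,                                    # Y (ACN 1)
        se,                                         # Z (ACN 2)
        ca * ce,                                    # X (ACN 3)
        np.sqrt(3) / 2 * np.sin(2 * azi) * ce**2,   # V (ACN 4)
        np.sqrt(3) / 2 * sa * np.sin(2 * ele),      # T (ACN 5)
        0.5 * (3 * se**2 - 1),                      # R (ACN 6)
        np.sqrt(3) / 2 * ca * np.sin(2 * ele),      # S (ACN 7)
        np.sqrt(3) / 2 * np.cos(2 * azi) * ce**2,   # U (ACN 8)
    ], axis=-1)

def pwd_power_map(soa, az_res=5, el_res=5):
    """soa: (9, T) ACN/SN3D signals -> PWD power map over an az/el grid [dB]."""
    az = np.deg2rad(np.arange(-180, 181, az_res))
    el = np.deg2rad(np.arange(-90, 91, el_res))
    azg, elg = np.meshgrid(az, el)
    Y = sh_soa_sn3d(azg, elg)                        # (n_el, n_az, 9)
    n3d = np.sqrt([1, 3, 3, 3, 5, 5, 5, 5, 5])       # SN3D -> N3D weighting
    beams = np.einsum('eac,ct->eat', Y * n3d, soa)   # plane-wave signals
    return 10 * np.log10(np.mean(beams**2, axis=-1) + 1e-12)
```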
| | Reference | LS-based Filtering II | U-net-based[3] | Proposed |
|---|---|---|---|---|
| Sample 1 | | | | |
| Sample 2 | | | | |
| Sample 3 | | | | |
| | Reference | LS-based Filtering II | U-net-based[3] | Proposed |
|---|---|---|---|---|
| Sample 4 | | | | |
| Sample 5 | | | | |
| Sample 6 | | | | |
| | Reference | LS-based Filtering II | U-net-based[3] | Proposed |
|---|---|---|---|---|
| Sample 7 | | | | |
| Sample 8 | | | | |
| Sample 9 | | | | |
@misc{qiao2024neuralambisonicencodingmultispeaker,
title={Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array},
author={Yue Qiao and Vinay Kothapally and Meng Yu and Dong Yu},
year={2024},
eprint={2409.06954},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2409.06954},
}
[1] Gardner, William G., and Keith D. Martin. "HRTF measurements of a KEMAR." The Journal of the Acoustical Society of America 97.6 (1995): 3907-3908.
[2] McCormack, Leo, and Archontis Politis. "SPARTA & COMPASS: Real-time implementations of linear and parametric spatial audio reproduction and processing methods." AES International Conference on Immersive and Interactive Audio. Audio Engineering Society, 2019.
[3] Heikkinen, Mikko, Archontis Politis, and Tuomas Virtanen. "Neural Ambisonics encoding for compact irregular microphone arrays." 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.