Music Source Separation Using ASPP Based on Coupled U-Net Model

Suwon Yang, Daewon Lee

Abstract


Noise is one of the factors that interfere with modern life, and various noise-canceling techniques have been studied to suppress it. Whereas earlier techniques relied on physical soundproofing, recent research has concentrated on active noise canceling, which removes only the unwanted sound that is actually present. Active noise canceling (ANC), or digital noise canceling, is based on sound source separation: the technology of extracting individual sound signals from a mixture. Most source separation research focuses on speech enhancement rather than noise reduction. Source separation makes it possible to obtain the desired sound information more accurately and, by eliminating unwanted sound information, further improves noise canceling. To provide greater depth and better separation than existing structures, we focus on the coupled U-Net model and the Atrous Spatial Pyramid Pooling (ASPP) technique. This paper presents a music source separation method that combines the coupled U-Net structure with ASPP. To validate the proposed method, we compared GNSDR, GSIR, and GSAR on MIR-1K, a data set designed for evaluating music source separation performance. The results show that the proposed method overcomes the disadvantages of the other methods and strengthens the feature map.
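As a concrete illustration of the ASPP component named above, the following is a minimal sketch of an ASPP block as it is commonly built: parallel dilated convolutions plus an image-level pooling branch, concatenated and fused. It is written in PyTorch; the channel sizes and dilation rates are illustrative assumptions, not the authors' exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ASPP(nn.Module):
        """Atrous Spatial Pyramid Pooling: parallel dilated convolutions
        whose outputs are concatenated and fused, enlarging the receptive
        field without reducing spectrogram resolution."""
        def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
                for r in rates
            ])
            # Image-level pooling branch, as in DeepLab-style ASPP.
            self.pool = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(in_ch, out_ch, kernel_size=1),
            )
            self.fuse = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, kernel_size=1)

        def forward(self, x):
            size = x.shape[-2:]
            feats = [F.relu(b(x)) for b in self.branches]
            pooled = F.interpolate(self.pool(x), size=size,
                                   mode="bilinear", align_corners=False)
            feats.append(pooled)
            return F.relu(self.fuse(torch.cat(feats, dim=1)))

Such a block is typically inserted at the bottleneck of a U-Net, where dilated kernels let the network aggregate context over wide time-frequency neighborhoods at little extra cost.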
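Likewise, GNSDR, GSIR, and GSAR are conventionally obtained by averaging per-clip BSS Eval scores weighted by clip length, where NSDR is the SDR improvement over using the raw mixture as the estimate. The sketch below assumes mir_eval's BSS Eval implementation and an illustrative data layout; the paper's actual evaluation script is not given.

    import numpy as np
    import mir_eval.separation as separation

    def global_metrics(clips):
        """clips: list of (mixture, vocal, accomp, vocal_est, accomp_est),
        each a 1-D numpy array of equal length within a clip."""
        nsdr_sum = sir_sum = sar_sum = weight_sum = 0.0
        for mixture, vocal, accomp, vocal_est, accomp_est in clips:
            refs = np.stack([vocal, accomp])
            ests = np.stack([vocal_est, accomp_est])
            sdr, sir, sar, _ = separation.bss_eval_sources(
                refs, ests, compute_permutation=False)
            # NSDR baseline: score the unprocessed mixture as the estimate.
            sdr_mix, _, _, _ = separation.bss_eval_sources(
                refs, np.stack([mixture, mixture]), compute_permutation=False)
            w = len(vocal)  # weight each clip by its length
            nsdr_sum += w * (sdr[0] - sdr_mix[0])  # index 0: vocal channel
            sir_sum += w * sir[0]
            sar_sum += w * sar[0]
            weight_sum += w
        # Length-weighted averages over all clips.
        return (nsdr_sum / weight_sum,   # GNSDR
                sir_sum / weight_sum,    # GSIR
                sar_sum / weight_sum)    # GSAR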


Keywords


Active noise canceling; sound source separation; music source separation; U-Net structure; coupled U-Net structure; atrous spatial pyramid pooling.


References


Tang, Zhiqiang, et al. "CU-Net: Coupled U-Nets." arXiv preprint arXiv:1808.06521 (2018).

Chen, Liang-Chieh, et al. "Encoder-decoder with atrous separable convolution for semantic image segmentation." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation." International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015.

Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

Park, Sungheon, et al. "Music source separation using stacked hourglass networks." arXiv preprint arXiv:1805.08559 (2018).

Yuan, Weitao, et al. "Enhanced feature network for monaural singing voice separation." Speech Communication 106 (2019): 1-6.

Yang, Yi-Hsuan. "Low-rank representation of both singing voice and music accompaniment via learned dictionaries." ISMIR. 2013.

Jansson, Andreas, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. "Singing voice separation with deep U-Net convolutional networks." 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.

Stoller, Daniel, Sebastian Ewert, and Simon Dixon. "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation." arXiv preprint arXiv:1806.03185 (2018).

Défossez, Alexandre, et al. "Demucs: Deep extractor for music sources with extra unlabeled data remixed." arXiv preprint arXiv:1909.01174 (2019).

Takahashi, Naoya, and Yuki Mitsufuji. "Multi-scale multi-band DenseNets for audio source separation." 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017.

Stöter, Fabian-Robert, et al. "Open-Unmix: A reference implementation for music source separation." (2019).

Ince, Gökhan, et al. "Ego noise suppression of a robot using template subtraction." 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2009.

Nakajima, Hirofumi, et al. "An easily-configurable robot audition system using histogram-based recursive level estimation." 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010.

Luo, Yi, et al. "Deep clustering and conventional networks for music separation: Stronger together." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.

Yu, Dong, et al. "Permutation invariant training of deep models for speaker-independent multi-talker speech separation." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.

Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. "SegNet: A deep convolutional encoder-decoder architecture for image segmentation." IEEE Transactions on Pattern Analysis and Machine Intelligence 39.12 (2017): 2481-2495.

Tan, Ke, and DeLiang Wang. "A convolutional recurrent neural network for real-time speech enhancement." Interspeech. 2018.




DOI: http://dx.doi.org/10.18517/ijaseit.11.2.12833




Published by INSIGHT - Indonesian Society for Knowledge and Human Development