Conceptual model of the technology for calculating the similarity threshold of two audio sequences

Author(s)

Vladyslav Oleksandrovych Kholiev, Kharkiv National University of Radio Electronics (NURE)

DOI:

https://doi.org/10.18664/ikszt.v29i3.313703

Keywords:

machine learning, MFCC, DTW, feature extraction, speaker recognition, classification, voice cloning, Siamese neural networks

Abstract

The paper addresses the pressing problem of speaker verification by means of voice time series comparison. The aim of this paper is to determine the orders of mel-frequency cepstral coefficients (MFCCs) that most accurately describe the difference between an authentic voice and an artificially generated copy, for their further use as input to a neural network model in a resource-limited environment. To achieve this goal, the following tasks were accomplished: a conceptual model of the technology for determining the similarity threshold of two audio sequences was developed; the orders of mel-frequency cepstral coefficients with the most characteristic differences between the recorded and the generated voice were determined on the basis of neural network analysis; an experimental study of the dependence of the execution time and computational load on the created feature vector when assessing the degree of similarity of two time series was conducted; and the optimal similarity threshold was determined on the basis of the chosen dataset. The developed model of the technology for determining the similarity threshold was tested on a dataset that combines the DEEP-VOICE dataset with our own dataset. Applying the developed technology yielded a 43% improvement when using the selected MFCC orders compared to using all of them. Based on the experimental studies, the DTW acceptance threshold was set at 0.37.
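The pipeline the abstract describes lends itself to a short sketch. The fragment below is a minimal illustration, not the authors' implementation: it extracts MFCCs with librosa (an assumed dependency), keeps only a placeholder set of coefficient orders (SELECTED_ORDERS is hypothetical, since the abstract does not list the orders found in the study), compares two recordings with a path-length-normalized DTW, and accepts the pair at the reported threshold of 0.37. How the study actually scaled DTW scores onto that threshold is likewise an assumption here.

```python
# Minimal sketch of the verification pipeline described in the abstract.
# Assumptions: librosa is available; SELECTED_ORDERS is a hypothetical
# placeholder for the MFCC orders determined in the study; the path-length
# normalization that makes the score comparable to 0.37 is illustrative.
import librosa
import numpy as np

SELECTED_ORDERS = [1, 2, 3]   # hypothetical: substitute the orders found via neural network analysis
DTW_THRESHOLD = 0.37          # acceptance threshold reported in the abstract

def selected_mfcc(path, n_mfcc=20):
    """Load audio and keep only the MFCC orders used as the feature vector."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return mfcc[SELECTED_ORDERS, :].T                       # shape: (frames, len(SELECTED_ORDERS))

def dtw_distance(a, b):
    """Classic O(n*m) DTW with Euclidean local cost, normalized by path length."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # normalization keeps scores comparable across lengths

def same_speaker(path_a, path_b):
    """True if the two recordings fall under the DTW acceptance threshold."""
    return dtw_distance(selected_mfcc(path_a), selected_mfcc(path_b)) <= DTW_THRESHOLD
```

Note that dtw_distance grows quadratically with recording length, which is why shrinking the feature vector to a few informative MFCC orders, as the abstract's experiments on execution time and computational load suggest, matters in a resource-limited environment.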

Author Biography

Vladyslav Oleksandrovych Kholiev, Kharkiv National University of Radio Electronics (NURE)

Assistant Professor at the Department of Electronic Computers

References

Sidhu, M., Latib, N., Sidhu, K. (2024). MFCC in audio signal processing for voice disorder: a review. Multimedia Tools and Applications, 1-21. DOI: 10.1007/s11042-024-19253-1.

Kholiev, V., Barkovska, O. (2023). Comparative analysis of neural network models for the problem of speaker recognition. Innovative Technologies and Scientific Solutions for Industries, No. 2 (24), P. 172-178.

Zheng, F., Zhang, G., Song, Z. (2001). Comparison of different implementations of MFCC. Journal of Computer Science and Technology, 16, 582-589. DOI: 10.1007/BF02943243.

Dave, N. (2013). Feature extraction methods LPC, PLP and MFCC in speech recognition. International Journal for Advance Research in Engineering and Technology, Vol. 1 (ISSN 2320-6802).

Sharma, G., Umapathy, K., Krishnan, S. (2020). Trends in audio signal feature extraction methods. Applied Acoustics, 158, 107020. DOI: 10.1016/j.apacoust.2019.107020.

Mueen, A., Keogh, E. J. (2016). Extracting optimal performance from Dynamic Time Warping. KDD 2016, 2129-2130.

Permanasari, Y. et al. (2019). Journal of Physics: Conference Series, 1366, 012091.

Bird, J. J., Lotfi, A. (2023). Real-time detection of AI-generated speech for DeepFake voice conversion. arXiv preprint arXiv:2308.12734.

Retrieval-based-Voice-Conversion-WebUI (n.d.). GitHub. https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/

Kholiev, V., Barkovska, O. (2023). Analysis of the training and test data distribution for audio series classification. Information and Control Systems at Railway Transport, No. 1, P. 38-43. DOI: https://doi.org/10.18664/ikszt.v28i1.276343.

Chicco, D. (2021). Siamese neural networks: an overview. In: Cartwright, H. (ed.) Artificial Neural Networks. Methods in Molecular Biology, vol. 2190. Humana, New York, NY. DOI: https://doi.org/10.1007/978-1-0716-0826-5_3.


Published

2024-10-25