Analysis of the training and test data distribution for audio series classification



Keywords:

datasets; pre-processing; machine learning; cross-validation; LibriSpeech; LibriVox


The effectiveness of a machine learning algorithm on a given task depends largely on its training and test datasets: not only on the amount of data, but also on its content (that is, its relevance to the task at hand) and its organization. The common approach is to split the dataset into training and test sets to guard against model overfitting. In addition, different training-to-test ratios are used in the partitioning to achieve better values of the selected model performance criteria (accuracy, learning rate, etc.). The goal of this paper is to analyze methods of dataset partitioning for training neural networks and statistical models. One of the reviewed methods, cross-validation, was applied to a dataset derived from the LibriSpeech corpus, an open English speech corpus based on the LibriVox project of volunteer-contributed audiobooks. The result of applying this partitioning method to that dataset is demonstrated.
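As a rough illustration of the cross-validation partitioning mentioned above, the following minimal sketch splits sample indices into k folds, each serving once as the test set. The function name `k_fold_indices`, the fold count, and the random seed are assumptions made for illustration only; they are not taken from the paper and do not reflect the configuration the authors used.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation.

    Hypothetical helper for illustration; k and seed are assumed values.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    # Shuffle once so that folds are not tied to the original sample order.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]

    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


if __name__ == "__main__":
    for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=5)):
        print(fold, sorted(train_idx), sorted(test_idx))
```

Every sample appears in exactly one test fold, so each recording contributes to evaluation once and to training k - 1 times. For speech data it may also be worth grouping by speaker so that the same voice does not appear on both sides of a split; this sketch does not do that.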

Author Biographies

Vladyslav Kholiev, Kharkiv National University of Radio Electronics "NURE"

Postgraduate student, Department of Electronic Computers

Olesia Barkovska, Kharkiv National University of Radio Electronics "NURE"

PhD, Associate Professor, Department of Electronic Computers

