To assist drum music learners, the study proposes an intelligent music score generation method based on improved CNN and STFT. To better capture the frequency features of the music signal, the study first uses STFT to transfer the sound signal from the time domain to the frequency domain. Then, the converted spectrum is analyzed and learned by the improved CNN, which in turn realizes the automatic generation of music score.
Time-frequency transformation of music signals based on STFT
The main methods for automatic generation of drum music scores are segmentation-and-classification-based methods and activation-based methods. The first relies on recognizing drumming patterns and understanding rhythmic structures, and has a relatively high error rate. In contrast, activation-based methods focus on modeling the drummer's playing habits and techniques, using algorithms to generate drum sequences with dynamic changes and expressiveness [16]. The study therefore adopts the activation-based score generation method, as shown in Fig. 1.

Method flow of music score generation based on STFT and improved CNN.
In Fig. 1, the study combines STFT and the improved CNN to realize the automatic generation of drum sheet music through time-frequency conversion, feature mapping, and label reinforcement. The method consists of three modules: a time-frequency conversion module, a CNN activation module, and a peak extraction module. The activation module uses an improved CNN model to capture the subtle rhythmic changes of the music. From the feature maps produced by the activation module, the peak extraction module identifies the rhythmic peaks that correspond to the crucial percussion moments in drumming. Before the automatic generation of music scores, the study uses STFT to perform time-frequency analysis of the sound signal and transforms it to obtain the Mel time-frequency map. The specific flow of the time-frequency transformation is shown in Fig. 2.

The specific flow of time-frequency conversion.
In Fig. 2, in order to give the captured audio signals the same length for better feature extraction and model training in the subsequent score generation, the study performs a length-alignment process on the audio signals. The study first normalizes the audio signal and appends zero-valued frames to the end of the processed audio segment. After this processing, the signal is divided into frames, and time-frequency conversion is performed by combining a window function with the discrete Fourier transform. To further improve the utilization of the audio features, additional filtering is performed with a Mel filter. The specific process of converting dual-channel audio to mono in the preprocessing stage is shown in Eq. (1).
$${{\text{S}}_{\text{i}}}=\left\{ {\begin{array}{*{20}{c}} {{{\text{S}}^1}_{{\text{i}}},{\text{n}}=1} \\ {\frac{{{{\text{S}}^1}_{{\text{i}}}+{{\text{S}}^2}_{{\text{i}}}}}{2},{\text{n}}=2} \end{array}} \right.$$
(1)
In Eq. (1), \({{\text{S}}^1}_{{\text{i}}}\) and \({{\text{S}}^2}_{{\text{i}}}\) are the first and second channels of the dual-channel audio, respectively. \({\text{n}}\) is the number of audio channels. \({{\text{S}}_{\text{i}}}\) is the processed mono audio. \({\text{i}}\) is the sample point index. Sound is mainly characterized by physical properties such as timbre, frequency, and amplitude, and these properties are more apparent in the time-frequency domain (TFD). Among them, the two physical quantities frequency and amplitude can be clearly displayed in the TFD by STFT. Moreover, the ranges of sound signals differ. Before time-frequency conversion, it is necessary to unify the sampling rate of the audio signal and normalize the processed signal. When the maximum of the absolute values of the signal equals 0, the study takes the normalized audio signal to be the initial signal itself. When this maximum is nonzero, the signal is normalized as expressed in Eq. (2).
$${{\text{S}^{\prime}}_{\text{i}}}=\frac{{{{\text{S}}_{\text{i}}}}}{{\hbox{max} \left\{ {abs\left[ {{{\text{S}}_1}, \ldots ,{{\text{S}}_{\text{m}}}} \right]} \right\}}}$$
(2)
In Eq. (2), \({\text{abs}}\left[ {} \right]\) is the absolute value function. \({{\text{S}^{\prime}}_{\text{i}}}\) is the normalization result. \({\text{max}}\left( {} \right)\) is the maximum value screening process. After pre-processing the audio signals, the study divides them into frames. The framing operation is shown in Fig. 3.
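The preprocessing described by Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration, assuming the input is an array of shape (channels, samples) and that peak normalization divides by the maximum absolute sample value; the function name is hypothetical.

```python
import numpy as np

def preprocess(channels: np.ndarray) -> np.ndarray:
    """Down-mix to mono (Eq. 1) and peak-normalize (Eq. 2).

    `channels` has shape (n, m): n audio channels, m samples each.
    """
    n = channels.shape[0]
    if n == 1:
        s = channels[0].astype(float)      # already mono
    else:
        s = channels[:2].mean(axis=0)      # average the two channels
    peak = np.max(np.abs(s))
    if peak == 0:                          # silent signal: keep as-is (Eq. 2 case max = 0)
        return s
    return s / peak                        # samples now lie in [-1, 1]
```

Silent input is returned unchanged, matching the paper's rule that normalization is only applied when the maximum absolute value is nonzero.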

The specific process of framing operation.
In Fig. 3, the study splits the audio signal into multiple short-time frames by applying window processing, so as to better capture the local features of the signal. The length of each frame is chosen to be 20 ms–40 ms, and there is some overlap between frames to ensure the continuity of the signal. During frame splitting, the choice of window function is crucial for feature extraction. The Hamming window is widely used because of its good performance in reducing spectral leakage; it is therefore chosen as the window function for frame segmentation in this study. This is because the Hamming window has a suitable main-lobe width and low side-lobe levels in the frequency domain, which helps to reduce spectral leakage between frames. The Hamming window function is calculated as shown in Eq. (3).
$${\text{w}}=0.54-0.46{\text{cos}}\frac{{2\pi {\text{i}}}}{{{\text{m-}}1}}$$
(3)
In Eq. (3), \({\text{m}}\) is the window length and \({\text{w}}\) is the window function. After framing, the study applies the discrete Fourier transform to the audio signal. After obtaining the spectrum, in order to further extract the characteristic parameters of the audio signal, the study applies a Mel filter for further filtering. Finally, the study uses the resulting Mel time-frequency map as the input data for the CNN model, providing data support for the subsequent music score generation.
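The framing, Hamming windowing (Eq. 3), and DFT steps can be sketched as below. This is a simplified illustration: the frame length, hop size, and zero-padding strategy are assumptions (the paper specifies only 20 ms–40 ms frames with overlap), and the Mel filtering stage is omitted; a library such as librosa would typically handle the full Mel pipeline.

```python
import numpy as np

def frame_signal(s, frame_len, hop):
    """Split a signal into overlapping frames, zero-padding the tail."""
    n_frames = 1 + int(np.ceil(max(len(s) - frame_len, 0) / hop))
    padded = np.zeros(n_frames * hop + frame_len)
    padded[:len(s)] = s
    return np.stack([padded[i * hop: i * hop + frame_len] for i in range(n_frames)])

def stft_magnitude(s, frame_len=512, hop=256):
    """Hamming-windowed magnitude spectrogram of a mono signal."""
    frames = frame_signal(s, frame_len, hop)
    i = np.arange(frame_len)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * i / (frame_len - 1))  # Eq. (3)
    return np.abs(np.fft.rfft(frames * w, axis=1))             # per-frame DFT
```

The explicit window formula matches `np.hamming(frame_len)`; it is written out here to mirror Eq. (3).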
Improved CNN-based score generation model construction for drum set
After processing the input audio signals using STFT and Mel filters, the study feeds these signals into the CNN model for activation processing. The CNN model outputs frame-level activity values after activation processing and performs peak extraction, which yields the feature extraction of the audio signal. The CNN is then used to classify the performance information in order and map the extracted features to the individual hit points of the drum set. The hitting patterns of different drum hits in the audio signal are recognized and converted into symbols on the music score [17]. For the time-frequency transformed Mel time-frequency map, the study inputs the features into the CNN model frame by frame. The formula for obtaining the context is shown in Eq. (4).
$${\text{Con}}\left( {\text{k}} \right)={\text{concat}}\left\{ {{\text{T}}\left( {\text{k}} \right),{\text{T}}\left( {k+1} \right), \ldots ,{\text{T}}\left( {k+{\text{h}} - 1} \right)} \right\}$$
(4)
In Eq. (4), \({\text{T}}\left( {\text{k}} \right)\) denotes the time-frequency of the \({\text{k}}\)th frame. \({\text{h}}\) is the context size. \({\text{concat}}\) is the splicing operation. \({\text{Con}}\left( {\text{k}} \right)\) is the context frame result. After obtaining the context of size 16 frames, the study inputs it into the CNN for mapping classification. The activity values of the drums are obtained through mapping classification, which leads to the generation of the music score. In this process, the CNN model labels the signals according to the classification and recognition results, and maps their hit points to the corresponding locations in the music score. The network structure of the CNN for the research setup is shown in Fig. 4.
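The context construction of Eq. (4) can be sketched as a stacking operation over the Mel time-frequency map. This is a minimal sketch under the assumption that the context is h consecutive frames starting at frame k, concatenated into one feature vector, with frames past the end of the signal zero-padded; the function name is hypothetical.

```python
import numpy as np

def context_window(T: np.ndarray, k: int, h: int) -> np.ndarray:
    """Concatenate h consecutive time-frequency frames starting at frame k (Eq. 4).

    T has shape (n_frames, n_mels); frames past the end are zero-padded.
    """
    out = np.zeros((h, T.shape[1]))
    avail = min(h, T.shape[0] - k)      # frames actually available from k onward
    out[:avail] = T[k:k + avail]
    return out.ravel()                  # splice ("concat") into one vector
```

With h = 16, as in the paper, each CNN input would be a 16-frame slice of the Mel map.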

The network structure of CNN.
In Fig. 4, the convolutional network structure proposed in the study contains three main parts: the downsampling layer, the pooling layer, and the fully connected layer. The downsampling layer has four convolutional structures of size 3 × 3 and an activation function. After the signal passes through the CNN model, the study obtains distribution maps of the activity values of the components of the drum set. These distribution maps clearly show the activity level of each drum component over the time series, providing accurate data support for subsequent score generation. The study uses localized peaks in the activity values as the starting points of drum events, and the distance between two localized peaks must be greater than 4 frames. The study uses sigmoid cross-entropy (CE) as the loss function (LF), as shown in Eq. (5).
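The peak-extraction rule described above can be sketched as a simple local-maximum search. This is an illustrative sketch: the activity threshold of 0.5 is an assumption (the paper specifies only the minimum inter-peak distance of 4 frames), and the function name is hypothetical.

```python
def pick_onsets(activity, threshold=0.5, min_dist=4):
    """Local peaks in a frame-level activity curve, kept more than
    `min_dist` frames apart, used as drum-event starting points."""
    peaks = []
    for i in range(1, len(activity) - 1):
        if activity[i] < threshold:
            continue                                   # too weak to be an onset
        if activity[i] >= activity[i - 1] and activity[i] > activity[i + 1]:
            if not peaks or i - peaks[-1] > min_dist:  # enforce peak spacing
                peaks.append(i)
    return peaks
```

A peak closer than 4 frames to the previous accepted peak is discarded, which suppresses duplicate detections of a single hit.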
$${{\text{L}}_{{\text{loss}}}}=\sum\limits_{{{\text{i}}=1}}^{N} {{\xi _{\text{i}}}{L_s}\left( {{f_i}\left( x \right),{y_i}} \right)}$$
(5)
In Eq. (5), \({{\text{L}}_{{\text{loss}}}}\) is the LF of the training process. \({{\text{L}}_{\text{s}}}\left( \cdot \right)\) is the sigmoid CE loss. \({{\text{f}}_{\text{i}}}\left( {\text{x}} \right)\) is the output of the CNN. \({\xi _{\text{i}}}\) is the weight of the corresponding drum component. \({\text{N}}\) is the number of drum components. The study assigns weights to the drum components based on their variability and frequency of occurrence. The weight of the kick drum (KD) is set to 0.5, the snare drum (SD) to 2.0, and the hi-hat (HH) to 1.5, while the high, medium, and low toms are weighted 0.8 each and the suspended and rhythm cymbals 1.0 each. During the course of the study, it is found that the drum events extracted by the proposed CNN model have limitations when the training process uses a single label. This reduces the recognition accuracy of the model when dealing with complex rhythms and multiple intertwined drums. To address this issue, the study further introduces a joint-label training strategy, which allows each time frame to correspond to multiple drum labels, thus capturing complex musical structures more accurately. The joint-label labeling method is shown in Fig. 5.
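The weighted loss of Eq. (5) can be sketched as follows. This is a minimal numpy sketch: the ordering of the eight components in the weight vector (KD, SD, HH, three toms, two cymbals) is an assumption based on the listing above, and a real training pipeline would use a framework loss such as a weighted `BCEWithLogitsLoss`.

```python
import numpy as np

# Assumed component order: KD, SD, HH, high/mid/low tom, suspended/rhythm cymbal.
WEIGHTS = np.array([0.5, 2.0, 1.5, 0.8, 0.8, 0.8, 1.0, 1.0])

def weighted_sigmoid_ce(logits, targets, weights=WEIGHTS):
    """Weighted sigmoid cross-entropy over the N drum components (Eq. 5)."""
    p = 1.0 / (1.0 + np.exp(-logits))          # sigmoid of CNN outputs f_i(x)
    eps = 1e-12                                # numerical safety for log()
    ce = -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
    return float(np.sum(weights * ce))         # xi_i-weighted sum over components
```

The large SD weight (2.0) makes snare errors cost four times as much as kick-drum errors, reflecting the per-component weighting described above.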

The labeling method of joint labels.
In Fig. 5, joint labels contain the track signals of the different drum components. In the activity recognition process, the track to which a signal belongs must first be identified, after which its activity value is discriminated. After introducing the joint-label training method, the study applies it alongside the common single-label training and adopts a self-distillation method so that the two classifiers share the coding layer, thereby realizing their joint training. After training is completed, the labels produced by the joint-label-trained classifier need to be aggregated. The aggregation process is shown in Eq. (6).
$${{\text{P}}_{{\text{x,z,u}}}}={\left( {1+{e^{ - \frac{1}{M}\sum\nolimits_{{j=1}}^{M} {{{\text{u}}^T}{{\text{x}}_{\text{j}}}} }}} \right)^{ - 1}}$$
(6)
In Eq. (6), \({\text{z}}\) denotes the network parameters of the coding layer, \({\text{u}}\) the network parameters of the joint classifier, and \({\text{M}}\) the number of tracks. \({{\text{P}}_{{\text{x,z,u}}}}\) is the aggregated label. The label-enhancement training in this study is based on a multi-task learning framework using DL. This framework enhances the model's ability to recognize different drum components by sharing the underlying feature extraction layer. The internal convolutional structure can effectively capture features of the TFD, enhance the expression of musical signals, and thereby improve the recognition accuracy of complex rhythms. Moreover, to further enhance the computational efficiency of the model and simplify its structure, the study optimizes the convolutional model structure. The study first introduces inner convolution (involution) to replace the convolutional layers in its convolutional module. Inner convolution is an efficient convolutional method that uses local connections inside each convolutional layer and reduces redundant computational steps. The improved CNN structure is shown in Fig. 6.
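One plausible reading of the aggregation in Eq. (6) can be sketched as a sigmoid applied to the mean of the M track-wise classifier outputs. This is explicitly a sketch: the exact argument of the sum in the paper's equation is partly garbled, so treating each summand as a per-track logit is an assumption, and the function name is hypothetical.

```python
import numpy as np

def aggregate_tracks(track_logits: np.ndarray) -> float:
    """Aggregate M per-track logits into one label probability:
    sigmoid of the mean over the M tracks (one reading of Eq. 6)."""
    return float(1.0 / (1.0 + np.exp(-np.mean(track_logits))))
```

Averaging before the sigmoid means no single track can dominate the aggregated label: strongly positive and strongly negative tracks cancel out.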

The improved network structure of CNN.
In Fig. 6(a), after the feature map is input, the study efficiently compresses it through the inner convolution operation, which reduces the consumption of computational resources. The inner convolution operation is not a simple dot product; it achieves feature extraction through local connections within each convolution kernel, which also ensures feature alignment. The feature compression process exploits the efficiency of inner convolution to gradually reduce the dimensionality of the feature map, lowering computational complexity and improving model efficiency. The specific steps are feature extraction, local connection, and dimension compression, which retain key information while reducing redundant computations. Feature compression is achieved by replicating and expanding the feature map across channels, enhancing multi-dimensional information fusion and further improving the feature expression ability. First, the original feature map is replicated by channel to generate multi-channel copies. Second, local connection and feature extraction are performed for each channel using inner convolution. Finally, dimension compression merges the multi-channel information to form an efficient feature representation. In Fig. 6(b), each of the four downsampling layers of the improved CNN structure has at least one inner convolution module, which contains two convolution layers and one inner convolution kernel. By introducing the involution module, the number of channels in the model is further reduced. In summary, the study establishes an intelligent music score generation method based on improved CNN and STFT. First, the method extracts and transforms the audio signal features using STFT.
Then, it analyzes and recognizes the signal using a CNN model that introduces label-enhancement training and an involution module. Finally, it generates a music score. The convolution kernel size of the model is 3 × 3 with a stride of 1. The activation function is ReLU, and the pooling layer adopts max pooling with a 2 × 2 window and a stride of 2. The network has four downsampling layers, each followed by an involution module that effectively reduces the feature dimension. The learning rate is set to 0.001, and the Adam optimizer is employed to update the parameters and improve convergence speed.
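One downsampling step with these hyperparameters can be sketched on a single-channel feature map. This is a didactic sketch, not the paper's implementation: it assumes 'valid' padding for the 3 × 3 convolution and omits the involution module and channel dimension; a real model would use a DL framework.

```python
import numpy as np

def conv3x3_relu_pool(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """3x3 convolution (stride 1, 'valid' padding assumed), ReLU,
    then 2x2 max pooling with stride 2, on a 2-D feature map."""
    h, w = x.shape
    conv = np.zeros((h - 2, w - 2))
    for i in range(h - 2):                     # slide the 3x3 kernel
        for j in range(w - 2):
            conv[i, j] = np.sum(x[i:i + 3, j:j + 3] * kernel)
    act = np.maximum(conv, 0.0)                # ReLU activation
    ph, pw = act.shape[0] // 2, act.shape[1] // 2
    return act[:ph * 2, :pw * 2].reshape(ph, 2, pw, 2).max(axis=(1, 3))
```

Each such step halves the spatial resolution, so four downsampling layers reduce the feature map by a factor of 16 per axis.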
Intelligent generation method of drum music scores based on improved CNN and STFT