# CS7643 Time-Dependency Model

## Time-Dependency Model

In this subsection, different time-dependency models are compared. Figure 3.8a shows the results for models that use CNN framewise modelling and attention-pooling together with one of three time-dependency models. The model on the far left-hand side is again the CNN-SA-AP model, which uses self-attention for time-dependency modelling; the CNN-LSTM-AP model uses a BiLSTM network with one layer and 128 hidden units; and the CNN-Skip-AP model skips the time-dependency stage and pools the CNN outputs directly. The difference between the time-dependency models is not as large as that between the framewise models. That said, including a time-dependency model improves the overall performance, and the self-attention network outperforms the LSTM network. The PCC of the self-attention network is on average around 0.15 higher than the PCC of the model without time dependency.
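To make the role of the time-dependency stage concrete, the following is a minimal sketch of single-head scaled dot-product self-attention over a sequence of framewise feature vectors, next to a pass-through stage in the spirit of CNN-Skip-AP. It is an illustration under simplifying assumptions, not the network evaluated above: the query, key, and value projections are taken to be the identity, there is no multi-head structure or feed-forward sublayer, and the function names `self_attention` and `skip` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention (identity projections).

    frames: list of T feature vectors, each a list of D floats
            (for example, framewise CNN outputs).
    Returns T vectors of length D; each output frame is a weighted
    average of all input frames, so it can draw on the whole sequence.
    """
    d = len(frames[0])
    out = []
    for query in frames:
        # Similarity of this frame to every frame, scaled by sqrt(D).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in frames
        ]
        weights = softmax(scores)
        # Weighted sum of the frames (values are the inputs themselves here).
        out.append(
            [sum(w * v[i] for w, v in zip(weights, frames)) for i in range(d)]
        )
    return out


def skip(frames):
    """Pass-through stage: the framewise features go to pooling unchanged."""
    return frames
```

Because each output frame is a convex combination of the input frames, every frame can take information from the entire sequence, whereas the pass-through stage leaves each frame dependent only on its own framewise features.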

The different time-dependency models can also be combined by applying an LSTM network followed by a self-attention network, or vice versa. The results of these combinations are shown in Fig. 3.8b, where it can be seen that both combinations perform on average worse than using only an LSTM or only an SA network. Overall, it can be concluded that a speech quality model benefits from including a time-dependency stage. The difference between using a self-attention network and an LSTM network for this purpose is small, but the SA network gave the best results on average.
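The comparisons in this subsection are stated in terms of PCC. Assuming PCC denotes the Pearson correlation coefficient between predicted and subjective quality scores (the text here does not spell the abbreviation out), it could be computed as in the sketch below. The function name `pearson_correlation` and the example scores in the usage note are hypothetical.

```python
import math


def pearson_correlation(predicted, target):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(predicted)
    if n != len(target) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    # Sums of centred products; the 1/n factors cancel in the ratio.
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)
```

For instance, `pearson_correlation([1.2, 2.9, 4.1], [1.0, 3.0, 4.5])` compares three made-up model predictions with three made-up subjective ratings; a value closer to 1 indicates that the predictions follow the ratings more closely in a linear sense.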

## Pooling Model

In this subsection, different pooling models are compared, with a CNN applied for framewise modelling. The pooling mechanisms are average-pooling (Avg), max-pooling (Max), and attention-pooling. When a recurrent LSTM network is used, the output of the last time step can also be used for pooling (Last).
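The four pooling mechanisms named above can be sketched as functions that reduce a sequence of T feature vectors to a single vector. This is a simplified illustration rather than the implementation behind the reported results: in particular, the attention scores here come from a single hypothetical scoring vector `w`, whereas a trained model would typically learn a small network for this purpose.

```python
import math


def avg_pool(frames):
    """Average each feature dimension over time."""
    t = len(frames)
    return [sum(col) / t for col in zip(*frames)]


def max_pool(frames):
    """Take the maximum of each feature dimension over time."""
    return [max(col) for col in zip(*frames)]


def last_pool(frames):
    """Use the feature vector of the last time step (natural for an LSTM)."""
    return list(frames[-1])


def attention_pool(frames, w):
    """Weighted average over time with softmax-normalised frame scores.

    w: hypothetical scoring vector (assumed learned); score_t = w . x_t
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in frames]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    d = len(frames[0])
    return [sum(a * x[i] for a, x in zip(weights, frames)) for i in range(d)]
```

With a zero scoring vector all frames receive equal weight, so attention-pooling reduces to average-pooling; a scoring vector that strongly favours a few frames moves the result towards those frames.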

Figure 3.9b (middle) compares the proposed CNN-SA-AP model to CNN-SA models with max- or average-pooling. Although the performance difference is relatively small, attention-pooling outperforms the other two methods, while average-pooling and max-pooling achieve approximately the same results. When an LSTM is used for time-dependency modelling (Fig. 3.9c), attention-pooling again outperforms the other mechanisms, with average-pooling achieving similar results; last-step-pooling and max-pooling are outperformed by the other two methods. In contrast, in Fig. 3.9a, where the output of the CNN is pooled directly without a TD model, max-pooling outperforms average-pooling, and attention-pooling again achieves the best results.

Interestingly, the fact that max-pooling gives better results than average-pooling when no TD model is applied is in line with the quality perception of speech communication users, who tend to give more weight to segments of the speech file with poor quality. However, this higher weighting of poor-quality segments seems to be modelled by the time-dependency stage, and therefore average-pooling achieves similar or better results than max-pooling if a TD stage is included. Overall, it can be concluded that the choice of the pooling stage has only a small influence on the performance if self-attention is applied and that attention-pooling achieves on average the best result for all tested cases.

