Deep Neural Network Embedding For Text-Independent Speaker Verification

AI Models 2020. 12. 28. 17:36

Overview

Replace i-vectors with embeddings extracted from a feed-forward deep neural network for text-independent speaker verification
Use a temporal pooling layer to aggregate speech segments of variable length

Features
- 20-dimensional MFCCs with a 25 ms frame length
- Energy-based VAD is applied
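As a rough illustration of the energy-based VAD step, the sketch below splits a signal into 25 ms frames and keeps the frames whose log-energy is above the utterance mean. The 10 ms hop, the mean-based threshold, and the function names `frame_signal`, `log_energy`, and `energy_vad` are assumptions made for this example; the summary does not state the exact decision rule used in the paper.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a list of samples into overlapping frames.

    frame_ms=25 follows the text; hop_ms=10 is an assumed value.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]


def log_energy(frame, floor=1e-10):
    """Natural log of the summed squared amplitude, floored to avoid log(0)."""
    return math.log(max(sum(x * x for x in frame), floor))


def energy_vad(frames):
    """Return one boolean per frame: True where the frame's log-energy
    exceeds the utterance mean (a hypothetical rule chosen for illustration)."""
    if not frames:
        return []
    energies = [log_energy(f) for f in frames]
    threshold = sum(energies) / len(energies)
    return [e > threshold for e in energies]


# Toy input: half a second of near-silence followed by a 440 Hz tone.
sample_rate = 16000
silence = [0.001] * 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(8000)]
frames = frame_signal(silence + tone, sample_rate)
decisions = energy_vad(frames)
speech_frames = [f for f, keep in zip(frames, decisions) if keep]
```

Only the frames in `speech_frames` would be passed on to MFCC extraction; a production VAD would typically smooth the decisions over neighboring frames as well.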
Model: Time-delay neural network (TDNN)
- The first 5 layers operate at the frame level
- The statistics pooling layer aggregates over the input segment and computes its mean and standard deviation
- Two fully connected layers reduce the dimension to 512 and then 300, followed by a softmax output over the training speakers
- Total parameters: 4.4M
- Both embeddings, a and b, are used
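The statistics pooling step is what lets the network accept segments of any length: it collapses T frame-level vectors into a single fixed-size vector of per-dimension means and standard deviations. A minimal sketch in plain Python follows; the function name `statistics_pooling` and the use of the population standard deviation (dividing by T) are assumptions for this example.

```python
import math


def statistics_pooling(frames):
    """Pool T frame-level vectors of dimension D into one vector of size 2*D.

    The output is the per-dimension means followed by the per-dimension
    standard deviations (population form, dividing by T; an assumption).
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Segments of different lengths map to vectors of the same size.
short_segment = [[1.0, 2.0], [3.0, 2.0]]
long_segment = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]
pooled_short = statistics_pooling(short_segment)
pooled_long = statistics_pooling(long_segment)
```

Because the pooled vector has a fixed size regardless of T, the fully connected layers that follow can be ordinary dense layers.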
PLDA backend
- Embedding dimension is reduced using LDA
- Length normalization is applied to the embeddings
- PLDA scores are normalized using adaptive s-norm

Proposes DNN-based frame-level feature extraction for text-independent speaker verification
Overall, the embeddings appear to be competitive with traditional i-vectors
For short utterances, the DNN-based features show better performance
