Akshay Joshi is a Senior AI Research Scientist - Google, ETH Zurich, AMD, DFKI, Saarland University

Image credit: Medium

Robust Spoken Language Recognition

Last updated on Dec 9, 2020

Please open the PDF and Code links above to find the detailed project report and source code!

Abstract:

Accurate Grapheme-to-Phoneme (G2P) conversion is crucial for the effectiveness of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems. The task is particularly challenging when building multilingual ASR agents, which require unambiguous conversion of written words to phonemes, that is, conversion unaffected by different accents, acoustic domains, or similar-sounding words in related languages (e.g., Kannada and Sanskrit).

A typical G2P process has 3 steps:

Aligning grapheme tokens to phoneme tokens
Learning the grapheme-to-phoneme conversion (neural or statistical methods)
Selecting the best pronunciation given the trained model

In this machine learning project, the task is to import and efficiently parse a dataset containing 50 phonemes (possibly retrieved from a CNN- or Transformer-based neural model for grapheme-to-phoneme conversion) together with their representations in the embedding space as 236-dimensional vectors. To assess whether these phonemes are similar or correlated in terms of usage, semantics, or audio signature, we perform a range of analyses: computing pairwise cosine similarities, reducing dimensionality with linear and manifold learning methods, and clustering to uncover hidden patterns in the phoneme vector space.
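The on-disk format of the dataset is not described here, so the following is only a minimal sketch of the parsing step under an assumed layout: one phoneme per line, with the symbol followed by whitespace-separated vector components. The function name `read_embeddings` and the three-dimensional sample rows are hypothetical and stand in for the real 50 x 236 data.

```python
import io


def read_embeddings(handle):
    """Parse lines of the assumed form '<phoneme> <v1> <v2> ...' into a dict."""
    table = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # First token is the phoneme symbol, the rest are vector components.
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table


# Hypothetical sample in the assumed layout (3 dimensions instead of 236).
sample = io.StringIO("AA 0.1 0.2 0.3\nB -0.4 0.5 0.6\n\nCH 0.7 -0.8 0.9\n")
phoneme_vectors = read_embeddings(sample)
print(list(phoneme_vectors))
```

A dictionary keyed by phoneme keeps the labels attached to their vectors; it can be converted to a NumPy array or a Pandas data frame later if matrix operations are needed.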

Tasks:

Conduct brief background research on phoneme embeddings (vector space models, VSM).
Read the dataset into a suitable data structure (e.g., Pandas data frame, Python dictionary, NumPy array).
Compute the pairwise cosine similarity between the phonemes represented by the embeddings and obtain a matrix of similarity scores (referred to here as a confusion matrix).
Explore the embedding space with different techniques, such as dimensionality reduction and visualization (e.g., PCA, t-SNE) and different clustering methods.
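As an illustration of the cosine similarity task, here is a minimal NumPy sketch. It is not the project's own code (that is in the linked source); the function name `cosine_similarity_matrix` is hypothetical, and the embeddings are random synthetic values with the same shape as the dataset (50 phonemes, 236 dimensions).

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the (n, n) matrix of pairwise cosine similarities between rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = embeddings / np.clip(norms, 1e-12, None)
    # Dot products of unit vectors are cosine similarities.
    return unit @ unit.T


# Synthetic stand-in for the real data: 50 phonemes x 236 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 236))
similarity = cosine_similarity_matrix(embeddings)
print(similarity.shape)
```

The resulting matrix is symmetric with ones on the diagonal, which is why a heat map of it can be read like a confusion matrix: bright off-diagonal cells mark phoneme pairs whose vectors point in similar directions.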

Implemented Functions:

Pairwise Cosine Similarity Heatmap/Confusion Matrix
Agglomerative Clustering & Dendrogram Visualization
Principal Component Analysis (PCA)
Independent Component Analysis (ICA)
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Multidimensional Scaling (MDS - Metric)
PCA - DBSCAN Clustering
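To make the linear dimensionality reduction step concrete, the sketch below computes PCA through a singular value decomposition in NumPy. It is an assumption-laden illustration rather than the implementation from the linked source code: the function name `pca_project` is hypothetical and the input is synthetic random data shaped like the dataset.

```python
import numpy as np


def pca_project(data: np.ndarray, n_components: int = 2):
    """Project rows of `data` onto the top principal components via SVD."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # Fraction of total variance captured by each component.
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]


# Synthetic stand-in for the real data: 50 phonemes x 236 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 236))
points_2d, explained_ratio = pca_project(embeddings)
print(points_2d.shape)
```

The two-dimensional points can be passed to a scatter plot or to a density-based clustering method such as DBSCAN, which is the combination named in the last item of the list.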

Conclusion

A state-of-the-art deep neural network for ASR or G2P conversion must capture the semantics, syntax, and usage of words in closely related languages, and across multiple accents of the same language in all acoustic conditions, in order to group similar-sounding phonemes into the same clusters, i.e., reduce their pairwise distance in the vector space. In the heat map, there appears to be a moderate amount of similarity among phoneme vector pairs in the bottom-right corner. Finally, it is difficult to comment further on the similarity or dissimilarity of the phonemes in this task without additional context: the problem setting, the type of neural model used, and the languages processed for G2P conversion.

Machine and Deep Learning Natural Language Processing
