Monday, November 14
Oral 1: Speech Synthesis
Monday, 14 November 2022 (9:20-10:40)
Chair: Antonio Bonafonte
O1.1 09:20 – 09:40 |
Discrete Acoustic Space for an Efficient Sampling in Neural Text-To-Speech (abs We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture using a split vector quantizer for NTTS, as an enhancement to the well-known Variational Autoencoder (VAE) and Vector Quantized Variational Autoencoder (VQ-VAE) architectures. Compared to these previous architectures, our proposed model retains the benefits of using an utterance-level bottleneck, while keeping significant representation power and a discretized latent space small enough for efficient prediction from text. We train the model on recordings in the expressive task-oriented dialogues domain and show that SVQ-VAE achieves a statistically significant improvement in naturalness over the VAE and VQ-VAE models. Furthermore, we demonstrate that the SVQ-VAE latent acoustic space is predictable from text, reducing the gap between the standard constant vector synthesis and vocoded recordings by 32%. ) |
Marek Strelec, Jonas Rohnke, Antonio Bonafonte, Mateusz Lajszczak, Trevor Wood | |
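For readers unfamiliar with split vector quantization, the following is a minimal sketch (not taken from the paper) of the core idea in O1.1: an utterance-level latent vector is split into subvectors and each is quantized against its own small codebook, keeping the discrete space compact enough to predict from text. The codebook sizes and dimensions below are illustrative assumptions, not the SVQ-VAE configuration.

```python
import numpy as np

def split_vector_quantize(z, codebooks):
    """Quantize latent vector z by splitting it into equal chunks,
    each matched to its nearest entry in the corresponding codebook.
    z: (D,) latent; codebooks: list of (K, D/len(codebooks)) arrays."""
    chunks = np.split(z, len(codebooks))
    indices, quantized = [], []
    for chunk, book in zip(chunks, codebooks):
        dists = np.linalg.norm(book - chunk, axis=1)  # distance to every code
        k = int(np.argmin(dists))                     # nearest codebook entry
        indices.append(k)
        quantized.append(book[k])
    return indices, np.concatenate(quantized)

# Illustrative sizes only: a 64-dim utterance embedding split into 4 sub-codebooks of 128 entries.
rng = np.random.default_rng(0)
books = [rng.normal(size=(128, 16)) for _ in range(4)]
idx, z_q = split_vector_quantize(rng.normal(size=64), books)
print(idx, z_q.shape)
```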
O1.2 09:40 – 10:00 |
An animated realistic head with vocal tract for the finite element simulation of vowel /a/ (abs Three-dimensional (3D) acoustic models can accurately simulate the voice production mechanism. These models require detailed 3D vocal tract geometries through which sound waves propagate. A few open source databases typically based on magnetic resonance imaging (MRI) are already available in the literature. However, the 3D geometries they contain are mainly focused on the vocal tract and remove the head, which limits the computational domain of the simulations. This work develops a unified model consisting of an MRI-based vocal tract geometry set in a realistic head. The head is generated from scratch based on anatomical data of another subject, and contains different layers that add an organic appearance to the character. It is then not only designed to allow accurate finite element simulations of vowels, but more importantly, it can also be animated to add a realistic visual layer to the generated sound. This is expected to help in the dissemination of results and also to open potential applications in the audiovisual and animation sector. This paper shows the first results of the model focusing on the vowel /a/. ) |
Marc Arnela, Leonardo Pereira-Vivas, Jorge Egea | |
O1.3 10:00 – 10:20 |
Exploring the limits of neural voice cloning: A case study on two well-known personalities (abs This work describes one successful and one failed voice cloning process for two famous personalities, intended for broadcast in a high-impact podcast and on a Spanish public television program. Whilst a good-quality synthesised voice could be generated for the first public figure, the second one was not of sufficient speech quality for its broadcast on television. In this study, we explore the limits of neural voice cloning considering the different conditions of the training material employed in each case and, based on several objective measures (amount of training data, phoneme coverage, SNR, MCD and PESQ), we analyse the main features to be considered for high-quality synthetic voice generation. In addition, a webpage is provided in which samples of the resulting audios are available for each cloning model. ) |
Ander González-Docasal, Aitor Álvarez, Haritz Arzelus | |
O1.4 10:20 – 10:40 |
Analysis of iterative adaptive and quasi closed phase inverse filtering techniques on OPENGLOT synthetic vowels (abs Three-dimensional source-filter models allow for the articulatory-based generation of voice, though with limited expressiveness so far. From the analysis of expressive speech corpora through glottal inverse filtering techniques, it has been observed that both the vocal tract and the glottal source play a key role in the generation of different phonation types. However, the accuracy of the source-filter decomposition depends on the considered technique. Current Quasi Closed Phase (QCP) and Iterative Adaptive Inverse Filtering (IAIF) based approaches present good results, although they are difficult to compare as they are obtained from different experiments. This work aims at evaluating the performance of these state-of-the-art methods on the reference OPENGLOT database, using its repository with synthetic vowels generated with different phonation types and fundamental frequencies. After optimizing the parameters of each inverse filtering approach, their performance is compared considering typical glottal flow error measures. The results show that QCP-based techniques attain statistically significant lower values in most measures. IAIF variants achieve a significant improvement on the spectral tilt error measure with respect to the original IAIF, but they are surpassed by QCP when spectral tilt compensation is applied. ) |
Marc Freixes, Joan Claudi Socoró and Francesc Alías |
Keynote 1
Monday, 14 November 2022 (11:00-12:00)
KN1 11:00 – 12:00 |
Secure and explainable voice biometrics (abs Anti-spoofing for voice biometrics is now an established area of research, thanks to the four competitive ASVspoof challenges (the fifth is currently underway) that have taken place over the past decade. Growing research effort has been invested, firstly, in the development of front-end representations that capture more reliably the tell-tale artefacts that are indicative of utterances generated with text-to-speech and voice conversion algorithms and, secondly, in the development of deep and end-to-end solutions. Despite enormous efforts and positive achievements, little is still known about the artefacts these recognisers use to identify spoofed utterances or to distinguish between bona fide and spoofed speech. Although many unanswered questions remain, this talk aims to provide insights and inspirations, through examples, into the behaviour of voice anti-spoofing systems. Particular attention will be given to data augmentation and boosting methods that have been shown to be instrumental to reliability. The ultimate goal is to better understand these artefacts from a physical and perceptual point of view and how they are actually seen by automatic processes, which puts us in a better position to design more reliable countermeasures. ) |
Massimiliano Todisco |
Posters 1: Topics on Speech and Language Technologies
Monday, 14 November 2022 (12:00-13:30)
Chair: Antonio Teixeira
P1.1 12:00 – 13:30 |
An Experimental Study on Light Speech Features for Small-Footprint Keyword Spotting (abs Keyword spotting (KWS) is, in many instances, intended to run on smart electronic devices characterized by limited computational resources. To meet computational constraints, a series of techniques —ranging from feature and acoustic model parameter quantization to the reduction of the number of model parameters and required multiplications— has been explored in the literature. With this same aim, in this paper, we study a straightforward alternative consisting of the reduction of the spectro/cepstro-temporal resolution of log-Mel and Mel-frequency cepstral coefficient feature matrices commonly employed in KWS. We show that the feature matrix size has a strong impact on the number of multiplications/energy consumption of a state-of-the-art KWS acoustic model based on a convolutional neural network. Experimental results demonstrate that the number of elements in commonly used speech feature matrices can be reduced by a factor of 8 while essentially maintaining KWS performance. Even more interestingly, this size reduction leads to a 9.6× reduction in the number of multiplications/energy consumption, a 4.0× reduction in training time and a 3.7× reduction in inference time. ) |
Iván López-Espejo, Zheng-Hua Tan and Jesper Jensen | |
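As a rough illustration of the resolution-reduction idea in P1.1 (not the authors' implementation), the sketch below downsamples a log-Mel feature matrix by average pooling over time and frequency; the pooling factors and matrix size are assumptions chosen to give roughly the 8× element reduction mentioned in the abstract.

```python
import numpy as np

def reduce_resolution(feats, time_factor=4, freq_factor=2):
    """Downsample a log-Mel feature matrix (T, F) by average pooling,
    shrinking the number of elements by roughly time_factor * freq_factor."""
    T, F = feats.shape
    T2, F2 = T - T % time_factor, F - F % freq_factor   # trim to a multiple
    pooled = feats[:T2, :F2].reshape(
        T2 // time_factor, time_factor, F2 // freq_factor, freq_factor
    ).mean(axis=(1, 3))
    return pooled

# e.g. an assumed 98x40 log-Mel matrix (about 1 s of speech) reduced to 24x20.
logmel = np.random.randn(98, 40)
print(reduce_resolution(logmel).shape)   # (24, 20)
```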
P1.2 12:00 – 13:30 |
S3prl-Disorder: Open-Source Voice Disorder Detection System based in the Framework of S3PRL-toolkit (abs This paper introduces S3prl-Disorder, an open-source toolkit for Automatic Voice Disorder Detection (AVDD) developed in the framework of the S3prl toolkit. It focuses on a binary classification task between healthy and pathological speech in the Saarbruecken Voice Database (SVD). However, the framework leaves room for future extensions to multi-class classification to differentiate among pathologies and to incorporate more datasets. This work aims to contribute to the development of automatic systems for diagnosis, treatment, and monitoring of voice pathologies within a common framework that allows reproducibility and comparability of systems and results. ) |
Dayana Ribas, Miguel Angel Pastor Yoldi, Antonio Miguel, David Martínez, Alfonso Ortega and Eduardo Lleida | |
P1.3 12:00 – 13:30 |
Active Learning Improves the Teacher’s Experience: A Case Study in a Language Grounding Scenario (abs Active Learning, that is, assigning the responsibility of learning to the students, is an important tool in education as it makes the students become engaged in and think about the things they do. A similar concept was adopted in the context of Machine Learning as a means to reduce the annotation effort by selecting the examples that are most relevant or provide more information at a given time. Most studies on this subject focus on the learner’s performance. However, in interactive scenarios, the teacher’s experience is also a relevant aspect, as it affects their willingness to interact with artificial learners. In this paper, we address that aspect by performing a case study in a language grounding scenario, in which humans have to engage in dialog with a learning agent and teach it how to recognize observations of certain objects. Overall, the results of our experiments show that humans prefer to interact with an active learner, as it seems more intelligent, gives them a better perception of its knowledge, and makes the dialog more natural and enjoyable. ) |
Filipe Reynaud, Eugénio Ribeiro and David Martins de Matos | |
P1.4 12:00 – 13:30 |
The role of window length and shift in complex-domain DNN-based speech enhancement (abs Deep learning techniques have widely been applied to speech enhancement as they show outstanding modeling capabilities that are needed for proper speech-noise separation. In contrast to other end-to-end approaches, masking-based methods consider speech spectra as input to the deep neural network, providing spectral masks for noise removal or attenuation. In these approaches, the Short-Time Fourier Transform (STFT) and, particularly, the parameters used for the analysis/synthesis window, play an important role which is often neglected. In this paper, we analyze the effects of window length and shift on a complex-domain convolutional-recurrent neural network (DCCRN) which is able to provide, separately, magnitude and phase corrections. Different perceptual quality and intelligibility objective metrics are used to assess its performance. As a result, we have observed that phase corrections have an increased impact with shorter window sizes. Similarly, as window overlap increases, phase takes more relevance than magnitude spectrum in speech enhancement. ) |
Celia García-Ruiz, Angel M. Gomez and Juan M. Martín-Doñas | |
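The role of the STFT analysis parameters discussed in P1.4 can be illustrated with a short sketch (an assumed setup, not the paper's code): the same signal analysed with two window length/shift pairs yields complex spectrograms of very different time/frequency resolution, which is the input a complex-domain mask estimator such as DCCRN would operate on.

```python
import numpy as np
from scipy.signal import stft

# Two illustrative analysis settings for a 16 kHz signal: a shorter window
# (finer time resolution) and a longer window (finer frequency resolution).
fs = 16000
x = np.random.randn(fs)  # 1 s placeholder signal

for win_ms, shift_ms in [(20, 10), (32, 8)]:
    nperseg = int(fs * win_ms / 1000)
    noverlap = nperseg - int(fs * shift_ms / 1000)
    f, t, X = stft(x, fs=fs, window="hann", nperseg=nperseg, noverlap=noverlap)
    # A masking-based enhancer would predict magnitude/phase (or real/imaginary)
    # corrections for X; shorter windows make phase correction more influential.
    print(f"win={win_ms} ms shift={shift_ms} ms -> {X.shape[0]} bins x {X.shape[1]} frames")
```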
P1.5 12:00 – 13:30 |
Neural Detection of Cross-lingual Syntactic Knowledge (abs In recent years, there has been prominent development in pretrained multilingual language models, such as mBERT, XLMR, etc., which are able to capture and learn linguistic knowledge from input across a variety of languages simultaneously. However, little is known about where multilingual models localise what they have learnt across languages. In this paper, we specifically evaluate cross-lingual syntactic information embedded in CINO, a more recent multilingual pre-trained language model. We probe CINO on Universal Dependencies treebank datasets of English and Chinese Mandarin for two syntax-related layerwise evaluation tasks: Part-of-Speech Tagging at token level and Syntax Tree-depth Prediction at sentence level. The results of our layer-wise probing experiments show that token-level syntax is localisable in higher layers and consistency is shown across the typologically different languages, whereas sentence-level syntax is distributed across the layers in typology-specific and universal manners. ) |
Yongjian Chen and Mireia Farrús | |
P1.6 12:00 – 13:30 |
Efficient Transformers for End-to-End Neural Speaker Diarization (abs The recently proposed End-to-End Neural speaker Diarization framework (EEND) handles speech overlap and speech activity detection natively. While extensions of this work have reported remarkable results in both two-speaker and multispeaker diarization scenarios, these come at the cost of a long training process that requires considerable memory and computational power. In this work, we explore the integration of efficient transformer variants into the Self-Attentive EEND with Encoder-Decoder based Attractors (SA-EEND EDA) architecture. Since it is based on Transformers, the cost of training SA-EEND EDA is driven by the quadratic time and memory complexity of their self-attention mechanism. We verify that the use of a linear attention mechanism in SA-EEND EDA decreases GPU memory usage by 22%. We conduct experiments to measure how the increased efficiency of the training process translates into the two-speaker diarization error rate on CALLHOME, quantifying the impact of increasing the size of the batch, the model or the sequence length on training time and diarization performance. In addition, we propose an architecture combining linear and softmax attention that achieves an acceleration of 12% with a small relative DER degradation of 2%, while using the same GPU memory as the softmax attention baseline. ) |
Sergio Izquierdo del Alamo, Beltrán Labrador, Alicia Lozano-Diez and Doroteo T. Toledano | |
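A minimal sketch of the softmax vs. linear attention trade-off exploited in P1.6, assuming the common elu(x)+1 feature map; this is not the SA-EEND EDA code, but it shows why the kernelized variant scales linearly with the sequence length while the standard one is quadratic.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: O(T^2) time and memory in the sequence length T."""
    A = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized (linear) attention: the feature map phi(x) = elu(x) + 1 lets
    K^T V be computed first, giving O(T) time and memory in T."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                     # (d, d_v), independent of T
    Z = Qp @ Kp.sum(axis=0) + eps     # per-query normalizer
    return (Qp @ KV) / Z[:, None]

# Illustrative sizes only.
T, d = 500, 64
Q, K, V = (np.random.randn(T, d) * 0.1 for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```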
P1.7 12:00 – 13:30 |
CORAA NURC-SP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech (abs With the advent of technology, the availability of linguistic data in digital format has been increasingly encouraged to facilitate its use not only in different areas of Linguistics but also in related areas, such as natural language processing. Inspired by a protocol for digitizing the NURC (‘Cultured Linguistic Urban Norm’) project collection — one of the most influential in Brazilian Linguistics —, this paper aims to present the text-to-speech alignment process of the NURC-São Paulo Minimal Corpus. This subcorpus comprises 21 audio files and audio-aligned multilevel transcripts according to linguistically motivated intonation units (≃18 hours, ≃155 k words), covering three text genres. The dataset — currently used to evaluate methods for processing the entire NURC-SP corpus — is publicly available on the Portulan Clarin repository [CC BY-NC-ND 4.0] (https://hdl.handle.net/21.11129/0000-000F-73CA-C). ) |
Vinícius G. Santos, Caroline Adriane Alves, Bruno Baldissera Carlotto, Bruno Angelo Papa Dias, Lucas Rafael Stefanel Gris, Renan de Lima Izaias, Maria Luiza Azevedo de Morais, Paula Marin de Oliveira, Rafael Sicoli, Flaviane Romani Fernandes Svartman, Marli Quadros Leite and Sandra Maria Aluísio | |
P1.8 12:00 – 13:30 |
Speaker Characterization by means of Attention Pooling (abs State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer to encode variable length utterances into fixed-length speaker vectors. The authors have recently proposed the use of a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has shown to be an excellent approach to efficiently select the most relevant features captured by the front-end from the speech signal. In this paper we show excellent experimental results by adapting this architecture to other different speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection. ) |
Federico Costa, Miquel India and Javier Hernando | |
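P1.8 builds on a Double Multi-Head Self-Attention pooling; as a simplified, single-head sketch of the underlying idea (an illustration under assumed shapes, not the authors' architecture), attention pooling collapses variable-length frame features into a fixed-length embedding via a learned scoring vector.

```python
import numpy as np

def self_attention_pooling(H, w):
    """Collapse variable-length frame features H (T, D) into one fixed-length
    vector: attention weights are a softmax over a learned scoring vector w (D,)."""
    scores = H @ w                         # (T,) frame relevance scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ H                       # attention-weighted average, shape (D,)

# Illustrative sizes only: 300 frames of 256-dim front-end features.
H = np.random.randn(300, 256)
w = np.random.randn(256)
embedding = self_attention_pooling(H, w)
print(embedding.shape)   # (256,)
```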
P1.9 12:00 – 13:30 |
Enhancing the Design of a Conversational Agent for an Ethical Interaction with Children (abs Conversational agents (CAs) have become one of the most popular applications of speech and language technologies in the last decade. Those agents employ speech interaction to perform several tasks, from information retrieval to purchasing goods from online stores. However, these agents are designed to address a general sector of the population, mainly adults without speech production problems, and thus they fail to obtain a similar performance with specific groups, such as the elderly or children. The case of children is particularly interesting because they naturally engage in interaction with those CAs and they have special needs in terms of technical and ethical considerations. Therefore, CAs must fulfil some conditions that could affect their general design in order to provide a trustworthy interaction with children. In this article we present how to improve a general CA design to fulfil the specific ethical needs of children interaction. We address the development of a CA devoted to completing a wish list of games using user preferences, and its improvements towards children. ) |
Marina Escobar-Planas, Emilia Gómez, Carlos-D. Martínez-Hinarejos | |
P1.10 12:00 – 13:30 |
Sentiment Analysis in Portuguese Dialogues (abs Sentiment analysis in dialogue aims at detecting the sentiment expressed in the utterances of a conversation, which may improve human-computer interaction in natural language. In this paper, we explore different approaches for sentiment analysis in written Portuguese dialogues, mainly related to customer support in Telecommunications. If integrated into a conversational agent, this will enable the automatic identification of and a quick reaction to clients manifesting negative sentiments, possibly with human intervention, hopefully minimising the damage. Experiments were performed on two manually annotated real datasets: one with dialogues from the call-center of a Telecommunications company (TeleComSA); another of Twitter conversations primarily involving accounts of Telecommunications companies. We compare the performance of different machine learning approaches, from traditional to more recent, with and without considering previous utterances. The fine-tuned BERT achieved the highest F1 scores in both datasets: 0.87 in the Twitter dataset, without context, and 0.93 in TeleComSA, considering context. These are interesting results and suggest that automated customer support may benefit from sentiment detection. Another interesting finding was that most models did not benefit from using previous utterances, suggesting that, in this scenario, context does not contribute much, and classifying the current utterance can be enough. ) |
Isabel Carvalho, Hugo Gonçalo Oliveira and Catarina Silva | |
P1.11 12:00 – 13:30 |
On the application of conformers to logical access voice spoofing attack detection (abs Biometric systems are exposed to spoofing attacks which may compromise their security, and automatic speaker verification (ASV) is no exception. To increase the robustness against such attacks, anti-spoofing systems have been proposed for the detection of spoofed audio attacks. However, most of these systems cannot capture long-term feature dependencies and can only extract local features. While transformers are an excellent solution for the exploitation of these long-distance correlations, they may degrade local details. On the contrary, convolutional neural networks (CNNs) are a powerful tool for extracting local features but not so much for capturing global representations. The conformer is a model that combines the best of both techniques, CNNs and transformers, to model both local and global dependencies and has been used for speech recognition achieving state-of-the-art performance. While conformers have been mainly applied to sequence-to-sequence problems, in this work we make a preliminary study of their adaptation to a binary classification task such as anti-spoofing, with a focus on synthesis and voice-conversion-based attacks. To evaluate our proposals, experiments were carried out on the ASVspoof 2019 logical access database. The experimental results show that the proposed system can obtain encouraging results, although more research will be required in order to outperform other state-of-the-art systems. ) |
Eros Rosello, Alejandro Gomez-Alanis, Manuel Chica, Angel M. Gomez, Jose A. Gonzalez and Antonio M. Peinado | |
P1.12 12:00 – 13:30 |
Speech emotion recognition in Spanish TV Debates (abs Emotion recognition from speech is an active field of study that can help build more natural human–machine interaction systems. Even though the advancement of deep learning technology has brought improvements in this task, it is still a very challenging field. For instance, when considering real-life scenarios, factors such as the tendency toward neutrality or the ambiguous definition of emotion can make labeling a difficult task, causing the dataset to be severely imbalanced and not very representative. In this work we considered a real-life scenario to carry out a series of emotion classification experiments. Specifically, we worked with a labeled corpus consisting of a set of audios from Spanish TV debates and their respective transcriptions. First, an analysis of the emotional information within the corpus was conducted. Then different data representations were analyzed to choose the best one for our task: spectrograms and UniSpeech-SAT were used for audio representation and DistilBERT for text representation. As a final step, Multimodal Machine Learning was used with the aim of improving the obtained classification results by combining acoustic and textual information. ) |
Irune Zubiaga, Raquel Justo, M. Inés Torres and Mikel De Velasco | |
P1.13 12:00 – 13:30 |
Assessing Transfer Learning and automatically annotated data in the development of Named Entity Recognizers for new domains (abs With recent advances in Deep Learning, pretrained models and Transfer Learning, the lack of labeled data has become the biggest bottleneck preventing the use of Named Entity Recognition (NER) in more domains and languages. To relieve the pressure of costs and time in the creation of annotated data for new domains, we recently proposed automatic annotation by an ensemble of NERs to obtain data to train a Bidirectional Encoder Representations from Transformers (BERT) based NER for Portuguese, and made a first evaluation. Results demonstrated the method has potential but were limited to one domain. Having as main objective a more in-depth assessment of the method's capabilities, this paper presents: (1) evaluation of the method in other domains; (2) assessment of the generalization capabilities of the trained models, by applying them to new domains without retraining; (3) assessment of additional training with in-domain data, also automatically annotated. Evaluation, performed using the test part of the MiniHAREM, Paramopama and LeNER Portuguese datasets, confirmed the potential of the approach and demonstrated the capability of models previously trained for the tourism domain to recognize entities in new domains, with better performance for entities of types PERSON, LOCAL and ORGANIZATION. ) |
Emanuel Matos, Mário Rodrigues and António Teixeira | |
P1.14 12:00 – 13:30 |
On the detection of acoustic events for public security: the challenges of the counter-terrorism domain (abs Massive amounts of audio-visual content are shared on public platforms every day. These contents are created for many purposes, from entertaining or teaching to extremist propaganda. Civil security actors need to monitor these platforms to detect and neutralize security threats. Generating actionable knowledge from multimedia contents requires the extraction of multiple types of information, from linguistic data to sounds and background noises. Information extraction demands audio-visual annotations, a costly, time-consuming task when performed manually, which hinders the analysis of such an overwhelming amount of data. This work, performed in the context of the EU Horizon 2020 Project AIDA, addresses the challenge of building a robust sound detector focused on events relevant to the counter-terrorism domain. Our classification framework combines PLP features with a convolutional architecture to train a scalable model on a large number of events that is later fine-tuned on the subset of interest. The fusion of different corpora was also investigated, showing the difficulties posed by this task. With our framework, results attained an average F1-score of 0.53% on the target set of events. Of relevance, during the fine-tuning phase a general-purpose class was introduced, which allowed the model to generalize to 'unseen' events, highlighting the importance of robust fine-tuning. ) |
Anna Pompili, Tiago Luís, Nuno Monteiro, João Miranda, Carlo Mendes and Sérgio Paulo | |
P1.15 12:00 – 13:30 |
Database dependence comparison in detection of physical access voice spoofing attacks (abs The antispoofing challenges are designed to work on a single database, on which we can test our model. The automatic speaker verification spoofing and countermeasures (ASVspoof) [1] challenge series is a community-led initiative that aims to promote the consideration of spoofing and the development of countermeasures. In general, the idea of analyzing the databases individually has been the dominant approach, but this could be rather misleading. This paper provides a study of the generalization capability of antispoofing systems based on neural networks by combining different databases for training and testing. We will try to give a broader vision of the advantages of grouping different datasets. We will delve into the "replay attacks" on physical data. This type of attack is one of the most difficult to detect since only a few minutes of audio samples are needed to impersonate the voice of a genuine speaker and gain access to the ASV system. To carry out this task, the ASV databases from the ASVspoof challenge [2], [3], [4] have been chosen and will be used to have a more concrete and accurate vision of them. We report results on these databases using different neural network architectures and set-ups. ) |
Manuel Chica, Alejandro Gomez-Alanis, Eros Rosello, Angel M. Gomez, Jose A. Gonzalez and Antonio M. Peinado | |
P1.16 12:00 – 13:30 |
Measuring trust at zero-acquaintance using acted-emotional videos (abs Trustworthiness recognition attracts the attention of the research community due to its main role in social communications. However, few datasets are available and there are still many dimensions of trust to investigate. This paper presents a study of an annotation tool for the creation of a trustworthiness corpus. Specifically, we asked the participants to rate short emotional videos extracted from RAVDESS at zero acquaintance and studied the relationship between their trustworthiness score and other characteristics of the subjects of each video. Eloquence (ρ = 0.41), kindness (ρ = 0.32), attractiveness (ρ = 0.34), and authenticity of the emotion transmitted (ρ = 0.6) are shown to be important determinants of perceived trustworthiness. In addition, we have measured a strong association between some of the variables under study. For example, physical beauty and voice pleasantness obtain a ρ = 0.71, or eloquence and expressiveness (ρ = 0.65), which opens a future line of investigation to study how people understand attractiveness and eloquence from these perspectives. Finally, an attribute selection strategy identified that frequency and spectral-related attributes could be accurate aural indicators of perceived trustworthiness. ) |
Cristina Luna Jiménez, Syaheerah Lebai Lutfi, Manuel Gil-Martín, Ricardo Kleinlein, Juan M. Montero and Fernando Fernández-Martínez |
Oral 2: Automatic Speech Recognition
Monday, 14 November 2022 (15:00-17:00)
Chair: Hermann Ney
O2.1 15:00 – 15:20 |
Galician’s Language Technologies in the Digital Age (abs This study was carried out under the initial stage of the European Language Equality project to report technology support for Europe’s languages. In this paper, we show an overview of the current state of automatic speech recognition technologies for Galician. In addition, we compare, over a small set of Galician TV shows, the performance of two of the most widely reported automatic speech recognition systems with support for Galician: the one developed by the University of Vigo and the one offered by Google. Our research shows impressive growth in the amount of data and resources created for Galician in the last four years. However, the scope of the resources and the range of tools are still limited, especially in the current context of services and technologies based on artificial intelligence and big data. The current state of support, resources, and tools for Galician makes it one of the European languages in danger of being left behind in the future. ) |
José Manuel Ramírez Sánchez, Laura Docio-Fernandez and Carmen Garcia Mateo | |
O2.2 15:20 – 15:40 |
Contextual-Utterance Training for Automatic Speech Recognition (abs Recent studies of streaming automatic speech recognition (ASR) recurrent neural network transducer (RNN-T)-based systems have fed the encoder with past contextual information in order to improve its word error rate (WER) performance. In this paper, we first propose a contextual-utterance training technique which makes use of the previous and future contextual utterances in order to do an implicit adaptation to the speaker, topic and acoustic environment. Also, we propose a dual-mode contextual-utterance training technique for streaming ASR systems. This proposed approach allows a better use of the available acoustic context in streaming models by distilling “in-place” the knowledge of a teacher (non-streaming mode), which is able to see both past and future contextual utterances, to the student (streaming mode) which can only see the current and past contextual utterances. The experimental results show that a state-of-the-art conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed technique is able to reduce both the WER and the average last token emission latency by more than 6% and 40 ms relative, respectively. ) |
Alejandro Gomez-Alanis, Lukas Drude, Andreas Schwarz, Rupak Vignesh Swaminathan and Simon Wiesler | |
O2.3 15:40 – 16:00 |
Phone classification using electromyographic signals (abs Silent speech interfaces aim at generating speech from biosignals obtained from the human speech production system. In order to provide resources for the development of these interfaces, language-specific databases are required. Several silent speech electromyography (EMG) databases for English exist. However, a database for the Spanish language had yet to be developed. The aim of this research is to validate the experimental design of the first silent speech EMG database for Spanish, namely the new ReSSInt-EMG database. The EMG signals in this database are obtained using eight surface EMG bipolar electrode pairs located in the face and neck and are recorded in parallel with either audible or silent speech. Phone classification experiments are performed, using a set of time-domain features typically used in related works. As a validation reference, the EMG-UKA Trial Corpus is used, which is the most commonly used silent speech EMG database for English. The results show an average test accuracy of 40.85% for ReSSInt-EMG, suggesting that the data acquisition procedure for the new database is valid. ) |
Eder Del Blanco, Inge Salomons, Eva Navas and Inma Hernáez | |
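As an illustration of the kind of time-domain framing commonly used in EMG-based silent speech work such as O2.3 (the feature set, window length and sampling rate below are assumptions for the sketch, not the ReSSInt-EMG specification):

```python
import numpy as np

def frame_time_domain_features(emg, fs=2048, win_ms=27, shift_ms=10):
    """Frame one EMG channel and compute simple time-domain features per frame:
    mean absolute amplitude, mean power and zero-crossing rate.
    fs, win_ms and shift_ms are illustrative values."""
    win = int(fs * win_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    feats = []
    for start in range(0, len(emg) - win + 1, shift):
        frame = emg[start:start + win]
        mean_amp = np.mean(np.abs(frame))
        power = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))  # sign-change rate
        feats.append([mean_amp, power, zcr])
    return np.array(feats)   # (num_frames, 3) per channel; multiple channels would be stacked

emg_channel = np.random.randn(2048 * 2)   # 2 s of placeholder EMG at an assumed 2048 Hz
print(frame_time_domain_features(emg_channel).shape)
```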
O2.4 16:00 – 16:20 |
Semisupervised training of a fully bilingual ASR system for Basque and Spanish (abs Automatic speech recognition (ASR) of speech signals with code-switching (an abrupt language change common in bilingual communities) typically requires spoken language recognition to get single-language segments. In this paper, we present a fully bilingual ASR system for Basque and Spanish which does not require such segmentation but naturally deals with both languages using a single set of acoustic units and a single (aggregated) language model. We also present the Basque Parliament Database (BPDB) used for the experiments in this work. A semisupervised method is applied, which starts by training baseline acoustic models on small acoustic datasets in Basque and Spanish. These models are then used to perform phone recognition on the BPDB training set, for which only approximate transcriptions are available. A similarity score derived from the alignment of the nominal and recognized phonetic sequences is used to rank a set of training segments. Acoustic models are updated with those BPDB training segments for which the similarity score exceeds a heuristically fixed threshold. Using the updated models, Word Error Rate (WER) reduced from 16.46 to 6.99 on the validation set, and from 15.06 to 5.16 on the test set, meaning 57.5% and 65.74% relative WER reductions over baseline models, respectively. ) |
Mikel Penagarikano, Amparo Varona, German Bordel and Luis J. Rodriguez-Fuentes | |
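A minimal sketch of the segment-selection step described in O2.4, with a generic string-similarity ratio standing in for whatever alignment-based score the authors actually use; the phone sequences and threshold below are hypothetical.

```python
import difflib

def phone_similarity(nominal, recognized):
    """Similarity in [0, 1] between the nominal phone sequence (from the
    approximate transcription) and the recognized one; 1 means a perfect match."""
    return difflib.SequenceMatcher(None, nominal, recognized).ratio()

def select_segments(segments, threshold=0.9):
    """Keep only training segments whose similarity score exceeds a
    heuristically fixed threshold, as in the semisupervised scheme of O2.4."""
    return [seg for seg in segments
            if phone_similarity(seg["nominal"], seg["recognized"]) > threshold]

# Toy example with hypothetical phone sequences.
segments = [
    {"nominal": ["e", "u", "s", "k", "o"], "recognized": ["e", "u", "s", "k", "o"]},
    {"nominal": ["b", "u", "e", "n", "o", "s"], "recognized": ["g", "a", "b", "o", "n"]},
]
print(len(select_segments(segments)))   # only the well-matched first segment survives
```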
O2.5 16:20 – 16:40 |
Speaker-Adapted End-to-End Visual Speech Recognition for Continuous Spanish (abs Different studies have shown the importance of visual cues throughout the speech perception process. In fact, the development of audiovisual approaches has led to advances in the field of speech technologies. However, although noticeable results have recently been achieved, visual speech recognition remains an open research problem. It is a task in which, by dispensing with the auditory sense, challenges such as visual ambiguities and the complexity of modeling silence must be faced. Nonetheless, some of these challenges can be alleviated when the problem is approached from a speaker-dependent perspective. Thus, this paper studies, using the Spanish LIPRTVE database, how the estimation of specialized end-to-end systems for a specific person could affect the quality of speech recognition. First, different adaptation strategies based on the fine-tuning technique were proposed. Then, a pre-trained CTC/Attention architecture was used as a baseline throughout our experiments. Our findings showed that a two-step fine-tuning process, where the VSR system is first adapted to the task domain, provided significant improvements when the speaker adaptation was addressed. Furthermore, results comparable to the current state of the art were reached even when only a limited amount of data was available. ) |
David Gimeno-Gomez and Carlos David Martinez Hinarejos | |
O2.6 16:40 – 17:00 |
Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation (abs High-quality data labeling from specific domains is costly and human time-consuming. In this work, we propose a self-supervised domain adaptation method, based upon an iterative pseudo-forced alignment algorithm. The produced alignments are employed to customize an end-to-end Automatic Speech Recognition (ASR) system and are iteratively refined. The algorithm is fed with frame-wise character posteriors produced by a seed ASR, trained with out-of-domain data and optimized through a Connectionist Temporal Classification (CTC) loss. The alignments are computed iteratively upon a corpus of broadcast TV. The process is repeated by reducing the quantity of text to be aligned or expanding the alignment window until finding the best possible audio-text alignment. The starting timestamps, or temporal anchors, are produced based solely on the confidence score of the last aligned utterance. This score is computed with the paths of the CTC-alignment matrix. With this methodology, no human-revised text references are required. Alignments from long audio files with low-quality transcriptions, like TV captions, are filtered by confidence score and ready for further ASR adaptation. The obtained results, on both the Spanish RTVE2022 and CommonVoice databases, underpin the feasibility of using CTC-based systems to perform highly accurate audio-text alignments, domain adaptation and semi-supervised training of end-to-end ASR. ) |
Fernando López and Jordi Luque |
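To make the alignment-and-confidence idea of O2.6 concrete, here is a minimal, assumption-based sketch of CTC Viterbi (forced) alignment over frame-wise posteriors, with the confidence taken as the mean log-posterior along the best path; it is not the authors' iterative algorithm, only the core alignment step it builds on.

```python
import numpy as np

def ctc_viterbi_alignment(log_probs, labels, blank=0):
    """Best monotonic alignment of a label sequence to frame-wise CTC
    log-posteriors (T, V), plus a confidence score (mean log-posterior per frame).
    Minimal Viterbi over the blank-expanded label sequence."""
    T = log_probs.shape[0]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    dp = np.full((T, S), -np.inf)
    bp = np.zeros((T, S), dtype=int)
    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            prev = [s, s - 1]
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                prev.append(s - 2)           # allowed skip over blank
            prev = [p for p in prev if p >= 0]
            best = max(prev, key=lambda p: dp[t - 1, p])
            dp[t, s] = dp[t - 1, best] + log_probs[t, ext[s]]
            bp[t, s] = best
    end = S - 1 if S < 2 or dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    confidence = dp[T - 1, end] / T          # mean log-posterior along the best path
    path = [end]
    for t in range(T - 1, 0, -1):            # backtrack through the alignment matrix
        path.append(bp[t, path[-1]])
    path.reverse()
    return [ext[s] for s in path], confidence

# Toy usage with random posteriors: 200 frames, 30 symbols, hypothetical label indices.
logp = np.log(np.random.dirichlet(np.ones(30), size=200))
path, conf = ctc_viterbi_alignment(logp, labels=[5, 12, 7])
print(len(path), conf)
```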
RTTH Assembly
Monday, 14 November 2022 (17:20-18:50)