
Technical Program, Day 1

Monday, November 14

Oral 1: Speech Synthesis
Monday, 14 November 2022 (9:20-10:40)
Chair: Antonio Bonafonte

09:20 – 09:40
Discrete Acoustic Space for an Efficient Sampling in Neural Text-To-Speech
We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture using a split vector quantizer for NTTS, as an enhancement to the well-known Variational Autoencoder (VAE) and Vector Quantized Variational Autoencoder (VQ-VAE) architectures. Compared to these previous architectures, our proposed model retains the benefits of using an utterance-level bottleneck, while keeping significant representation power and a discretized latent space small enough for efficient prediction from text. We train the model on recordings in the expressive task-oriented dialogues domain and show that SVQ-VAE achieves a statistically significant improvement in naturalness over the VAE and VQ-VAE models. Furthermore, we demonstrate that the SVQ-VAE latent acoustic space is predictable from text, reducing the gap between the standard constant vector synthesis and vocoded recordings by 32%.
Marek Strelec, Jonas Rohnke, Antonio Bonafonte, Mateusz Lajszczak, Trevor Wood
09:40 – 10:00
An animated realistic head with vocal tract for the finite element simulation of vowel /a/
Three-dimensional (3D) acoustic models can accurately simulate the voice production mechanism. These models require detailed 3D vocal tract geometries through which sound waves propagate. A few open-source databases, typically based on magnetic resonance imaging (MRI), are already available in the literature. However, the 3D geometries they contain are mainly focused on the vocal tract and remove the head, which limits the computational domain of the simulations. This work develops a unified model consisting of an MRI-based vocal tract geometry set in a realistic head. The head is generated from scratch based on anatomical data from another subject, and contains different layers that add an organic appearance to the character. It is thus not only designed to allow accurate finite element simulations of vowels but, more importantly, it can also be animated to add a realistic visual layer to the generated sound. This is expected to help in the dissemination of results and also to open potential applications in the audiovisual and animation sector. This paper shows the first results of the model, focusing on the vowel /a/.
Marc Arnela, Leonardo Pereira-Vivas, Jorge Egea
10:00 – 10:20
Exploring the limits of neural voice cloning: A case study on two well-known personalities
This work describes one successful and one failed voice cloning process for two famous personalities, intended for broadcast in a high-impact podcast and on a Spanish public television program. While a good-quality synthesised voice could be generated for the first public figure, the voice of the second was not adequate for broadcast on television given its low speech quality. In this study, we explore the limits of neural voice cloning considering the different conditions of the training material employed in each case and, based on several objective measures (amount of training data, phoneme coverage, SNR, MCD and PESQ), we analyse the main features to be considered for high-quality synthetic voice generation. In addition, a webpage is provided in which samples of the resulting audios are available for each cloning model.
Ander González-Docasal, Aitor Álvarez, Haritz Arzelus
10:20 – 10:40
Analysis of iterative adaptive and quasi closed phase inverse filtering techniques on OPENGLOT synthetic vowels
Three-dimensional source-filter models allow for the articulatory-based generation of voice, but as yet with limited expressiveness. From the analysis of expressive speech corpora through glottal inverse filtering techniques, it has been observed that both the vocal tract and the glottal source play a key role in the generation of different phonation types. However, the accuracy of the source-filter decomposition depends on the considered technique. Current Quasi Closed Phase (QCP) and Iterative Adaptive Inverse Filtering (IAIF) based approaches achieve good results, although they are difficult to compare as they were obtained from different experiments. This work aims at evaluating the performance of these state-of-the-art methods on the reference OPENGLOT database, using its repository of synthetic vowels generated with different phonation types and fundamental frequencies. After optimizing the parameters of each inverse filtering approach, their performance is compared considering typical glottal flow error measures. The results show that QCP-based techniques attain statistically significantly lower values in most measures. IAIF variants achieve a significant improvement on the spectral tilt error measure with respect to the original IAIF, but they are surpassed by QCP when spectral tilt compensation is applied.
Marc Freixes, Joan Claudi Socoró and Francesc Alías


Keynote 1
Monday, 14 November 2022 (11:00-12:00)

11:00 – 12:00
Secure and explainable voice biometrics
Anti-spoofing for voice biometrics is now an established area of research, thanks to the four competitive ASVspoof challenges (the fifth is currently underway) that have taken place over the past decade. Growing research effort has been invested, firstly, in the development of front-end representations that capture more reliably the tell-tale artefacts that are indicative of utterances generated with text-to-speech and voice conversion algorithms and, secondly, in the development of deep and end-to-end solutions. Despite enormous efforts and positive achievements, little is still known about the artefacts these recognisers use to identify spoofed utterances or to distinguish between bona fide and spoofed speech. Although many unanswered questions remain, this talk aims to provide insights and inspiration, through examples, into the behaviour of voice anti-spoofing systems. Particular attention will be given to data augmentation and boosting methods that have been shown to be instrumental to reliability. The ultimate goal is to better understand these artefacts from a physical and perceptual point of view, and how they are actually seen by automatic processes, which puts us in a better position to design more reliable countermeasures.
Massimiliano Todisco


Posters 1: Topics on Speech and Language Technologies
Monday, 14 November 2022 (12:00-13:30)
Chair: Antonio Teixeira

12:00 – 13:30
An Experimental Study on Light Speech Features for Small-Footprint Keyword Spotting
Keyword spotting (KWS) is, in many instances, intended to run on smart electronic devices characterized by limited computational resources. To meet computational constraints, a series of techniques —ranging from feature and acoustic model parameter quantization to the reduction of the number of model parameters and required multiplications— has been explored in the literature. With this same aim, in this paper, we study a straightforward alternative consisting of reducing the spectro/cepstro-temporal resolution of the log-Mel and Mel-frequency cepstral coefficient feature matrices commonly employed in KWS. We show that the feature matrix size has a strong impact on the number of multiplications/energy consumption of a state-of-the-art KWS acoustic model based on a convolutional neural network. Experimental results demonstrate that the number of elements in commonly used speech feature matrices can be reduced by a factor of 8 while essentially maintaining KWS performance. Even more interestingly, this size reduction yields a 9.6× reduction in the number of multiplications/energy consumption, as well as 4.0× and 3.7× reductions in training and inference time, respectively.
Iván López-Espejo, Zheng-Hua Tan and Jesper Jensen
12:00 – 13:30
S3prl-Disorder: Open-Source Voice Disorder Detection System based in the Framework of S3PRL-toolkit
This paper introduces S3prl-Disorder, an open-source toolkit for Automatic Voice Disorder Detection (AVDD) developed in the framework of the S3prl toolkit. It focuses on a binary classification task between healthy and pathological speech in the Saarbruecken Voice Database (SVD). However, the framework leaves room for future extensions to multi-class classification to differentiate among pathologies and to incorporate more datasets. This work aims to contribute to the development of automatic systems for the diagnosis, treatment, and monitoring of voice pathologies in a common framework that allows reproducibility and comparability among systems and results.
Dayana Ribas, Miguel Angel Pastor Yoldi, Antonio Miguel, David Martínez, Alfonso Ortega and Eduardo Lleida
12:00 – 13:30
Active Learning Improves the Teacher’s Experience: A Case Study in a Language Grounding Scenario
Active Learning, that is, assigning the responsibility of learning to the students, is an important tool in education, as it makes students engage with and think about the things they do. A similar concept was adopted in the context of Machine Learning as a means to reduce the annotation effort by selecting the examples that are most relevant or provide the most information at a given time. Most studies on this subject focus on the learner’s performance. However, in interactive scenarios, the teacher’s experience is also a relevant aspect, as it affects their willingness to interact with artificial learners. In this paper, we address that aspect by performing a case study in a language grounding scenario, in which humans have to engage in dialog with a learning agent and teach it how to recognize observations of certain objects. Overall, the results of our experiments show that humans prefer to interact with an active learner, as it seems more intelligent, gives them a better perception of its knowledge, and makes the dialog more natural and enjoyable.
Filipe Reynaud, Eugénio Ribeiro and David Martins de Matos
12:00 – 13:30
The role of window length and shift in complex-domain DNN-based speech enhancement
Deep learning techniques have been widely applied to speech enhancement, as they show the outstanding modeling capabilities needed for proper speech-noise separation. In contrast to other end-to-end approaches, masking-based methods consider speech spectra as input to the deep neural network, providing spectral masks for noise removal or attenuation. In these approaches, the Short-Time Fourier Transform (STFT) and, particularly, the parameters used for the analysis/synthesis window, play an important role which is often neglected. In this paper, we analyze the effects of window length and shift on a complex-domain convolutional-recurrent neural network (DCCRN) which is able to provide, separately, magnitude and phase corrections. Different perceptual quality and intelligibility objective metrics are used to assess its performance. As a result, we have observed that phase corrections have an increased impact with shorter window sizes. Similarly, as window overlap increases, phase takes on more relevance than the magnitude spectrum in speech enhancement.
Celia García-Ruiz, Angel M. Gomez and Juan M. Martín-Doñas
12:00 – 13:30
Neural Detection of Cross-lingual Syntactic Knowledge
In recent years, there has been prominent development in pretrained multilingual language models, such as mBERT, XLMR, etc., which are able to capture and learn linguistic knowledge from input across a variety of languages simultaneously. However, little is known about where multilingual models localise what they have learnt across languages. In this paper, we specifically evaluate cross-lingual syntactic information embedded in CINO, a more recent multilingual pre-trained language model. We probe CINO on Universal Dependencies treebank datasets of English and Mandarin Chinese for two syntax-related layer-wise evaluation tasks: Part-of-Speech Tagging at token level and Syntax Tree-depth Prediction at sentence level. The results of our layer-wise probing experiments show that token-level syntax is localisable in higher layers, with consistency across the typologically different languages, whereas sentence-level syntax is distributed across the layers in both typology-specific and universal manners.
Yongjian Chen and Mireia Farrús
12:00 – 13:30
Efficient Transformers for End-to-End Neural Speaker Diarization
The recently proposed End-to-End Neural speaker Diarization framework (EEND) handles speech overlap and speech activity detection natively. While extensions of this work have reported remarkable results in both two-speaker and multi-speaker diarization scenarios, these come at the cost of a long training process that requires considerable memory and computational power. In this work, we explore the integration of efficient transformer variants into the Self-Attentive EEND with Encoder-Decoder based Attractors (SA-EEND EDA) architecture. Since it is based on Transformers, the cost of training SA-EEND EDA is driven by the quadratic time and memory complexity of their self-attention mechanism. We verify that the use of a linear attention mechanism in SA-EEND EDA decreases GPU memory usage by 22%. We conduct experiments to measure how the increased efficiency of the training process translates into the two-speaker diarization error rate on CALLHOME, quantifying the impact of increasing the batch size, model size or sequence length on training time and diarization performance. In addition, we propose an architecture combining linear and softmax attention that achieves an acceleration of 12% with a small relative DER degradation of 2%, while using the same GPU memory as the softmax attention baseline.
Sergio Izquierdo del Alamo, Beltrán Labrador, Alicia Lozano-Diez and Doroteo T. Toledano
12:00 – 13:30
CORAA NURC-SP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech
With the advent of technology, the availability of linguistic data in digital format has been increasingly encouraged to facilitate its use not only in different areas of Linguistics but also in related areas, such as natural language processing. Inspired by a protocol for digitizing the collection of the NURC (‘Cultured Linguistic Urban Norm’) project — one of the most influential in Brazilian Linguistics —, this paper aims to present the text-to-speech alignment process of the NURC-São Paulo Minimal Corpus. This subcorpus comprises 21 audio files and audio-aligned multilevel transcripts according to linguistically motivated intonation units (18 hours, 155k words), covering three text genres. The dataset — currently used to evaluate methods for processing the entire NURC-SP corpus — is publicly available in the Portulan Clarin repository [CC BY-NC-ND 4.0].
Vinícius G. Santos, Caroline Adriane Alves, Bruno Baldissera Carlotto, Bruno Angelo Papa Dias, Lucas Rafael Stefanel Gris, Renan de Lima Izaias, Maria Luiza Azevedo de Morais, Paula Marin de Oliveira, Rafael Sicoli, Flaviane Romani Fernandes Svartman, Marli Quadros Leite and Sandra Maria Aluísio
12:00 – 13:30
Speaker Characterization by means of Attention Pooling
State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has been shown to be an excellent approach to efficiently select the most relevant features captured by the front-end from the speech signal. In this paper we show excellent experimental results by adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.
Federico Costa, Miquel India and Javier Hernando
12:00 – 13:30
Enhancing the Design of a Conversational Agent for an Ethical Interaction with Children
Conversational agents (CAs) have become one of the most popular applications of speech and language technologies in the last decade. These agents employ speech interaction to perform several tasks, from information retrieval to purchasing goods from online stores. However, they are designed to address a general sector of the population, mainly adults without speech production problems, and thus fail to achieve similar performance with specific groups, such as the elderly or children. The case of children is particularly interesting because they naturally engage in interaction with these CAs and have special needs in terms of technical and ethical considerations. Therefore, CAs must fulfil certain conditions that could affect their general design in order to provide a trustworthy interaction with children. In this article we present how to improve a general CA design to fulfil the specific ethical needs of interaction with children. We address the development of a CA devoted to completing a wish list of games using user preferences, and its adaptation for children.
Marina Escobar-Planas, Emilia Gómez, Carlos-D. Martínez-Hinarejos
12:00 – 13:30
Sentiment Analysis in Portuguese Dialogues
Sentiment analysis in dialogue aims at detecting the sentiment expressed in the utterances of a conversation, which may improve human-computer interaction in natural language. In this paper, we explore different approaches for sentiment analysis in written Portuguese dialogues, mainly related to customer support in Telecommunications. If integrated into a conversational agent, this will enable the automatic identification of, and a quick reaction to, clients manifesting negative sentiments, possibly with human intervention, hopefully minimising the damage. Experiments were performed on two manually annotated real datasets: one with dialogues from the call center of a Telecommunications company (TeleComSA); another with Twitter conversations primarily involving accounts of Telecommunications companies. We compare the performance of different machine learning approaches, from traditional to more recent, with and without considering previous utterances. The fine-tuned BERT achieved the highest F1 scores in both datasets: 0.87 in the Twitter dataset, without context, and 0.93 in TeleComSA, considering context. These are interesting results and suggest that automated customer support may benefit from sentiment detection. Another interesting finding was that most models did not benefit from using previous utterances, suggesting that, in this scenario, context does not contribute much, and classifying the current utterance alone can be enough.
Isabel Carvalho, Hugo Gonçalo Oliveira and Catarina Silva
12:00 – 13:30
On the application of conformers to logical access voice spoofing attack detection
Biometric systems are exposed to spoofing attacks which may compromise their security, and automatic speaker verification (ASV) is no exception. To increase the robustness against such attacks, anti-spoofing systems have been proposed for the detection of spoofed audio attacks. However, most of these systems cannot capture long-term feature dependencies and can only extract local features. While transformers are an excellent solution for the exploitation of these long-distance correlations, they may degrade local details. On the contrary, convolutional neural networks (CNNs) are a powerful tool for extracting local features but not so much for capturing global representations. The conformer is a model that combines the best of both techniques, CNNs and transformers, to model both local and global dependencies, and has been used for speech recognition achieving state-of-the-art performance. While conformers have mainly been applied to sequence-to-sequence problems, in this work we make a preliminary study of their adaptation to a binary classification task such as anti-spoofing, with a focus on synthesis and voice-conversion-based attacks. To evaluate our proposals, experiments were carried out on the ASVspoof 2019 logical access database. The experimental results show that the proposed system obtains encouraging results, although more research will be required in order to outperform other state-of-the-art systems.
Eros Rosello, Alejandro Gomez-Alanis, Manuel Chica, Angel M. Gomez, Jose A. Gonzalez and Antonio M. Peinado
12:00 – 13:30
Speech emotion recognition in Spanish TV Debates
Emotion recognition from speech is an active field of study that can help build more natural human–machine interaction systems. Even though the advancement of deep learning technology has brought improvements in this task, it is still a very challenging field. For instance, when considering real-life scenarios, factors such as the tendency toward neutrality or the ambiguous definition of emotion can make labeling a difficult task, causing the dataset to be severely imbalanced and not very representative. In this work we considered a real-life scenario to carry out a series of emotion classification experiments. Specifically, we worked with a labeled corpus consisting of a set of audios from Spanish TV debates and their respective transcriptions. First, an analysis of the emotional information within the corpus was conducted. Then, different data representations were analyzed to choose the best one for our task: spectrograms and UniSpeech-SAT were used for audio representation and DistilBERT for text representation. As a final step, multimodal machine learning was used with the aim of improving the obtained classification results by combining acoustic and textual information.
Irune Zubiaga, Raquel Justo, M. Inés Torres and Mikel De Velasco
12:00 – 13:30
Assessing Transfer Learning and automatically annotated data in the development of Named Entity Recognizers for new domains
With recent advances in Deep Learning, pretrained models and Transfer Learning, the lack of labeled data has become the biggest bottleneck preventing the use of Named Entity Recognition (NER) in more domains and languages. To relieve the pressure of costs and time in the creation of annotated data for new domains, we recently proposed automatic annotation by an ensemble of NERs to obtain data to train a Bidirectional Encoder Representations from Transformers (BERT) based NER for Portuguese, and made a first evaluation. Results demonstrated that the method has potential, but were limited to one domain. With the main objective of a more in-depth assessment of the method's capabilities, this paper presents: (1) evaluation of the method in other domains; (2) assessment of the generalization capabilities of the trained models, by applying them to new domains without retraining; (3) assessment of additional training with in-domain data, also automatically annotated. The evaluation, performed using the test part of the MiniHAREM, Paramopama and LeNER Portuguese datasets, confirmed the potential of the approach and demonstrated the capability of models previously trained for the tourism domain to recognize entities in new domains, with better performance for entities of types PERSON, LOCAL and ORGANIZATION.
Emanuel Matos, Mário Rodrigues and António Teixeira
12:00 – 13:30
On the detection of acoustic events for public security: the challenges of the counter-terrorism domain
Massive amounts of audio-visual content are shared on public platforms every day. These contents are created with many purposes, from entertainment or teaching to extremist propaganda. Civil security actors need to monitor these platforms to detect and neutralize security threats. Generating actionable knowledge from multimedia content requires the extraction of multiple types of information, from linguistic data to sounds and background noises. Information extraction demands audio-visual annotations, a costly, time-consuming task when performed manually, which hinders the analysis of such an overwhelming amount of data. This work, performed in the context of the EU Horizon 2020 Project AIDA, addresses the challenge of building a robust sound detector focused on events relevant to the counter-terrorism domain. Our classification framework combines PLP features with a convolutional architecture to train a scalable model on a large number of events that is later fine-tuned on the subset of interest. The fusion of different corpora was also investigated, showing the difficulties posed by this task. With our framework, results attained an average F1-score of 0.53 on the target set of events. Of relevance, during the fine-tuning phase a general-purpose class was introduced, which allowed the model to generalize to ’unseen’ events, highlighting the importance of robust fine-tuning.
Anna Pompili, Tiago Luís, Nuno Monteiro, João Miranda, Carlo Mendes and Sérgio Paulo
12:00 – 13:30
Database dependence comparison in detection of physical access voice spoofing attacks
Anti-spoofing challenges are designed to work on a single database, on which we can test our model. The automatic speaker verification spoofing and countermeasures (ASVspoof) [1] challenge series is a community-led initiative that aims to promote the consideration of spoofing and the development of countermeasures. In general, the idea of analyzing the databases individually has been the dominant approach, but this can be rather misleading. This paper provides a study of the generalization capability of anti-spoofing systems based on neural networks by combining different databases for training and testing. We try to give a broader vision of the advantages of grouping different datasets. We delve into "replay attacks" on physical data. This type of attack is one of the most difficult to detect, since only a few minutes of audio samples are needed to impersonate the voice of a genuine speaker and gain access to the ASV system. To carry out this task, the ASV databases from the ASVspoof challenge [2], [3], [4] have been chosen and are used to obtain a more concrete and accurate vision of them. We report results on these databases using different neural network architectures and set-ups.
Manuel Chica, Alejandro Gomez-Alanis, Eros Rosello, Angel M. Gomez, Jose A. Gonzalez and Antonio M. Peinado
12:00 – 13:30
Measuring trust at zero-acquaintance using acted-emotional videos
Trustworthiness recognition attracts the attention of the research community due to its main role in social communication. However, few datasets are available and there are still many dimensions of trust to investigate. This paper presents a study of an annotation tool for the creation of a trustworthiness corpus. Specifically, we asked participants to rate short emotional videos extracted from RAVDESS at zero acquaintance and studied the relationship between their trustworthiness score and other characteristics of the subjects of each video. Eloquence (ρ = 0.41), kindness (ρ = 0.32), attractiveness (ρ = 0.34), and authenticity of the emotion transmitted (ρ = 0.6) are shown to be important determinants of perceived trustworthiness. In addition, we have measured a strong association between some of the variables under study; for example, physical beauty and voice pleasantness obtain ρ = 0.71, and eloquence and expressiveness ρ = 0.65, which opens a future line of investigation to study how people understand attractiveness and eloquence from these perspectives. Finally, an attribute selection strategy identified that frequency and spectral-related attributes could be accurate aural indicators of perceived trustworthiness.
Cristina Luna Jiménez, Syaheerah Lebai Lutfi, Manuel Gil-Martín, Ricardo Kleinlein, Juan M. Montero and Fernando Fernández-Martínez


Oral 2: Automatic Speech Recognition 
Monday, 14 November 2022 (15:00-17:00)
Chair: Hermann Ney

15:00 – 15:20
Galician’s Language Technologies in the Digital Age
This study was carried out in the initial stage of the European Language Equality project to report on technology support for Europe's languages. In this paper, we give an overview of the current state of automatic speech recognition technologies for Galician. In addition, we compare, over a small set of Galician TV shows, the performance of two of the most widely reported automatic recognition systems with support for Galician: the one developed by the University of Vigo and the one offered by Google. Our research shows impressive growth in the amount of data and resources created for Galician in the last four years. However, the scope of the resources and the range of tools are still limited, especially in the current context of services and technologies based on artificial intelligence and big data. The current state of support, resources, and tools for Galician makes it one of the European languages in danger of being left behind in the future.
José Manuel Ramírez Sánchez, Laura Docio-Fernandez and Carmen Garcia Mateo
15:20 – 15:40
Contextual-Utterance Training for Automatic Speech Recognition
Recent studies of streaming automatic speech recognition (ASR) recurrent neural network transducer (RNN-T)-based systems have fed the encoder with past contextual information in order to improve its word error rate (WER) performance. In this paper, we first propose a contextual-utterance training technique which makes use of the previous and future contextual utterances in order to perform an implicit adaptation to the speaker, topic and acoustic environment. We also propose a dual-mode contextual-utterance training technique for streaming ASR systems. This approach makes better use of the available acoustic context in streaming models by distilling “in-place” the knowledge of a teacher (non-streaming mode), which is able to see both past and future contextual utterances, to the student (streaming mode), which can only see the current and past contextual utterances. The experimental results show that a state-of-the-art conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed technique reduces both the WER and the average last-token emission latency by more than 6% and 40 ms relative, respectively.
Alejandro Gomez-Alanis, Lukas Drude, Andreas Schwarz, Rupak Vignesh Swaminathan and Simon Wiesler
15:40 – 16:00
Phone classification using electromyographic signals
Silent speech interfaces aim at generating speech from biosignals obtained from the human speech production system. In order to provide resources for the development of these interfaces, language-specific databases are required. Several silent speech electromyography (EMG) databases exist for English; however, a database for the Spanish language had yet to be developed. The aim of this research is to validate the experimental design of the first silent speech EMG database for Spanish, namely the new ReSSInt-EMG database. The EMG signals in this database are obtained using eight surface EMG bipolar electrode pairs located on the face and neck, and are recorded in parallel with either audible or silent speech. Phone classification experiments are performed using a set of time-domain features typically used in related works. As a validation reference, the EMG-UKA Trial Corpus is used, which is the most commonly used silent speech EMG database for English. The results show an average test accuracy of 40.85% for ReSSInt-EMG, suggesting that the data acquisition procedure for the new database is valid.
Eder Del Blanco, Inge Salomons, Eva Navas and Inma Hernáez
16:00 – 16:20
Semisupervised training of a fully bilingual ASR system for Basque and Spanish
Automatic speech recognition (ASR) of speech signals with code-switching (an abrupt language change common in bilingual communities) typically requires spoken language recognition to get single-language segments. In this paper, we present a fully bilingual ASR system for Basque and Spanish which does not require such segmentation but naturally deals with both languages using a single set of acoustic
units and a single (aggregated) language model. We also present the Basque Parliament Database (BPDB) used for the experiments in this work. A semisupervised
method is applied, which starts by training baseline acoustic models on small acoustic datasets in Basque and Spanish. These models are then used to perform phone
recognition on the BPDB training set, for which only approximate transcriptions are available. A similarity score derived from the alignment of the nominal and
recognized phonetic sequences is used to rank a set of training segments. Acoustic models are updated with those BPDB training segments for which the similarity
score exceeds a heuristically fixed threshold. Using the updated models, the Word Error Rate (WER) was reduced from 16.46 to 6.99 on the validation set and from 15.06 to 5.16 on the test set, i.e., 57.5% and 65.74% relative WER reductions over the baseline models, respectively.
Mikel Penagarikano, Amparo Varona, German Bordel and Luis J. Rodriguez-Fuentes
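The segment-selection step described above can be sketched as follows. The abstract does not give the exact score derived from the alignment, so this sketch assumes a simple similarity based on the normalized edit distance between the nominal and recognized phone sequences:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (pa != pb)))    # substitution
        prev = cur
    return prev[-1]

def select_segments(segments, threshold=0.85):
    """Keep (nominal, recognized) training segments whose phone
    sequences agree above a heuristically fixed threshold."""
    selected = []
    for nominal, recognized in segments:
        dist = edit_distance(nominal, recognized)
        score = 1.0 - dist / max(len(nominal), 1)        # similarity, <= 1
        if score >= threshold:
            selected.append((nominal, recognized))
    return selected
```

The threshold value here is illustrative; the paper fixes it heuristically on the BPDB training set.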
16:20 – 16:40
Speaker-Adapted End-to-End Visual Speech Recognition for Continuous Spanish
Different studies have shown the importance of visual cues throughout the speech perception process. In fact, the development of audiovisual approaches has led to advances in the field of speech technologies. However, although noticeable results have recently been achieved, visual speech recognition remains an open research problem. It is a task in which, by dispensing with the auditory sense, challenges such as visual ambiguities and the complexity of modeling silence must be faced. Nonetheless, some of these challenges can be alleviated when the problem is approached from a speaker-dependent perspective. Thus, this paper studies, using the Spanish LIP-RTVE database, how the estimation of specialized end-to-end systems for a specific person could affect the quality of speech recognition. First, different adaptation strategies based on the fine-tuning technique were proposed. Then, a pre-trained CTC/Attention architecture was used as a baseline throughout our experiments. Our findings showed that a two-step fine-tuning process, where the VSR system is first adapted to the task domain, provided significant improvements when the speaker adaptation was addressed. Furthermore, results comparable to the current state of the art were reached even when only a limited amount of data was available.
David Gimeno-Gomez and Carlos David Martinez Hinarejos
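The two-step adaptation order described in the abstract (task domain first, target speaker second) can be sketched generically; the `train` callable here stands in for whatever fine-tuning routine the pre-trained CTC/Attention model uses, which the abstract does not detail:

```python
def two_step_finetune(model, task_domain_data, speaker_data, train):
    """Two-step fine-tuning: adapt a pre-trained VSR model to the
    task domain first, then to the target speaker.

    `train(model, data)` returns a model fine-tuned on `data`.
    """
    # Step 1: adapt the pre-trained system to the task domain.
    model = train(model, task_domain_data)
    # Step 2: adapt the task-adapted model to the specific speaker.
    model = train(model, speaker_data)
    return model
```

The point of the sketch is only the ordering: speaker adaptation starts from the task-adapted checkpoint, not from the generic pre-trained one.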
16:40  – 17:00
Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation
High-quality data labeling from specific domains is costly and human time-consuming. In this work, we propose a self-supervised domain adaptation method based upon an iterative pseudo-forced alignment algorithm. The produced alignments are employed to customize an end-to-end Automatic Speech Recognition (ASR) system and are iteratively refined. The algorithm is fed with frame-wise character posteriors produced by a seed ASR, trained with out-of-domain data and optimized through a Connectionist Temporal Classification (CTC) loss. The alignments are computed iteratively over a corpus of broadcast TV. The process is repeated, reducing the quantity of text to be aligned or expanding the alignment window, until the best possible audio-text alignment is found. The starting timestamps, or temporal anchors, are produced solely from the confidence score of the last aligned utterance. This score is computed from the paths of the CTC-alignment matrix. With this methodology, no human-revised text references are required. Alignments from long audio files with low-quality transcriptions, such as TV captions, are filtered by confidence score and are then ready for further ASR adaptation. The obtained results, on both the Spanish RTVE2022 and CommonVoice databases, underpin the feasibility of using CTC-based systems to perform highly accurate audio-text alignment, domain adaptation and semi-supervised training of end-to-end ASR.
Fernando López and Jordi Luque
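The confidence score above is derived from paths in the CTC-alignment matrix. A simplified sketch of such a score — ignoring CTC blanks and label repetitions, and simply taking the best monotonic alignment of a character sequence to frame-wise log-posteriors via dynamic programming (an assumption; the paper's exact scoring is not reproduced in the abstract) — could be:

```python
import numpy as np

def forced_align_score(log_post, targets):
    """Average per-frame log-probability of the best monotonic
    alignment of `targets` (character indices) to the frames.

    log_post: (T, V) array of frame-wise character log-posteriors.
    Requires T >= len(targets); each target consumes >= 1 frame.
    """
    T, _ = log_post.shape
    N = len(targets)
    NEG = -1e30
    dp = np.full((T, N), NEG)
    dp[0, 0] = log_post[0, targets[0]]
    for t in range(1, T):
        # Prune states that cannot reach (T-1, N-1) monotonically.
        for n in range(max(0, N - (T - t)), min(N, t + 1)):
            stay = dp[t - 1, n]                          # repeat current char
            move = dp[t - 1, n - 1] if n > 0 else NEG    # advance to next char
            dp[t, n] = max(stay, move) + log_post[t, targets[n]]
    return dp[T - 1, N - 1] / T  # alignment confidence
```

A low score would trigger the retry loop sketched in the abstract: shrink the text span or widen the audio window, re-align, and only accept the utterance once the confidence clears the threshold.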


RTTH Assembly
Monday, 14 November 2022 (17:20-18:50)