
Technical Program, Day 2

Tuesday, 15 November

Oral 3: Speech and Audio Processing
Tuesday, 15 November 2022 (9:00-10:40)
Chair: Alfonso Ortega

09:00  – 09:20
On the potential of jointly-optimised solutions to spoofing attack detection and automatic speaker verification
The spoofing-aware speaker verification (SASV) challenge was designed to promote the study of jointly-optimised solutions to accomplish the traditionally separately-optimised tasks of spoofing detection and speaker verification. Jointly-optimised systems have the potential to operate in synergy as a better performing solution to the single task of reliable speaker verification. However, none of the 23 submissions to SASV 2022 are jointly optimised. We have hence sought to determine why separately-optimised sub-systems perform best or why joint optimisation was not successful. Experiments reported in this paper show that joint optimisation is successful in improving robustness to spoofing but that it degrades speaker verification performance. The findings suggest that spoofing detection and speaker verification sub-systems should be optimised jointly in a manner which reflects the differences in how information provided by each sub-system is complementary to that provided by the other. Progress will also likely depend upon the collection of data from a larger number of speakers.
Wanying Ge, Hemlata Tak, Massimiliano Todisco and Nicholas Evans
09:20  – 09:40
A Study on the Use of wav2vec Representations for Multiclass Audio Segmentation
This paper presents a study on the use of new unsupervised representations through wav2vec models, seeking to jointly model speech and music fragments of audio signals in a multiclass audio segmentation task. Previous studies have already described the capabilities of deep neural networks in binary and multiclass audio segmentation tasks. Particularly, the separation of speech, music and noise signals through audio segmentation shows competitive results using a combination of perceptual and musical features as input to a neural network. Wav2vec representations have been successfully applied to several speech processing applications. In this study, they are considered for the multiclass audio segmentation task presented in the Albayzín 2010 evaluation. We compare the use of different representations obtained through unsupervised learning with our previous results on this database using a traditional set of features under different conditions. Experimental results show that wav2vec representations can improve the performance of audio segmentation systems for classes containing speech, while showing a degradation in the segmentation of isolated music. This trend is consistent across all experiments conducted. On average, the use of unsupervised representation learning leads to a relative improvement close to 6.8% on the segmentation task.
Pablo Gimeno, Alfonso Ortega, Antonio Miguel and Eduardo Lleida
09:40  – 10:00
Respiratory Sound Classification Using an Attention LSTM Model with Mixup Data Augmentation
Auscultation is the most common method for the diagnosis of respiratory diseases, although it depends largely on the physician's ability. In order to alleviate this drawback, in this paper we present an automatic system capable of distinguishing between different types of lung sounds (neutral, wheeze, crackle) in patients' respiratory recordings. In particular, the proposed system is based on Long Short-Term Memory (LSTM) networks fed with log-mel spectrograms, on which several improvements have been developed. Firstly, the frequency bands that contain more useful information have been experimentally determined in order to enhance the input acoustic features. Secondly, an attention mechanism has been incorporated into the LSTM model in order to emphasize the audio frames most relevant to the task under consideration. Finally, a Mixup data augmentation technique has been adopted in order to mitigate the problem of data imbalance and improve the sensitivity of the system. The proposed methods have been evaluated over the publicly available ICBHI 2017 dataset, achieving good results in comparison to the baseline.
Noelia Salor-Burdalo and Ascension Gallardo-Antolin
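The Mixup augmentation mentioned in the abstract above has a simple core: pairs of training examples and their labels are blended with a Beta-distributed weight. A minimal NumPy sketch (the function name, `alpha` value, and seeding are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two training examples (e.g. log-mel spectrogram frames) and
    their one-hot labels using a Beta(alpha, alpha)-distributed weight."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)       # interpolation weight in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2    # mixed input features
    y = lam * y1 + (1.0 - lam) * y2    # mixed (soft) label
    return x, y

# toy example: two one-hot-labelled feature vectors
xa, ya = np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0])
xb, yb = np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0])
xm, ym = mixup(xa, ya, xb, yb)
# both outputs are convex combinations, so each still sums to 1
```

Because the mixed labels are soft rather than one-hot, minority classes contribute to more training targets, which is what helps with the class imbalance the abstract mentions.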
10:00  – 10:20
The Vicomtech Spoofing-Aware Biometric System for the SASV Challenge
This paper describes our proposed system for the spoofing-aware speaker verification challenge (SASV Challenge 2022). The system follows an integrated approach that uses speaker verification and antispoofing embeddings extracted from specialized neural networks. Firstly, a shallow neural network, fed with the test utterance's verification and spoofing embeddings, is used to compute a spoof-based score. The final scoring decision is then obtained by combining this score with the cosine similarity between speaker verification embeddings. The integration network was trained using a one-class loss to discriminate between target and unauthorized trials. Our proposed system is evaluated over the ASVspoof19 database and shows competitive performance compared to other integration approaches. In addition, we compare our approach with further state-of-the-art speaker verification and antispoofing systems based on self-supervised learning, yielding high-performance speech biometric systems comparable with the best challenge submissions.
Juan Manuel Martín-Doñas, Iván González Torre, Aitor Álvarez and Joaquin Arellano
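The scoring scheme the abstract describes, a spoof-based score combined with the cosine similarity between speaker embeddings, can be sketched as follows. The weighted-sum fusion and the weight `w` are simplifying assumptions for illustration; in the paper the spoof-based score comes from a trained shallow integration network, not a fixed rule:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sasv_score(enrol_emb, test_emb, spoof_score, w=0.5):
    """Fuse the speaker-verification similarity with a spoof-based score.
    `spoof_score` stands in for the output of the shallow integration
    network; the linear fusion and `w` are illustrative assumptions."""
    sv_score = cosine_similarity(enrol_emb, test_emb)
    return w * sv_score + (1.0 - w) * spoof_score

# identical speaker embeddings plus a high bona-fide score -> high SASV score
enrol = np.array([0.6, 0.8])
test = np.array([0.6, 0.8])
score = sasv_score(enrol, test, spoof_score=0.8)  # cosine = 1.0, score = 0.9
```

The point of the integrated design is that either a low speaker similarity or a low bona-fide (anti-spoofing) score pulls the fused score down, rejecting both zero-effort impostors and spoofed trials.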
10:20  – 10:40
VoxCeleb-PT – a dataset for a speech processing course
This paper introduces VoxCeleb-PT, a small dataset of voices of Portuguese celebrities that can be used as a language-specific extension of the widely used VoxCeleb corpus. Besides introducing the corpus, we also describe three lab assignments where it was used in a one-semester speech processing course: age regression, speaker verification and speech recognition, hoping to highlight the relevance of this dataset as a pedagogical tool. Additionally, this paper confirms the overall limitations of current systems when evaluated in different languages and acoustic conditions: we found an overall degradation of performance on all of the proposed tasks.
John Mendonca and Isabel Trancoso


Keynote 2
Tuesday, 15 November 2022 (11:00-12:00)

11:00  – 12:00
Disease biomarkers in speech
Speech encodes information about a plethora of diseases, which go beyond the so-called speech and language disorders, and include neurodegenerative diseases, such as Parkinson's, Alzheimer's, and Huntington's disease; mood and anxiety-related diseases, such as depression and bipolar disorder; and diseases that concern respiratory organs, such as the common cold or obstructive sleep apnea. This talk addresses the potential of speech as a health biomarker which allows a non-invasive route to early diagnosis and monitoring of a range of conditions related to human physiology and cognition. The talk will also address the many challenges that lie ahead, namely in the context of an ageing population with frequent multimorbidity, and the need to build robust models that provide explanations compatible with clinical reasoning. That would be a major step towards a future where collecting speech samples for health screening may become as common as a blood test is nowadays. Speech can indeed encode health information on par with many other characteristics that make it viewed as Personally Identifiable Information. The last part of this talk will briefly discuss the privacy issues that this enormous potential may entail.
Isabel Trancoso


Posters 2: Special Sessions
Tuesday, 15 November 2022 (12:00-13:30)
Chair: Inma Hernáez

Ph.D. Thesis

12:00 – 13:30
Representation and Metric Learning Advances for Deep Neural Network Face and Speaker Biometric Systems
Nowadays, the use of technological devices and face and speaker biometric recognition systems is becoming increasingly common in people's daily lives. This fact has motivated a great deal of research interest in the development of effective and robust systems. However, although face and voice recognition systems are mature technologies, there are still some challenges which need further improvement and continued research when Deep Neural Networks (DNNs) are employed in these systems. In this manuscript, we present an overview of the main findings of Victoria Mingote's Thesis, where different approaches to address these issues are proposed. The advances presented are focused on two streams of research. First, in the representation learning part, we propose several approaches to obtain robust representations of the signals for text-dependent speaker verification systems. Second, in the metric learning part, we focus on introducing new loss functions to train DNNs directly to optimize the goal task for text-dependent speaker, language and face verification, and also for multimodal diarization.
Victoria Mingote and Antonio Miguel
12:00 – 13:30
Voice Biometric Systems based on Deep Neural Networks: A Ph.D. Thesis Overview
Voice biometric systems based on automatic speaker verification (ASV) are exposed to spoofing attacks which may compromise their security. To increase the robustness against such attacks, anti-spoofing systems have been proposed for the detection of replay, synthesis and voice conversion based attacks. This paper summarizes the work carried out for the first author's PhD Thesis, which focused on the development of robust biometric systems which are able to detect zero-effort, spoofing and adversarial attacks. First, we propose a gated recurrent convolutional neural network (GRCNN) for detecting both logical and physical access spoofing attacks. Second, we propose a new loss function for training neural network classifiers based on a probabilistic framework known as kernel density estimation (KDE). Third, we propose a top-performing integration of ASV and anti-spoofing systems with a new loss function which tries to optimize the whole voice biometric system over an expected range of operating points. Finally, we propose a generative adversarial network (GAN) for generating adversarial spoofing attacks in order to use them as a defense for building more robust voice biometric systems. Experimental results show that the proposed techniques outperform many other state-of-the-art systems trained and evaluated in the same conditions with standard public datasets.
Alejandro Gomez-Alanis, Jose Andres Gonzalez-Lopez and Antonio Miguel Peinado Herreros
12:00 – 13:30
Online Multichannel Speech Enhancement combining Statistical Signal Processing and Deep Neural Networks: A Ph.D. Thesis Overview
Speech-related applications on mobile devices require high-performance speech enhancement algorithms to tackle challenging, noisy real-world environments. In addition, current mobile devices often embed several microphones, allowing them to exploit spatial information. The main goal of this Thesis is the development of online multichannel speech enhancement algorithms for speech services in mobile devices. The proposed techniques use multichannel signal processing to increase the noise reduction performance without degrading the quality of the speech signal. Moreover, deep neural networks are applied in specific parts of the algorithm where modeling by classical methods would be, otherwise, unfeasible or very limiting. Our contributions focus on different noisy environments where these mobile speech technologies can be applied. These include dual-microphone smartphones in noisy and reverberant environments and general multi-microphone devices for speech enhancement and target source separation. Moreover, we study the training of deep learning methods for speech processing using perceptual considerations. Our contributions successfully integrate signal processing and deep learning methods to exploit spectral, spatial, and temporal speech features jointly. As a result, the proposed techniques provide us with a manifold framework for robust speech processing under very challenging acoustic environments, thus allowing us to improve perceptual quality and intelligibility measures.
Juan Manuel Martín-Doñas, Antonio M. Peinado and Angel M. Gomez


Research and Development Projects

12:00 – 13:30
ReSSInt project: voice restoration using Silent Speech Interfaces
ReSSInt is a project funded by the Spanish Ministry of Science and Innovation aiming at investigating the use of Silent Speech Interfaces (SSIs) for restoring communication to individuals who have been deprived of the ability to speak. These interfaces capture non-acoustic biosignals generated during the speech production process and use them to predict the intended message. In the project, two different biosignals are being investigated: electromyography (EMG) signals representing the electrical activity driving the facial muscles, and intracranial electroencephalography (iEEG) neural signals captured by means of invasive electrodes implanted in the brain. From the whole spectrum of speech disorders which may affect a person's voice, ReSSInt will address two particular conditions: (i) voice loss after total laryngectomy and (ii) neurodegenerative diseases and other traumatic injuries which may leave an individual paralyzed and, eventually, unable to speak. In this paper we describe the current status of the project as well as the problems and difficulties encountered in its development.
Inma Hernaez, Jose Andres Gonzalez Lopez, Eva Navas, Jose Luis Pérez Córdoba, Ibon Saratxaga, Gonzalo Olivares, Jon Sanchez de la Fuente, Alberto Galdón, Victor Garcia, Jesús del Castillo, Inge Salomons and Eder del Blanco Sierra
12:00 – 13:30
ELE Project: an overview of the desk research
This paper provides an overview of the European Language Equality (ELE) project. The main objective of ELE is to prepare the European Language Equality program in the form of a strategic research and innovation agenda that may be utilized as a road map for achieving full digital language equality in Europe by 2030. The desk research phase of ELE concentrated on the systematic collection and analysis of the existing international, national, and regional strategic research agendas, studies, reports, and initiatives related to language technology and LT-related artificial intelligence. A brief survey of the findings is presented here, with a special focus on the Spanish ecosystem.
Itziar Aldabe, Aritz Farwell, Eva Navas, Inma Hernaez, German Rigau
12:00 – 13:30
Snorble: An Interactive Children Companion
This paper presents an interactive companion called Snorble, created to engage with children and promote the development of healthy habits under the Snorble project.
Snorble is a smart companion capable of having a conversation with children, playing games, and helping them to go to sleep, all made possible thanks to speech recognition.
Mike Rizkalla, Thomas Chan, Emilio Granell, Chara Tsoukala, Aitor Carricondo, Carlos Bailon, María Teresa González and Vicent Alabau
12:00 – 13:30
Fusion of Classical Digital Signal Processing and Deep Learning methods (FTCAPPS)
The use of deep learning approaches in Signal Processing is finally showing a trend towards a rational use. After an effervescent period where research activity seemed to focus on seeking old problems to apply solutions entirely based on neural networks, we have reached a more mature stage where integrative approaches are on the rise. These approaches gather the best from each paradigm: on the one hand, the knowledge and elegance of classical signal processing and, on the other, the great ability to model and learn from data which is inherent to deep learning methods. In this project we aim towards a new signal processing paradigm where classical and deep learning techniques not only collaborate, but fuse themselves. In particular, we focus on two objectives: 1) the development of deep learning architectures based on or inspired by signal processing schemes, and 2) the improvement of current deep learning training methods by means of classical techniques and algorithms, particularly by exploiting the legacy of knowledge they treasure. These innovations will be applied to two socially and scientifically relevant topics in which our research group has been working for years. The first one is the enhancement of speech signals acquired under adverse acoustic conditions (e.g., noise, reverberation, other speakers, etc.). The second one is the development of anti-fraud measures for biometric voice authentication, in which banking corporations and other large companies are strongly interested.
Angel M. Gómez, Victoria E. Sanchez, Antonio M. Peinado, Juan M. Martín-Doñas, Alejandro Gómez-Alanis, Amelia Villegas-Morcillo, Eros Rosello, Manuel Chica, Celia García and Ivan López-Espejo
12:00 – 13:30
Spanish Lipreading in Realistic Scenarios: the LLEER project
Automatic speech recognition has usually been performed using only the audio data, but speech communication is affected as well by other non-audio sources, mainly visual cues. Visual information includes body expression, face expression, and lip movements, among others. Lip reading, also known as Visual Speech Recognition, aims at decoding speech by only using the image of the lip movements. Current approaches for automatic lip reading follow the same lines as for speech processing: use of massive data for training deep learning models that allow speech recognition to be performed. However, most of the datasets and models are devoted to languages such as English or Chinese, while other languages, particularly Spanish, are underrepresented. The LLEER (Lectura de Labios en Español en Escenarios Realistas) project aims at the acquisition of large-scale visual corpora for Spanish lip reading, the development of visual processing techniques that allow extracting important information for the task, the implementation of models for automatic lip reading, and the integration with speech recognition models for audiovisual speech recognition.
Carlos David Martinez Hinarejos, David Gimeno-Gomez, Francisco Casacuberta, Emilio Granell, Roberto Paredes, Moisés Pastor and Enrique Vidal
12:00 – 13:30
Clinical Applications of Neuroscience: Locating Language Areas in Epileptic Patients and Restoring Speech in Paralyzed People
The goal of this project is to study the neurological bases of language using intracranial electroencephalography (iEEG) signals recorded from drug-resistant epilepsy patients. In particular, we aim to address two current clinical challenges. Firstly, we intend to individually identify the brain regions involved in the production and understanding of language, in order to preserve these regions during brain surgery for epilepsy treatment. Secondly, this project also aims to develop novel pattern recognition algorithms that can decode speech from iEEG signals obtained from participants performing language production tasks. The ultimate goal is to evaluate the feasibility of a neuroprosthetic device that could restore oral communication in persons who cannot speak following a neurodegenerative disease or brain damage. For both goals, a series of experimental tasks will be developed in order to thoroughly evaluate language production and comprehension. Furthermore, data derived from these tasks will be analyzed using state-of-the-art multivariate statistical methods and machine learning techniques (e.g., deep learning). In addition to having a social impact, the results of this project will also help in advancing the knowledge about the neural substrates that underpin language production and comprehension.
Jose Andres Gonzalez Lopez, Alberto Galdón, Gonzalo Olivares, Sneha Raman, David Muñoz, Daniela Paolieri, Pedro Macizo, José L. Pérez-Córdoba, Antonio M. Peinado, Angel Gomez, Victoria E. Sanchez and Ana B. Chica
12:00 – 13:30
ORKESTA: Comprehensive Solution for the Orchestration of Services and Socio-Sanitary Care at Home
In this paper we present the main goals of the ORKESTA project. This is an industrial project carried out by a consortium of companies aimed at providing products and services that contribute to improving the wellbeing of older adults and extending their years of independent life. To this end, the consortium collaborates with the Vicomtech Technological Center and the Speech Interactive Research Group at the UPV/EHU, both of which provide speech and language technologies to the project.
Juan Alos, Julien Boullié, M. Inés Torres, Eneko Ruiz, Andoni Beristain, Jacobo López Fernández, Iñaki Tellería, Janeth Carolina Carreño, Iker Garay, Arkaitz Carbajo, Amaia Santamaría, Urtzi Zubiate, Jon Ander Arzallus, Francisco Martínez and Adriana Martínez
12:00 – 13:30
The CITA GO-ON trial: A person-centered, digital, intergenerational, and cost-effective dementia prevention multi-modal intervention model to guide strategic policies facing the demographic challenges of progressive aging
This paper presents a general overview of the CITA GO-ON study, a controlled and randomized trial aimed at demonstrating the efficacy and cost-effectiveness of a 2-year multi-modal intervention to control risk factors and change lifestyles in cognitively frail people at increased risk of dementia. In this framework, the applicability of a virtual agent to increase adherence and effectiveness (the "GO-ON digital coach") will be explored. The multidisciplinary nature of the study brings together 7 partners including non-profit organizations, universities, technological centers and companies.
Mikel Tainta, Javier Mikel Olaso, M. Inés Torres, Mirian Ecay-Torres, Nekane Balluerka, Naia Ros, Mikel Izquierdo, Mikel Saéz de Asteasu, Usune Etxebarria, Lucía Gayoso, Maider Mateo, Oliver Ibarrondo, Elena Alberdi, Estíbaliz Capetillo-Zárate, Jesus Angel Bravo and Pablo Martínez-Lage
12:00 – 13:30
The BioVoz Project: Secure Speech Biometrics by Deep Processing Techniques
Currently, voice biometrics systems are attracting growing interest driven by the need for new authentication modalities. The BioVoz project focuses on the reliability of these systems, threatened by various types of attacks, from a simple playback of prerecorded speech to more sophisticated variants such as impersonation based on voice conversion or synthesis. One problem in detecting spoofed speech is the lack of suitable models based on classical signal processing techniques. Therefore, the current trend is based on the use of deep neural networks, either for direct attack detection, or for obtaining deep feature vectors to represent the audio signals. However, these solutions raise many questions that are still unanswered and are the subject of the research proposed here. These include what spectral or temporal information should be used to feed the network, how to compensate for the effect of acoustic noise, what network architecture is appropriate, or what methodology should be used for training in order to provide the network with discriminative generalization capabilities. The present project focuses on the search for solutions to the aforementioned problems without forgetting a fundamental and so far little-studied issue: the integration of fraud detection into the whole biometrics system.
Antonio M. Peinado, Alejandro Gomez-Alanis, Jose Andres Gonzalez-Lopez, Angel M. Gomez, Eros Rosello, Manuel Chica-Villar, Jose C. Sanchez-Valera, Jose L. Perez-Cordoba and Victoria Sanchez
12:00 – 13:30
Automatic evaluation of the pronunciation of people with Down syndrome in an educational video game (EvaProDown)
The deficiencies in oral communication of people with Down syndrome (DS) represent an important barrier towards their social integration. Interventions based on performing speech and language therapy exercises have proven to be effective in improving their communication skills. Our research group has been involved in the development of a serious video game for the practice of oral communication of people with Down syndrome. The video game has proven its usefulness by being able to motivate users to carry out practical exercises designed to improve their communication skills related to prosody, an important aspect of spoken communication. The video game has also facilitated the compilation of a speech corpus called Prautocal with a large number of utterances of people with DS. The objective of this project is to extend the functionality of the video game to include exercises focused on pronunciation and on improving articulation and speech intelligibility. To do this, an automatic pronunciation assessment module will be developed and incorporated into the existing video game in order to complement its functionality. In this way, using the video game, users will be able to perform exercises autonomously to work on aspects of speech related to both pronunciation and prosody.
César González-Ferreras, Valentín Cardeñoso-Payo, David Escudero, Carlos Vivaracho-Pascual, Lourdes Aguilar, Valle Flores-Lucas and Mario Corrales Astorgano



12:00 – 13:30
SONOC Platform for Audio and Speech Analytics in Call Centers
This paper presents a platform for processing audio data from call centers to obtain statistical information on telephone calls. The system computes several metrics of the audio and speech to define a representation of the call flow, the audio quality, and the paralinguistic performance of the call. This way, it can model the behavior and feelings of the agent and customer involved in the conversation. This solution applies to many settings such as call centers, social communities, the metaverse, customer identification, online and offline meetings, etc. In summary, the platform leverages an already trained artificial intelligence business network to extract non-verbal communication information from audio. This information translates into valuable business insights for further decision-making.
Dayana Ribas, Antonio Miguel, Luis Guillen, Jose Javier Castejon, Juan Antonio Navarro, Alfonso Ortega and Luis Benavente



12:00 – 13:30
ELSA Speak
In 2015, Xavier Anguera and Vu Van co-founded ELSA (English Language Speech Assistant), an app (and AI technology) to help learners of English improve their pronunciation skills. Fast forward to 2022: the company has grown to more than 100 employees, with offices in the US, Portugal, India and Vietnam. Our application (ELSA Speak) has been downloaded over 20M times and we are serving users from over 100 countries, who speak to the app and get feedback in real time.
Moving from research to a startup environment and to a product requires a mindset change in some areas (e.g. you need to always be razor-focused on what you spend time on) and is very similar in others (e.g. long hours of work, and you need to be very resilient when things look bad). In this session we will share some of the lessons we learned on our particular journey.
Xavier Anguera
12:00 – 13:30
Monoceros Labs: From Voice Applications To Voice Synthesis In The Spanish Market
Creating a company in speech technologies in Spain and applying lessons from research on dialogue systems at the University of Granada from 2009 to 2013 was not intended at first. In 2018, voice assistants landed in Spain, allowing us to extend them by creating voice and multimodal applications. We worked with users (from kids to older adults) and companies from different sectors (insurance, media) to expand their content and services to Amazon Alexa. Our motivation is breaking down barriers between technology and people by using advances in the speech technology area. Voice is natural, efficient and accessible in many contexts. Our focus on people led us to learn the nuances of their needs, test in real scenarios, and launch to the market as soon as possible. After a few years, we found that the synthetic voices currently available in Spanish were not creating the best experiences we aimed for in some use cases. We started working on Spanish neural TTS to close the gap between the state of the art and the market. We are currently building our TTS platform, while working with companies and content creators to validate and learn from the possible uses of TTS and their impact and benefits, which range from content accessibility to scalability.
Nieves Abalos and Carlos Muñoz-Romero


Oral 4: Affective Computing and Applications
Tuesday, 15 November 2022 (15:00-17:00)
Chair: Carmen Peláez Moreno

15:00  – 15:20
Cross-Corpus Speech Emotion Recognition with HuBERT Self-Supervised Representation
Speech Emotion Recognition (SER) is a task related to many applications in the framework of human-machine interaction. However, the lack of suitable emotional speech datasets compromises the performance of SER systems. Large amounts of labeled data are required to accomplish successful training, especially for current Deep Neural Network (DNN)-based solutions. Previous works have explored different strategies for extending the training set using some of the available emotional speech corpora. In this paper, we evaluate the impact on performance of cross-corpus training as a data augmentation strategy for spectral representations and the recent Self-Supervised (SS) representation of HuBERT in an SER system. Experimental results show improvements in the accuracy of SER on the IEMOCAP dataset when extending the training set with two other datasets, EmoDB in German and RAVDESS in English.
Miguel Pastor, Dayana Ribas, Alfonso Ortega, Antonio Miguel and Eduardo Lleida
15:20  – 15:40
Analysis of Trustworthiness Recognition models from an aural and emotional perspective
Trustworthiness and deception recognition attract the research community's attention due to their relevant role in social negotiations and other relevant areas. Despite the increasing interest in the field, there are still many questions about how to perform automatic deception detection or which features better explain how people perceive trustworthiness. Previous studies have demonstrated that emotions and sentiments correlate with deception. However, not many articles have employed deep-learning models pre-trained on emotion recognition tasks to predict trustworthiness. For this reason, this paper compares traditional statistical functional feature sets proposed for emotion recognition, such as eGeMAPS, with features extracted from deep-learning models, like AlexNet, CNN-14 or xlsr-Wav2Vec2.0, pre-trained on emotion recognition tasks. After obtaining each set of features, we train a Support Vector Machine (SVM) model on deception detection. These experiments provide a baseline to understand how methodologies exploited in emotion recognition tasks could be applied to speech trustworthiness recognition. Utilizing the eGeMAPS feature set on deception detection achieved an accuracy of 65.98% at turn level, and employing transfer learning on the embeddings extracted from a pre-trained xlsr-Wav2Vec2.0 improved this rate to 68.11%, surpassing the audio-modality baseline from previous works by 8.5%.
Cristina Luna Jiménez, Ricardo Kleinlein, Syaheerah Lebai Lutfi, Juan M. Montero and Fernando Fernández-Martínez
15:40  – 16:00
Speech and Text Processing for Major Depressive Disorder Detection
Major Depressive Disorder (MDD) is a common mental health issue these days. Its early diagnosis is vital to avoid more serious consequences and to provide appropriate treatment. Speech and transcriptions of patients' interviews are useful information sources for the automatic screening of MDD. In this sense, speech- and text-based systems are proposed in this paper, using the DAIC-WOZ dataset as the experimental framework. The speech-based system is a Sequence-to-Sequence (S2S) model with a local attention mechanism. The text-based one is based on GloVe features and a Convolutional Neural Network as classifier. Some of the more relevant results achieved by other research publications on DAIC-WOZ are described as well, to provide a better understanding of the context of our systems' results. In general, the S2S architecture mostly provides better results than previous speech-based systems. The GloVe-CNN system shows even better performance, suggesting that text is a more suitable information source for the detection of MDD when transcriptions are produced manually. However, automatically obtaining high-quality transcriptions is not a straightforward task, which makes the development of effective speech-based systems, such as the one presented in this research work, necessary.
Edward L. Campbell, Laura Docío Fernández, Nicholas Cummins and Carmen García Mateo
16:00  – 16:20
Bridging the Semantic Gap with Affective Acoustic Scene Analysis: an Information Retrieval-based Approach (abs
Human emotions induce physiological and physical changes in the body and can ultimately influence our actions. Their study belongs to the field of Affective Computing, which aims to improve human-computer interaction. Defining an 'affective acoustic scene' as an acoustic environment that can induce specific emotions, in this work we aim to characterize acoustic scenes that elicit affective states in terms of the acoustic events occurring and the available acoustic information. This is achieved by generating emotion embeddings that define the 'affective acoustic fingerprint' of such affective acoustic scenes. We use YAMNet, an acoustic event classifier trained on AudioSet, to classify acoustic events in the WEMAC audiovisual stimuli dataset. Each video in this dataset is labelled by crowd-sourcing with the categorical emotion it induces. We then determine the relevance of the detected acoustic events for each induced emotion by performing an affective acoustic mapping, creating interpretable acoustic fingerprints of those emotions by means of the well-known information-retrieval TF-IDF algorithm. This paper intends to shed light on the path towards the definition of emotional acoustic embeddings.
Clara Luis-Mingueza, Esther Rituerto-González and Carmen Peláez-Moreno
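The TF-IDF-based affective acoustic mapping described above can be sketched as follows; the emotion and event labels are purely illustrative, not taken from WEMAC or AudioSet:

```python
import math
from collections import Counter

def affective_fingerprints(scene_events):
    """TF-IDF weight of each detected acoustic event, per emotion.

    scene_events: dict mapping an emotion label to the list of acoustic
    event labels detected in the clips that induced that emotion.
    Treats each emotion as a "document" and each event as a "term".
    """
    n_emotions = len(scene_events)
    # Document frequency: in how many emotions does each event appear?
    df = Counter()
    for events in scene_events.values():
        for ev in set(events):
            df[ev] += 1
    fingerprints = {}
    for emotion, events in scene_events.items():
        tf = Counter(events)
        total = len(events)
        fingerprints[emotion] = {
            ev: (count / total) * math.log(n_emotions / df[ev])
            for ev, count in tf.items()
        }
    return fingerprints

scenes = {
    "fear": ["scream", "glass", "wind", "scream"],
    "calm": ["birds", "wind", "water"],
}
fp = affective_fingerprints(scenes)
# "wind" occurs under both emotions, so its IDF (and weight) is 0,
# while "scream" is distinctive of the fear-inducing scenes.
```

Events shared by every emotion get zero weight, so each fingerprint keeps only the events that discriminate that emotion, which is what makes the mapping interpretable.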
16:20  – 16:40
Detecting Gender-based Violence aftereffects from Emotional Speech Paralinguistic Features (abs
Speech is known to provide information about the person speaking, such as their gender, identity, emotions, and even disorders or trauma. In this paper we aim to answer the following question: can women who have suffered gender-based violence (GBV) be distinguished from those who have not, using only speech paralinguistic cues? We intend to demonstrate whether there exist measurable differences between the emotional expression in the voice of GBV victims (GBVV) and non-victims (Non-GBVV). The present study was carried out in the framework of the EMPATIA-CM project, whose aim is to understand the reaction of GBVV to dangerous situations and to develop automatic mechanisms to protect them. For this purpose, we use data collected, and partly published, in the WEMAC Database, a multimodal database containing physiological and speech data from women who have and have not suffered GBV while watching different emotion-eliciting video clips. The analysis performed shows that such differences do exist and, therefore, that suffering GBV alters the way women react to the same emotion-eliciting stimulus in terms of physical variables, specifically certain voice features.
Emma Reyner Fuentes, Esther Rituerto González, Clara Luis Mingueza, Carmen Peláez Moreno and Celia López Ongil
16:40  – 17:00
Extraction of structural and semantic features for the identification of Psychosis in European Portuguese (abs
Psychosis is a brain condition that affects the subject and the way it perceives the world around, impairing its cognitive and speech capabilities, and creating a disconnection from reality in which the subject is inserted. Psychosis lacks formal and precise diagnostic tools, relying on self-reports from patients, their families, and specialized clinicians. Previous studies have focused on the identification and prediction of psychosis through surface-level analysis of diagnosed patients targeting audio, time, and paucity features to predict or identify psychosis. More recent studies have started focusing on high-level and complex language analysis such as semantics, structure, and pragmatics. Only a reduced number of studies have targeted the Portuguese language. Currently, no study has targeted structural or semantic features in European Portuguese, thus this is our objective. The results obtained
through our work suggest that the use of structural and semantic features, particularly for European Portuguese, holds some power in classifying subjects as diagnosed
with psychosis or not. However, further research is required to identify possible improvements to the techniques employed and to concretely identify which particular
features hold the most power during the classification tasks.
Rodrigo Sousa, Helena Sofia Pinto, Alberto Abad, Daniel Neto and Joaquim Gago


Albayzin Evaluations
Tuesday, 15 November 2022 (17:20 – 19:20)

17:20  – 19:20
The Vicomtech-UPM Speech Transcription Systems for the Albayzín-RTVE 2022 Speech to Text Transcription Challenge (abs
This paper describes the Vicomtech-UPM submission to the Albayzín-RTVE 2022 Speech to Text Transcription Challenge, which calls for automatic speech transcription systems to be evaluated on realistic TV shows. A total of 4 systems were built and presented to the evaluation challenge: a primary system alongside three contrastive systems. Each system was built on top of a different architecture, with the aim of testing several state-of-the-art modelling approaches based on different learning techniques and types of neural networks. The primary system used the self-supervised Wav2vec2.0 model as the pre-trained model of the transcription engine. This model was fine-tuned with in-domain labelled data, and the initial hypotheses were re-scored with a pruned 4-gram language model. The first contrastive system corresponds to a pruned RNN-Transducer model, composed of a Conformer encoder and a stateless prediction network using BPE word-pieces as output symbols. As the second contrastive system, we built a Multistream-CNN acoustic-model-based system with a non-pruned 3-gram model for decoding and an RNN-based language model for rescoring the initial lattices. Finally, results obtained with the publicly available Large model of the recently published Whisper engine were also presented as the third contrastive system, with the aim of serving as a reference benchmark for other engines. Along with the description of the systems, the results obtained by each engine on the Albayzín-RTVE 2020 and 2022 test sets are presented as well.
Haritz Arzelus, Iván G. Torres, Juan Manuel Martín-Doñas, Ander González-Docasal and Aitor Alvarez
17:20  – 19:20
TID Spanish ASR system for the Albayzin 2022 Speech-to-Text Transcription Challenge (abs
This paper describes Telefónica I+D's participation in the IberSPEECH-RTVE 2022 Speech-to-Text Transcription Challenge. We built an acoustic end-to-end Automatic Speech Recognition (ASR) system based on the large XLS-R architecture. We first trained it with already-aligned data from CommonVoice. We then adapted it to the TV broadcasting domain with a self-supervised method. For that purpose, we used an iterative pseudo-forced alignment algorithm fed with frame-wise character posteriors produced by our ASR. This allowed us to recover up to 166 hours from the RTVE2018 and RTVE2022 databases. We additionally explored using a transformer-based seq2seq translation system as a Language Model (LM) to correct the transcripts of the acoustic ASR. Our best system achieved 24.27% WER on the test split of RTVE2020.
Fernando López and Jordi Luque
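Systems in these transcription challenges are compared by word error rate (WER), such as the 24.27% reported above: the word-level Levenshtein distance between reference and hypothesis, normalised by the reference length. A minimal sketch of the metric (not the official scoring tool) could look like:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One insertion against a 3-word reference -> WER = 1/3
print(wer("el gato come", "el gato que come"))
```

Official evaluations additionally normalise text (casing, punctuation, numbers) before scoring, which this sketch omits.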
17:20  – 19:20
BCN2BRNO: ASR System Fusion for Albayzin 2022 Speech to Text Challenge (abs
This paper describes the joint effort of BUT and Telefónica Research on the development of Automatic Speech Recognition systems for the Albayzin 2022 Challenge. We train and evaluate both hybrid systems and those based on end-to-end models. We also investigate the use of self-supervised learning speech representations from pre-trained models and their impact on ASR performance (as opposed to training models directly from scratch). Additionally, we apply the Whisper model in a zero-shot fashion, postprocessing its output to fit the required transcription format. On top of tuning the model architectures and overall training schemes, we improve the robustness of our models by augmenting the training data with noises extracted from the target domain. Moreover, we apply rescoring with an external LM on top of N-best hypotheses to adjust each sentence score and pick the single best hypothesis. All these efforts lead to a significant WER reduction. Our single best system and the fusion of selected systems achieved 16.3% and 13.7% WER respectively on the RTVE2020 test partition, i.e. the official evaluation partition from the previous Albayzin challenge.
Martin Kocour, Jahnavi Umesh, Martin Karafiat, Ján Švec, Fernando López, Jordi Luque, Karel Beneš, Mireia Diez, Igor Szoke, Karel Veselý, Lukáš Burget and Jan Černocký
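The N-best rescoring step described above, reweighting each hypothesis with an external language model before picking the single best one, can be sketched as follows; the interpolation weight and the toy vocabulary LM are illustrative stand-ins, not the authors' actual models:

```python
def rescore_nbest(nbest, lm_score, lm_weight=0.5):
    """Pick the best hypothesis after interpolating acoustic and LM scores.

    nbest: list of (hypothesis, acoustic_log_score) pairs.
    lm_score: function returning a log-score for a hypothesis string.
    lm_weight: interpolation weight for the external LM (a tunable value).
    """
    rescored = [(hyp, am + lm_weight * lm_score(hyp)) for hyp, am in nbest]
    return max(rescored, key=lambda pair: pair[1])[0]

# Toy LM: penalise words outside a tiny vocabulary (illustrative only).
vocab = {"buenos", "dias"}
toy_lm = lambda hyp: -1.0 * sum(w not in vocab for w in hyp.split())

# The acoustically better hypothesis loses after LM rescoring,
# because its split tokens look implausible to the LM.
best = rescore_nbest([("buenos dias", -4.0), ("buenos di as", -3.9)], toy_lm)
```

In practice the LM score comes from an n-gram or neural LM and the weight is tuned on a development set, but the selection logic is the same.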
17:20  – 19:20
BUT System for Albayzin 2022 Text and Speech Alignment Challenge     Withdrawn
Martin Kocour, Jahnavi Umesh, Martin Karafiat, Igor Szoke, Karel Beneš and Jan Černocký
17:20  – 19:20
Intelligent Voice Speaker Recognition and Diarization System for IberSpeech 2022 Albayzin Evaluations Speaker Diarization and Identity Assignment Challenge (abs
This paper describes the system developed by Intelligent Voice for the IberSpeech 2022 Albayzin Evaluations Speaker Diarization and Identity Assignment Challenge (SDIAC). The presented Variational Bayes x-vector Voice Print Extraction (VBxVPE) system is capable of capturing vocal variations using multiple x-vector representations with two-stage clustering and outlier-detection refinement, and implements a Deep-Encoder Convolutional Autoencoder Denoiser (DE-CADE) network for denoising segments containing noise and music, enabling robust speaker recognition and diarization. When evaluated on the Radiotelevisión Española (RTVE) 2022 evaluation dataset, the system obtained a Diarization Error Rate (DER) of ..% and an Error Rate of ..%.
Roman Shrestha, Cornelius Glackin, Julie Wall and Nigel Cannings
17:20  – 19:20
ViVoLAB System Description for the S2TC IberSPEECH-RTVE 2022 challenge (abs
In this paper we describe the ViVoLAB system for the IberSPEECH-RTVE 2022 Speech to Text Transcription Challenge. The system is a combination of several subsystems designed to perform a full subtitle edition process, from the raw audio to the creation of aligned, transcribed subtitle partitions. The subsystems include a phonetic recognizer, a phonetic subword recognizer, a speaker-aware subtitle partitioner, a sequence-to-sequence translation model working with orthographic tokens to produce the desired transcription, and an optional diarization step using the previously estimated segments. Additionally, we use recurrent-network-based language models to improve results for steps that involve search algorithms, such as the subword decoder and the sequence-to-sequence model. The technologies involved include unsupervised models like WavLM to deal with the raw waveform, as well as convolutional, recurrent, and transformer layers. As a general design pattern, we allow all the systems to access previous outputs or inner information, but choosing successful communication mechanisms has been a difficult process due to the size of the datasets and long training times. The best solution found is described and evaluated on some reference tests of the 2018 and 2020 IberSPEECH-RTVE S2TC.
Antonio Miguel, Alfonso Ortega and Eduardo Lleida
17:20  – 19:20
GTTS Systems for the Albayzin 2022 Speech and Text Alignment Challenge (abs
This paper describes the most relevant features of the alignment approach used by our research group (GTTS) for the Albayzin 2022 Text and Speech Alignment Challenge: Alignment of respoken subtitles (TaSAC-ST). It also presents and analyzes the results obtained by our primary and contrastive systems, focusing on the variability observed in the RTVE broadcasts used for this evaluation. The task is to provide hypothesized start and end times for each subtitle to be aligned. To that end, our systems decode the audio at the phonetic level using acoustic models trained on external (non-RTVE) data, then align the recognized sequence of phones with the phonetic transcription of the corresponding text and transfer the timestamps of the recognized phones to the aligned text. The alignment error for each subtitle is computed as the sum of the absolute values of the start and end alignment errors (with regard to a manually supervised ground truth). The median of the alignment errors (MAE) for each broadcast is reported to compare system performance. Our primary system yielded MAEs between 0.20 and 0.36 seconds on the development set, and between 0.22 and 1.30 seconds on the test set, with average MAEs of 0.295 and 0.395 seconds, respectively.
Germán Bordel, Luis Javier Rodriguez-Fuentes, Mikel Peñagarikano and Amparo Varona
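The evaluation metric defined above, per-subtitle error as the sum of absolute start and end errors, with the median (MAE) reported per broadcast, can be sketched as follows; the timestamps are illustrative, not evaluation data:

```python
import statistics

def subtitle_alignment_error(hyp, ref):
    """Alignment error for one subtitle: |start error| + |end error|.

    hyp, ref: (start, end) times in seconds.
    """
    return abs(hyp[0] - ref[0]) + abs(hyp[1] - ref[1])

def broadcast_mae(hyps, refs):
    """Median of per-subtitle alignment errors for one broadcast."""
    return statistics.median(
        subtitle_alignment_error(h, r) for h, r in zip(hyps, refs)
    )

hyps = [(0.0, 2.1), (2.5, 4.0), (4.6, 6.2)]  # hypothesized times
refs = [(0.1, 2.0), (2.5, 4.2), (4.5, 6.2)]  # ground-truth times
# per-subtitle errors ≈ 0.2, 0.2, 0.1 -> MAE ≈ 0.2 s
mae = broadcast_mae(hyps, refs)
```

Using the median rather than the mean keeps the per-broadcast figure robust to the few grossly misaligned subtitles that long broadcasts tend to contain.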