Tuesday, 15 November
Oral 3: Speech and Audio Processing
Tuesday, 15 November 2022 (9:00-10:40)
Chair: Alfonso Ortega
O3.1 09:00 – 09:20 |
On the potential of jointly-optimised solutions to spoofing attack detection and automatic speaker verification (abs The spoofing-aware speaker verification (SASV) challenge was designed to promote the study of jointly-optimised solutions to accomplish the traditionally separately-optimised tasks of spoofing detection and speaker verification. Jointly-optimised systems have the potential to operate in synergy as a better performing solution to the single task of reliable speaker verification. However, none of the 23 submissions to SASV 2022 are jointly optimised. We have hence sought to determine why separately-optimised sub-systems perform best or why joint optimisation was not successful. Experiments reported in this paper show that joint optimisation is successful in improving robustness to spoofing but that it degrades speaker verification performance. The findings suggest that spoofing detection and speaker verification sub-systems should be optimised jointly in a manner which reflects the differences in how information provided by each sub-system is complementary to that provided by the other. Progress will also likely depend upon the collection of data from a larger number of speakers. ) |
Wanying Ge, Hemlata Tak, Massimiliano Todisco and Nicholas Evans | |
O3.2 09:20 – 09:40 |
A Study on the Use of wav2vec Representations for Multiclass Audio Segmentation (abs This paper presents a study on the use of new unsupervised representations through wav2vec models seeking to jointly model speech and music fragments of audio signals in a multiclass audio segmentation task. Previous studies have already described the capabilities of deep neural networks in binary and multiclass audio segmentation tasks. Particularly, the separation of speech, music and noise signals through audio segmentation shows competitive results using a combination of perceptual and musical features as input to a neural network. Wav2vec representations have been successfully applied to several speech processing applications. In this study, they are considered for the multiclass audio segmentation task presented in the Albayzín 2010 evaluation. We compare the use of different representations obtained through unsupervised learning with our previous results in this database using a traditional set of features under different conditions. Experimental results show that wav2vec representations can improve the performance of audio segmentation systems for classes containing speech, while showing a degradation in the segmentation of isolated music. This trend is consistent among all experiments developed. On average, the use of unsupervised representation learning leads to a relative improvement close to 6.8% on the segmentation task. ) |
Pablo Gimeno, Alfonso Ortega, Antonio Miguel and Eduardo Lleida | |
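The following is a minimal, illustrative sketch (not the authors' exact pipeline) of how frame-level wav2vec 2.0 representations can be extracted as input to a segmentation back-end; the checkpoint name, the HuggingFace tooling and the resampling step are assumptions.

```python
# Hedged sketch: frame-level wav2vec 2.0 features as inputs to an audio segmenter.
# The checkpoint and downstream use are illustrative assumptions, not the paper's setup.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor(sampling_rate=16_000)               # default config
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()   # hypothetical choice

def wav2vec_frames(wav_path: str) -> torch.Tensor:
    """Return a (num_frames, hidden_dim) matrix of contextual representations
    (roughly 50 frames per second for base wav2vec 2.0 models)."""
    wave, sr = torchaudio.load(wav_path)
    wave = torchaudio.functional.resample(wave.mean(dim=0), sr, 16_000)
    inputs = extractor(wave.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, T, D)
    return hidden.squeeze(0)

# These frame vectors would then feed a frame-wise speech/music/noise classifier.
```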
O3.3 09:40 – 10:00 |
Respiratory Sound Classification Using an Attention LSTM Model with Mixup Data Augmentation (abs Auscultation is the most common method for the diagnosis of respiratory diseases, although it depends largely on the physician’s ability. In order to alleviate this drawback, in this paper, we present an automatic system capable of distinguishing between different types of lung sounds (neutral, wheeze, crackle) in patients’ respiratory recordings. In particular, the proposed system is based on Long Short-Term Memory (LSTM) networks fed with log-mel spectrograms, on which several improvements have been developed. Firstly, the frequency bands that contain more useful information have been experimentally determined in order to enhance the input acoustic features. Secondly, an Attention Mechanism has been incorporated into the LSTM model in order to emphasize the audio frames most relevant to the task under consideration. Finally, a Mixup data augmentation technique has been adopted in order to mitigate the problem of data imbalance and improve the sensitivity of the system. The proposed methods have been evaluated over the publicly available ICBHI 2017 dataset, achieving good results in comparison to the baseline. ) |
Noelia Salor-Burdalo and Ascension Gallardo-Antolin | |
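Mixup is a generic augmentation technique (Zhang et al., 2018) of the kind adopted above; a minimal sketch on batches of log-mel spectrograms follows. The Beta parameter and toy shapes are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of mixup data augmentation on batches of log-mel spectrograms.
import numpy as np

def mixup_batch(features: np.ndarray, labels: np.ndarray, alpha: float = 0.2):
    """features: (batch, time, mels); labels: one-hot (batch, classes).
    Returns convex combinations of random example pairs."""
    lam = np.random.beta(alpha, alpha)
    perm = np.random.permutation(len(features))
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_x, mixed_y

# Toy batch: 8 spectrograms, 3 classes (neutral / wheeze / crackle).
x = np.random.randn(8, 250, 64).astype(np.float32)
y = np.eye(3, dtype=np.float32)[np.random.randint(0, 3, size=8)]
x_mix, y_mix = mixup_batch(x, y)
```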
O3.4 10:00 – 10:20 |
The Vicomtech Spoofing-Aware Biometric System for the SASV Challenge (abs This paper describes our proposed system for the spoofing-aware speaker verification challenge (SASV Challenge 2022). The system follows an integrated approach that uses speaker verification and antispoofing embeddings extracted from specialized neural networks. Firstly, a shallow neural network, fed with the test utterance’s verification and spoofing embeddings, is used to compute a spoof-based score. The final scoring decision is then obtained by combining this score with the cosine similarity between speaker verification embeddings. The integration network was trained using a one-class loss to discriminate between target and unauthorized trials. Our proposed system is evaluated over the ASVspoof19 database and shows competitive performance compared to other integration approaches. In addition, we compare our approach with further state-of-the-art speaker verification and antispoofing systems based on self-supervised learning, yielding high-performance speech biometric systems comparable with the best challenge submissions. ) |
Juan Manuel Martín-Doñas, Iván González Torre, Aitor Álvarez and Joaquin Arellano | |
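The abstract above describes combining a spoof-based score with the cosine similarity between speaker embeddings. The sketch below illustrates that style of score combination; the concrete rule (adding a sigmoid-mapped spoof score to the cosine score) and the embedding dimension are assumptions, not necessarily the authors' exact choices.

```python
# Hedged sketch of spoofing-aware verification scoring: a cosine speaker-similarity
# score is combined with a spoof-based score from an auxiliary network.
import numpy as np

def cosine_score(enrol_emb: np.ndarray, test_emb: np.ndarray) -> float:
    return float(np.dot(enrol_emb, test_emb) /
                 (np.linalg.norm(enrol_emb) * np.linalg.norm(test_emb)))

def sasv_score(enrol_emb, test_emb, spoof_logit: float) -> float:
    """Higher means 'target speaker and bona fide speech'. spoof_logit would come from
    a shallow network fed with the test utterance's ASV and anti-spoofing embeddings."""
    p_bonafide = 1.0 / (1.0 + np.exp(-spoof_logit))     # assumed sigmoid mapping
    return cosine_score(enrol_emb, test_emb) + p_bonafide

# Toy trial with random 192-dimensional embeddings.
rng = np.random.default_rng(0)
enrol, test = rng.normal(size=192), rng.normal(size=192)
print(sasv_score(enrol, test, spoof_logit=2.3))
```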
O3.5 10:20 – 10:40 |
VoxCeleb-PT – a dataset for a speech processing course (abs This paper introduces VoxCeleb-PT, a small dataset of voices of Portuguese celebrities that can be used as a language-specific extension of the widely used VoxCeleb corpus. Besides introducing the corpus, we also describe three lab assignments where it was used in a one-semester speech processing course: age regression, speaker verification and speech recognition, hoping to highlight the relevance of this dataset as a pedagogical tool. Additionally, this paper confirms the overall limitations of current systems when evaluated in different languages and acoustic conditions: we found an overall degradation of performance on all of the proposed tasks. ) |
John Mendonca and Isabel Trancoso |
Keynote 2
Tuesday, 15 November 2022 (11:00-12:00)
KN2 11:00 – 12:00 |
Disease biomarkers in speech (abs Speech encodes information about a plethora of diseases, which go beyond the so-called speech and language disorders, and include neurodegenerative diseases, such as Parkinson’s, Alzheimer’s, and Huntington’s disease, mood and anxiety-related diseases, such as Depression and Bipolar Disease, and diseases that concern respiratory organs such as the common Cold, or Obstructive Sleep Apnea. This talk addresses the potential of speech as a health biomarker which allows a non-invasive route to early diagnosis and monitoring of a range of conditions related to human physiology and cognition. The talk will also address the many challenges that lie ahead, namely in the context of an ageing population with frequent multimorbidity, and the need to build robust models that provide explanations compatible with clinical reasoning. That would be a major step towards a future where collecting speech samples for health screening may become as common as a blood test nowadays. Speech can indeed encode health information on a par with many other characteristics that make it viewed as Personal Identifiable Information. The last part of this talk will briefly discuss the privacy issues that this enormous potential may entail. ) |
Isabel Trancoso |
Posters 2: Special Sessions
Tuesday, 15 November 2022 (12:00-13:30)
Chair: Inma Hernáez
Ph.D. Thesis
P2.1 12:00 – 13:30 |
Representation and Metric Learning Advances for Deep Neural Network Face and Speaker Biometric Systems (abs Nowadays, the use of technological devices and face and speaker biometric recognition systems is becoming increasingly common in people’s daily lives. This fact has motivated a great deal of research interest in the development of effective and robust systems. However, although face and voice recognition systems are mature technologies, there are still some challenges which need further improvement and continued research when Deep Neural Networks (DNNs) are employed in these systems. In this manuscript, we present an overview of the main findings of Victoria Mingote’s Thesis, where different approaches to address these issues are proposed. The advances presented are focused on two streams of research. First, in the representation learning part, we propose several approaches to obtain robust representations of the signals for text-dependent speaker verification systems. Second, in the metric learning part, we focus on introducing new loss functions to train DNNs directly to optimize the goal task for text-dependent speaker, language and face verification and also multimodal diarization. ) |
Victoria Mingote and Antonio Miguel | |
P2.2 12:00 – 13:30 |
Voice Biometric Systems based on Deep Neural Networks: A Ph.D. Thesis Overview (abs Voice biometric systems based on automatic speaker verification (ASV) are exposed to spoofing attacks which may compromise their security. To increase the robustness against such attacks, anti-spoofing systems have been proposed for the detection of replay, synthesis and voice conversion based attacks. This paper summarizes the work carried out for the first author’s PhD Thesis, which focused on the development of robust biometric systems which are able to detect zero-effort, spoofing and adversarial attacks. First, we propose a gated recurrent convolutional neural network (GRCNN) for detecting both logical and physical access spoofing attacks. Second, we propose a new loss function for training neural network classifiers based on a probabilistic framework known as kernel density estimation (KDE). Third, we propose a top-performing integration of ASV and anti-spoofing systems with a new loss function which tries to optimize the whole voice biometric system on an expected range of operating points. Finally, we propose a generative adversarial network (GAN) for generating adversarial spoofing attacks in order to use them as a defense for building more robust voice biometric systems. Experimental results show that the proposed techniques outperform many other state-of-the-art systems trained and evaluated in the same conditions with standard public datasets. ) |
Alejandro Gomez-Alanis, Jose Andres Gonzalez-Lopez and Antonio Miguel Peinado Herreros | |
P2.3 12:00 – 13:30 |
Online Multichannel Speech Enhancement combining Statistical Signal Processing and Deep Neural Networks: A Ph.D. Thesis Overview (abs Speech-related applications on mobile devices require high-performance speech enhancement algorithms to tackle challenging, noisy real-world environments. In addition, current mobile devices often embed several microphones, allowing them to exploit spatial information. The main goal of this Thesis is the development of online multichannel speech enhancement algorithms for speech services in mobile devices. The proposed techniques use multichannel signal processing to increase the noise reduction performance without degrading the quality of the speech signal. Moreover, deep neural networks are applied in specific parts of the algorithm where modeling by classical methods would be, otherwise, unfeasible or very limiting. Our contributions focus on different noisy environments where these mobile speech technologies can be applied. These include dual-microphone smartphones in noisy and reverberant environments and general multi-microphone devices for speech enhancement and target source separation. Moreover, we study the training of deep learning methods for speech processing using perceptual considerations. Our contributions successfully integrate signal processing and deep learning methods to exploit spectral, spatial, and temporal speech features jointly. As a result, the proposed techniques provide us with a manifold framework for robust speech processing under very challenging acoustic environments, thus allowing us to improve perceptual quality and intelligibility measures. ) |
Juan Manuel Martín-Doñas, Antonio M. Peinado and Angel M. Gomez |
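As a generic illustration of the kind of multichannel spatial processing this Thesis builds on (not its specific algorithms), the sketch below implements a frequency-domain delay-and-sum beamformer that steers a microphone array towards a known direction; the array size and delays are toy assumptions.

```python
# Generic illustration only: frequency-domain delay-and-sum beamforming.
import numpy as np

def delay_and_sum(stft: np.ndarray, mic_delays: np.ndarray, fs: int, n_fft: int) -> np.ndarray:
    """stft: (mics, freq_bins, frames) complex STFT of the array channels.
    mic_delays: per-microphone delays (seconds) towards the target direction.
    Returns the beamformed single-channel STFT of shape (freq_bins, frames)."""
    n_mics, n_bins, _ = stft.shape
    freqs = np.arange(n_bins) * fs / n_fft                                  # bin frequencies
    steering = np.exp(2j * np.pi * freqs[None, :] * mic_delays[:, None])    # (mics, bins)
    return np.einsum("mf,mft->ft", steering, stft) / n_mics                 # align and average

# Toy example: 4 microphones, 257 frequency bins (n_fft=512 at 16 kHz), 100 frames.
X = np.random.randn(4, 257, 100) + 1j * np.random.randn(4, 257, 100)
delays = np.array([0.0, 1e-4, 2e-4, 3e-4])   # hypothetical steering delays
Y = delay_and_sum(X, delays, fs=16_000, n_fft=512)
```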
Research and Development Projects
P2.4 12:00 – 13:30 |
ReSSInt project: voice restoration using Silent Speech Interfaces (abs ReSSInt is a project funded by the Spanish Ministry of Science and Innovation aiming at investigating the use of Silent Speech Interfaces (SSIs) for restoring communication to individuals who have been deprived of the ability to speak. These interfaces capture non-acoustic biosignals generated during the speech production process and use them to predict the intended message. In the project, two different biosignals are being investigated: electromyography (EMG) signals representing electrical activity driving the facial muscles and intracranial electroencephalography (iEEG) neural signals captured by means of invasive electrodes implanted in the brain. From the whole spectrum of speech disorders which may affect a person’s voice, ReSSInt will address two particular conditions: (i) voice loss after total laryngectomy and (ii) neurodegenerative diseases and other traumatic injuries which may leave an individual paralyzed and, eventually, unable to speak. In this paper we describe the current status of the project as well as the problems and difficulties encountered in its development. ) |
Inma Hernaez, Jose Andres Gonzalez Lopez, Eva Navas, Jose Luis Pérez Córdoba, Ibon Saratxaga, Gonzalo Olivares, Jon Sanchez de la Fuente, Alberto Galdón, Victor Garcia, Jesús del Castillo, Inge Salomons and Eder del Blanco Sierra | |
P2.5 12:00 – 13:30 |
ELE Project: an overview of the desk research (abs This paper provides an overview of the European Language Equality (ELE) project. The main objective of ELE is to prepare the European Language Equality program in the form of a strategic research and innovation agenda that may be utilized as a road map for achieving full digital language equality in Europe by 2030. The desk research phase of ELE concentrated on the systematic collection and analysis of the existing international, national, and regional strategic research agendas, studies, reports, and initiatives related to language technology and LT-related artificial intelligence. A brief survey of the findings is presented here, with a special focus on the Spanish ecosystem. ) |
Itziar Aldabe, Aritz Farwell, Eva Navas, Inma Hernaez, German Rigau | |
P2.6 12:00 – 13:30 |
Snorble: An Interactive Children Companion (abs This paper presents an interactive companion called Snorble, created to engage with children and promote the development of healthy habits under the Snorble project. Snorble is a smart companion capable of having a conversation with children, playing games, and helping them to go to sleep, all made possible thanks to speech recognition. ) |
Mike Rizkalla, Thomas Chan, Emilio Granell, Chara Tsoukala, Aitor Carricondo, Carlos Bailon, María Teresa González and Vicent Alabau | |
P2.7 12:00 – 13:30 |
Fusion of Classical Digital Signal Processing and Deep Learning methods (FTCAPPS) (abs The use of deep learning approaches in Signal Processing is finally showing a trend towards a rational use. After an effervescent period where research activity seemed to focus on seeking old problems to apply solutions entirely based on neural networks, we have reached a more mature stage where integrative approaches are on the rise. These approaches gather the best from each paradigm: on the one hand, the knowledge and elegance of classical signal processing and, on the other, the great ability to model and learn from data which is inherent to deep learning methods. In this project we aim towards a new signal processing paradigm where classical and deep learning techniques not only collaborate, but fuse themselves. In particular, we focus on two objectives: 1) the development of deep learning architectures based on or inspired by signal processing schemes, and 2) the improvement of current deep learning training methods by means of classical techniques and algorithms, particularly, by exploiting the knowledge legacy they treasure. These innovations will be applied to two socially and scientifically relevant topics on which our research group has been working for years. The first one is the enhancement of speech signals acquired under acoustically adverse conditions (e.g., noise, reverberation, other speakers, …). The second one is the development of anti-fraud measures for biometric voice authentication, in which banking corporations and other large companies are strongly interested. ) |
Angel M. Gómez, Victoria E. Sanchez, Antonio M. Peinado, Juan M. Martín-Doñas, Alejandro Gómez-Alanis, Amelia Villegas-Morcillo, Eros Rosello, Manuel Chica, Celia García and Ivan López-Espejo | |
P2.8 12:00 – 13:30 |
Spanish Lipreading in Realistic Scenarios: the LLEER project (abs Automatic speech recognition has usually been performed using only the audio data, but speech communication is affected as well by other non-audio sources, mainly visual cues. Visual information includes body expression, face expression, and lip movements, among others. Lip reading, also known as Visual Speech Recognition, aims at decoding speech by only using the image of the lip movements. Current approaches for automatic lip reading follow the same lines as those for speech processing: use of massive data for training deep learning models that make it possible to perform speech recognition. However, most of the datasets and models are devoted to languages such as English or Chinese, while other languages, particularly Spanish, are underrepresented. The LLEER (Lectura de Labios en Español en Escenarios Realistas) project aims at the acquisition of large-scale visual corpora for Spanish lip reading, the development of visual processing techniques to extract important information for the task, the implementation of models for automatic lip reading, and the integration with speech recognition models for audiovisual speech recognition. ) |
Carlos David Martinez Hinarejos, David Gimeno-Gomez, Francisco Casacuberta, Emilio Granell, Roberto Paredes, Moisés Pastor and Enrique Vidal | |
P2.9 12:00 – 13:30 |
Clinical Applications of Neuroscience: Locating Language Areas in Epileptic Patients and Restoring Speech in Paralyzed People (abs The goal of this project is to study the neurological bases of language using intracranial electroencephalography (iEEG) signals recorded from drug-resistant epilepsy patients. In particular, we aim to address two current clinical challenges. Firstly, we intend to individually identify the brain regions involved in the production and understanding of language, in order to preserve these regions during brain surgery for epilepsy treatment. Secondly, this project also aims to develop novel pattern recognition algorithms that can decode speech from iEEG signals obtained from participants performing language production tasks. The ultimate goal is to evaluate the feasibility of a neuroprosthetic device that could restore oral communication in persons who cannot speak following a neurodegenerative disease or brain damage. For both goals, a series of experimental tasks will be developed in order to thoroughly evaluate language production and comprehension. Furthermore, data derived from these tasks will be analyzed using state-of-the-art multivariate statistical methods and machine learning techniques (e.g., deep learning). In addition to having a social impact, the results of this project will also help in advancing the knowledge about the neural substrates that underpin language production and comprehension. ) |
Jose Andres Gonzalez Lopez, Alberto Galdón, Gonzalo Olivares, Sneha Raman, David Muñoz, Daniela Paolieri, Pedro Macizo, José L. Pérez-Córdoba, Antonio M. Peinado, Angel Gomez, Victoria E. Sanchez and Ana B. Chica | |
P2.10 12:00 – 13:30 |
ORKESTA: Comprehensive Solution for the Orchestration of Services and Socio-Sanitary Care at Home (abs In this paper we present the main goals of the ORKESTA project. This is an industrial project carried out by a consortium of companies aimed at providing products and services that contribute to improving the wellbeing of older adults and extending their years of independent life. To this end, the consortium collaborates with the Vicomtech Technological Center and the Speech Interactive Research Group at the UPV/EHU. Both provide speech and language technologies to the project. ) |
Juan Alos, Julien Boullié, M. Inés Torres, Eneko Ruiz, Andoni Beristain, Jacobo López Fernández, Iñaki Tellería, Janeth Carolina Carreño, Iker Garay, Arkaitz Carbajo, Amaia Santamaría, Urtzi Zubiate, Jon Ander Arzallus, Francisco Martínez and Adriana Martínez | |
P2.11 12:00 – 13:30 |
The CITA GO-ON trial: A person-centered, digital, intergenerational, and cost-effective dementia prevention multi-modal intervention model to guide strategic policies facing the demographic challenges of progressive aging (abs This paper presents a general overview of the CITA GO-ON study, a controlled and randomized trial aimed at demonstrating the efficacy and cost-effectiveness of a 2-year multi-modal intervention to control risk factors and change lifestyles in cognitively frail people at increased risk of dementia. In this framework, the applicability of a virtual agent to increase adherence and effectiveness (the “Go-ON digital coach”) will be explored. The multidisciplinary nature of the study brings together 7 partners including non-profit organizations, universities, technological centers and companies. ) |
Mikel Tainta, Javier Mikel Olaso, M. Inés Torres, Mirian Ecay-Torres, Nekane Balluerka, Naia Ros, Mikel Izquierdo, Mikel Saéz de Asteasu, Usune Etxebarria, Lucía Gayoso, Maider Mateo, Oliver Ibarrondo, Elena Alberdi, Estíbaliz Capetillo-Zárate, Jesus Angel Bravo and Pablo Martínez-Lage | |
P2.12 12:00 – 13:30 |
The BioVoz Project: Secure Speech Biometrics by Deep Processing Techniques (abs Currently, voice biometrics systems are attracting a growing interest driven by the need for new authentication modalities. The BioVoz project focuses on the reliability of these systems, threatened by various types of attacks, from a simple playback of prerecorded speech to more sophisticated variants such as impersonation based on voice conversion or synthesis. One problem in detecting spoofed speech is the lack of suitable models based on classical signal processing techniques. Therefore, the current trend is based on the use of deep neural networks, either for direct attack detection, or for obtaining deep feature vectors to represent the audio signals. However, these solutions raise many questions that are still unanswered and are the subject of the research proposed here. These include what spectral or temporal information should be used to feed the network, how to compensate for the effect of acoustic noise, what network architecture is appropriate, or what methodology should be used for training in order to provide the network with discriminative generalization capabilities. The present project focuses on the search for solutions to the aforementioned problems without forgetting a fundamental and so far little-studied issue: the integration of fraud detection into the whole biometrics system. ) |
Antonio M. Peinado, Alejandro Gomez-Alanis, Jose Andres Gonzalez-Lopez, Angel M. Gomez, Eros Rosello, Manuel Chica-Villar, Jose C. Sanchez-Valera, Jose L. Perez-Cordoba and Victoria Sanchez | |
P2.13 12:00 – 13:30 |
Automatic evaluation of the pronunciation of people with Down syndrome in an educational video game (EvaProDown) (abs The deficiencies in oral communication of people with Down syndrome (DS) represent an important barrier towards their social integration. Interventions based on performing exercises of speech and language therapy have proven to be effective in improving their communication skills. Our research group has been involved in the development of a serious video game for the practice of oral communication of people with Down syndrome. The video game has proven its usefulness by being able to motivate users to carry out practical exercises designed to improve their communication skills related to prosody, an important aspect of spoken communication. The video game has also facilitated the compilation of a speech corpus called Prautocal with a large number of utterances of people with DS. The objective of this project is to extend the functionality of the video game to include exercises focused on pronunciation and on improving articulation and speech intelligibility. To do this, an automatic pronunciation assessment module will be developed and incorporated into the existing video game in order to complement its functionality. In this way, using the video game, users will be able to perform exercises autonomously to work on aspects of speech related to both pronunciation and prosody. ) |
César González-Ferreras, Valentín Cardeñoso-Payo, David Escudero, Carlos Vivaracho-Pascual, Lourdes Aguilar, Valle Flores-Lucas and Mario Corrales Astorgano |
Demos
P2.14 12:00 – 13:30 |
SONOC Platform for Audio and Speech Analytics in Call Centers (abs This paper presents a platform for processing audio data of call centers to obtain statistical information on the telephone call. The system computes several metrics of the audio and speech to define a representation of the call flow, the audio quality, and the paralinguistic performance of the call. This way, it can model the behavior and feelings of the agent and customer involved in the conversation. This solution applies to many industries such as call centers, social communities, metaverse, customer identification, online and offline meetings, etc. In summary, the platform leverages an already trained artificial intelligence business network to get non-verbal communication information from audio. This information translates into valuable business insights for further decision-making. ) |
Dayana Ribas, Antonio Miguel, Luis Guillen, Jose Javier Castejon, Juan Antonio Navarro, Alfonso Ortega and Luis Benavente |
Entrepreneurship
P2.15 12:00 – 13:30 |
ELSA Speak (abs In 2015 Xavier Anguera and Vu Van co-founded ELSA (English Language Speech Assistant), an app (and AI technology) to help learners of English improve their pronunciation skills. Fast forward to 2022: the company has grown to more than 100 employees, with offices in the US, Portugal, India and Vietnam. Our application (ELSA Speak) has been downloaded over 20M times and we are serving users from over 100 countries, who speak to the app and get feedback in real time. Moving from research to a startup environment and to a product requires a mindset change in some areas (e.g. you need to always be razor-focused on what you spend time on) and is very similar in others (e.g. long hours of work, you need to be very resilient when things look bad). In this session we will share some of the lessons we learned on our particular journey. ) |
Xavier Anguera | |
P2.16 12:00 – 13:30 |
Monoceros Labs: From Voice Applications To Voice Synthesis In The Spanish Market (abs Creating a company in speech technologies in Spain and applying learnings from the research on dialogue systems at the University of Granada from 2009 to 2013 was not intended at first. In 2018, voice assistants landed in Spain, allowing us to extend them by creating voice and multimodal applications. We worked with users (from kids to older adults) and companies from different sectors (insurance, media) to expand their content and services to Amazon Alexa. Our motivation is breaking down barriers between technology and people using advances in the speech technology area. Voice is natural, efficient and accessible in many contexts. Our focus on people led us to learn the nuances of their needs, test in real scenarios, and launch to the market as soon as possible. After a few years, we found that the synthetic voices available in Spanish were not creating the best experiences we aimed to offer users in some use cases. We started working on Spanish neural TTS to close the gap between the state of the art and the market. We are currently building our TTS platform while working with companies and content creators to validate and learn from the possible uses of TTS and their impact and benefits, which range from content accessibility to scalability. ) |
Nieves Abalos and Carlos Muñoz-Romero |
Oral 4: Affective Computing and Applications
Tuesday, 15 November 2022 (15:00-17:00)
Chair: Carmen Peláez Moreno
O4.1 15:00 – 15:20 |
Cross-Corpus Speech Emotion Recognition with HuBERT Self-Supervised Representation (abs Speech Emotion Recognition (SER) is a task related to many applications in the framework of human-machine interaction. However, the lack of suitable speech emotional datasets compromises the performance of SER systems. A lot of labeled data are required to accomplish successful training, especially for current Deep Neural Network (DNN)-based solutions. Previous works have explored different strategies for extending the training set using some of the emotional speech corpora available. In this paper, we evaluate the impact on performance of cross-corpus training as a data augmentation strategy for spectral representations and the recent Self-Supervised (SS) representation of HuBERT in an SER system. Experimental results show improvements in the accuracy of SER on the IEMOCAP dataset when extending the training set with two other datasets, EmoDB in German and RAVDESS in English. ) |
Miguel Pastor, Dayana Ribas, Alfonso Ortega, Antonio Miguel and Eduardo Lleida | |
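A minimal sketch of one common way to obtain an utterance-level HuBERT representation for SER, by mean-pooling the frame-level outputs; the checkpoint and the pooling strategy are assumptions rather than the paper's exact configuration.

```python
# Hedged sketch: utterance-level HuBERT representation for emotion recognition.
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor(sampling_rate=16_000)                  # default config
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()   # hypothetical choice

def utterance_embedding(waveform_16k: torch.Tensor) -> torch.Tensor:
    """waveform_16k: 1-D float tensor sampled at 16 kHz. Returns a (hidden_dim,) vector."""
    inputs = extractor(waveform_16k.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        frames = hubert(**inputs).last_hidden_state   # (1, T, D)
    return frames.mean(dim=1).squeeze(0)              # mean-pool over frames

# The pooled vectors would then train a small emotion classifier on IEMOCAP,
# optionally with EmoDB and RAVDESS added to the training pool.
```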
O4.2 15:20 – 15:40 |
Analysis of Trustworthiness Recognition models from an aural and emotional perspective (abs Trustworthiness and deception recognition attract the research community’s attention due to their relevant role in social negotiations and other relevant areas. Despite the increasing interest in the field, there are still many questions about how to perform automatic deception detection or which features best explain how people perceive trustworthiness. Previous studies have demonstrated that emotions and sentiments correlate with deception. However, not many articles have employed deep-learning models pre-trained on emotion recognition tasks to predict trustworthiness. For this reason, this paper compares traditional statistical functional feature sets proposed for emotion recognition, such as eGeMAPS, with features extracted from deep-learning models, like AlexNet, CNN-14 or xlsr-Wav2Vec2.0, pre-trained on emotion recognition tasks. After obtaining each set of features, we train a Support Vector Machine (SVM) model on deception detection. These experiments provide a baseline to understand how methodologies exploited in emotion recognition tasks could be applied to speech trustworthiness recognition. Utilizing the eGeMAPS feature set on deception detection achieved an accuracy of 65.98% at turn level, and employing transfer learning on the embeddings extracted from a pre-trained xlsr-Wav2Vec2.0 improved this rate to 68.11%, surpassing the audio-modality baseline from previous works by 8.5%. ) |
Cristina Luna Jiménez, Ricardo Kleinlein, Syaheerah Lebai Lutfi, Juan M. Montero and Fernando Fernández-Martínez | |
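A minimal sketch of an eGeMAPS-plus-SVM baseline of the kind described above, assuming the openSMILE Python package (opensmile) for feature extraction and scikit-learn for the classifier; paths, labels and SVM settings are placeholders.

```python
# Hedged sketch: eGeMAPS functionals as turn-level features for an SVM classifier.
import numpy as np
import opensmile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # 88 functionals per utterance
    feature_level=opensmile.FeatureLevel.Functionals,
)

def egemaps_features(wav_paths):
    """Return an (utterances, 88) matrix of eGeMAPS functionals."""
    return np.vstack([smile.process_file(p).to_numpy() for p in wav_paths])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Usage (placeholder paths and labels for the turn-level training data):
# X = egemaps_features(["turn_001.wav", "turn_002.wav"])
# clf.fit(X, [0, 1])   # e.g. 0 = truthful, 1 = deceptive
```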
O4.3 15:40 – 16:00 |
Speech and Text Processing for Major Depressive Disorder Detection (abs Major Depressive Disorder (MDD) is a common mental health issue these days. Its early diagnosis is vital to avoid more serious consequences and to provide appropriate treatment. Speech and transcriptions of patients’ interviews are useful information sources for the automatic screening of MDD. In this sense, speech- and text-based systems are proposed in this paper, using the DAIC-WOZ dataset as the experimental framework. The speech-based one is a Sequence-to-Sequence (S2S) model with a local attention mechanism. The text-based one is based on GloVe features and a Convolutional Neural Network as classifier. Some of the most relevant results achieved by other research publications on DAIC-WOZ are also described, in order to provide a better understanding of the context of our systems’ results. In general, the S2S architecture provides mostly better results than previous speech-based systems. The GloVe-CNN system shows even better performance, suggesting that text is a more suitable information source for the detection of MDD when transcriptions are produced manually. However, automatically obtaining high-quality transcriptions is not a straightforward task, which makes the development of effective speech-based systems such as the one presented in this work necessary. ) |
Edward L. Campbell, Laura Docío Fernández, Nicholas Cummins and Carmen García Mateo | |
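A minimal, illustrative sketch of a text branch in the spirit of the GloVe-plus-CNN classifier mentioned above (not the authors' exact architecture); filter sizes, dimensions and the binary output are assumptions.

```python
# Hedged sketch: a 1-D CNN over GloVe-embedded token sequences for binary screening.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, emb_dim: int = 300, n_filters: int = 100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k, padding=k // 2) for k in kernel_sizes]
        )
        self.classifier = nn.Linear(n_filters * len(kernel_sizes), 2)

    def forward(self, x):                      # x: (batch, seq_len, emb_dim) GloVe vectors
        x = x.transpose(1, 2)                  # -> (batch, emb_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))   # (batch, 2) logits

# Toy batch: 4 transcripts of 128 tokens, already mapped to 300-d GloVe embeddings.
logits = TextCNN()(torch.randn(4, 128, 300))
```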
O4.4 16:00 – 16:20 |
Bridging the Semantic Gap with Affective Acoustic Scene Analysis: an Information Retrieval-based Approach (abs Human emotions induce physiological and physical changes in the body and can ultimately influence our actions. Their study belongs to the field of Affective Computing, which aims to improve human-computer interaction tasks. Defining an ‘affective acoustic scene’ as an acoustic environment that can induce specific emotions, in this work we aim to characterize acoustic scenes that elicit affective states with regard to the acoustic events occurring and the available acoustic information. This is achieved by generating emotion embeddings to define the ‘affective acoustic fingerprint’ of such affective acoustic scenes. We use YAMNet, an acoustic event classifier trained on AudioSet, to classify acoustic events in the WEMAC audiovisual stimuli dataset. Each video in this dataset is labelled by crowd-sourcing with the categorical emotion it induces. We then determine the relevance of the detected acoustic events that induce each emotion by performing an affective acoustic mapping, creating interpretable acoustic fingerprints of such emotions by means of the well-known information-retrieval-based TF-IDF algorithm. This paper intends to shed light on the path to the definition of emotional acoustic embeddings. ) |
Clara Luis-Mingueza, Esther Rituerto-González and Carmen Peláez-Moreno | |
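The information-retrieval view described above can be illustrated with a toy sketch: treat each emotion category as a "document" made of the acoustic-event labels detected in the videos it annotates (e.g. by YAMNet), and weight events with TF-IDF to obtain an interpretable fingerprint. The event lists below are placeholders, not WEMAC data.

```python
# Toy sketch of TF-IDF "affective acoustic fingerprints" over detected event labels.
from sklearn.feature_extraction.text import TfidfVectorizer

emotion_event_docs = {
    "fear":    "scream siren glass_breaking footsteps silence",
    "joy":     "laughter music applause speech",
    "sadness": "rain speech sobbing silence music",
}

vectorizer = TfidfVectorizer(token_pattern=r"[^\s]+")
tfidf = vectorizer.fit_transform(emotion_event_docs.values())
events = vectorizer.get_feature_names_out()

for emotion, row in zip(emotion_event_docs, tfidf.toarray()):
    top = sorted(zip(events, row), key=lambda t: -t[1])[:3]
    print(emotion, "->", [e for e, w in top])   # most characteristic events per emotion
```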
O4.5 16:20 – 16:40 |
Detecting Gender-based Violence aftereffects from Emotional Speech Paralinguistic Features (abs Speech is known to provide information regarding the person speaking, such as their gender, identity, emotions, and even disorders or trauma. In this paper we aim to answer the following question: can women who have suffered from gender-based violence (GBV) be distinguished from those who have not, just by using speech paralinguistic cues? In this work, we intend to demonstrate whether there exist measurable differences between the emotional expression in the voice of GBV victims (GBVV) and non-victims (Non-GBVV). The present study was carried out in the framework of the project EMPATIA-CM, whose aim is to understand the reaction of GBVV to dangerous situations and develop automatic mechanisms to protect them. For this purpose, we use data collected and partly published from the WEMAC Database, a multimodal database containing physiological and speech data from women who have and have not suffered from GBV while watching different emotion-eliciting video clips. The performed analysis shows that such differences do exist and, therefore, that suffering from GBV alters the way women react to the same emotion-eliciting stimulus in terms of physical variables, specifically certain voice features. ) |
Emma Reyner Fuentes, Esther Rituerto González, Clara Luis Mingueza, Carmen Peláez Moreno and Celia López Ongil | |
O4.6 16:40 – 17:00 |
Extraction of structural and semantic features for the identification of Psychosis in European Portuguese (abs Psychosis is a brain condition that affects subjects and the way they perceive the world around them, impairing their cognitive and speech capabilities and creating a disconnection from the reality in which they are inserted. Psychosis lacks formal and precise diagnostic tools, relying on self-reports from patients, their families, and specialized clinicians. Previous studies have focused on the identification and prediction of psychosis through surface-level analysis of diagnosed patients, targeting audio, time, and paucity features. More recent studies have started focusing on high-level and complex language analysis such as semantics, structure, and pragmatics. Only a reduced number of studies have targeted the Portuguese language, and currently no study has targeted structural or semantic features in European Portuguese; thus, this is our objective. The results obtained through our work suggest that the use of structural and semantic features, particularly for European Portuguese, holds some power in classifying subjects as diagnosed with psychosis or not. However, further research is required to identify possible improvements to the techniques employed and to concretely identify which particular features hold the most power during the classification tasks. ) |
Rodrigo Sousa, Helena Sofia Pinto, Alberto Abad, Daniel Neto and Joaquim Gago |
Albayzin Evaluations
Tuesday, 15 November 2022 (17:20 – 19:20)
A.1 17:20 – 19:20 |
The Vicomtech-UPM Speech Transcription Systems for the Albayzín-RTVE 2022 Speech to Text Transcription Challenge (abs This paper describes the Vicomtech-UPM submission to the Albayzín-RTVE 2022 Speech to Text Transcription Challenge, which calls for automatic speech transcription systems to be evaluated on realistic TV shows. A total of 4 systems were built and presented to the evaluation challenge, considering the primary system alongside three contrastive systems. Each system was built on top of a different architecture, with the aim of testing several state-of-the-art modelling approaches focused on different learning techniques and typologies of neural networks. The primary system used the self-supervised Wav2vec2.0 model as the pre-trained model of the transcription engine. This model was fine-tuned with in-domain labelled data and the initial hypotheses were re-scored with a pruned 4-gram language model. The first contrastive system corresponds to a pruned RNN-Transducer model, composed of a Conformer encoder and a stateless prediction network using BPE word-pieces as output symbols. As the second contrastive system, we built a Multistream-CNN acoustic-model-based system with a non-pruned 3-gram model for decoding and an RNN-based language model for rescoring the initial lattices. Finally, results obtained with the publicly available Large model of the recently published Whisper engine were also presented as the third contrastive system, with the aim of serving as a reference benchmark for other engines. Along with the description of the systems, the results obtained on the Albayzín-RTVE 2020 and 2022 test sets by each engine are presented as well. ) |
Haritz Arzelus, Iván G. Torres, Juan Manuel Martín-Doñas, Ander González-Docasal and Aitor Alvarez | |
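As an illustration of how an initial hypothesis is read off a CTC acoustic model such as a fine-tuned Wav2vec2.0 system, the sketch below performs greedy CTC decoding (merge repeats, drop blanks); the toy vocabulary is an assumption, and the pruned 4-gram rescoring mentioned in the abstract would then re-rank alternative hypotheses.

```python
# Hedged sketch: greedy CTC decoding of acoustic-model frame scores.
import numpy as np

VOCAB = ["<blank>", " ", "a", "b", "c", "d", "e"]   # toy vocabulary for illustration

def greedy_ctc_decode(logits: np.ndarray, blank_id: int = 0) -> str:
    """logits: (frames, vocab) scores emitted by the acoustic model."""
    best = logits.argmax(axis=-1)
    out, prev = [], blank_id
    for idx in best:
        if idx != prev and idx != blank_id:      # CTC rule: merge repeats, skip blanks
            out.append(VOCAB[idx])
        prev = idx
    return "".join(out)

# Toy frame scores standing in for real acoustic-model outputs.
frame_scores = np.random.randn(50, len(VOCAB))
print(greedy_ctc_decode(frame_scores))
```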
A.2 17:20 – 19:20 |
TID Spanish ASR system for the Albayzin 2022 Speech-to-Text Transcription Challenge (abs This paper describes Telefónica I+D’s participation in the IberSPEECH-RTVE 2022 Speech-to-Text Transcription Challenge. We built an end-to-end acoustic Automatic Speech Recognition (ASR) system based on the large XLS-R architecture. We first trained it with already-aligned data from CommonVoice. We then adapted it to the TV broadcasting domain with a self-supervised method. For that purpose, we used an iterative pseudo-forced alignment algorithm fed with frame-wise character posteriors produced by our ASR. This allowed us to recover up to 166 hours from the RTVE2018 and RTVE2022 databases. We additionally explored using a transformer-based seq2seq translator system as a Language Model (LM) to correct the transcripts of the acoustic ASR. Our best system achieved 24.27% WER on the test split of RTVE2020. ) |
Fernando López and Jordi Luque | |
A.3 17:20 – 19:20 |
BCN2BRNO: ASR System Fusion for Albayzin 2022 Speech to Text Challenge (abs This paper describes the joint effort of BUT and Telefónica Research on the development of Automatic Speech Recognition systems for the Albayzin 2022 Challenge. We train and evaluate both hybrid systems and those based on end-to-end models. We also investigate the use of self-supervised learning speech representations from pre-trained models and their impact on ASR performance (as opposed to training models directly from scratch). Additionally, we also apply the Whisper model in a zero-shot fashion, postprocessing its output to fit the required transcription format. On top of tuning the model architectures and overall training schemes, we improve the robustness of our models by augmenting the training data with noises extracted from the target domain. Moreover, we apply rescoring with an external LM on top of N-best hypotheses to adjust each sentence score and pick the single best hypothesis. All these efforts lead to a significant WER reduction. Our single best system and the fusion of selected systems achieved 16.3% and 13.7% WER respectively on the RTVE2020 test partition, i.e. the official evaluation partition from the previous Albayzin challenge. ) |
Martin Kocour, Jahnavi Umesh, Martin Karafiat, Ján Švec, Fernando López, Jordi Luque, Karel Beneš, Mireia Diez, Igor Szoke, Karel Veselý, Lukáš Burget and Jan Černocký | |
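A generic sketch of the N-best rescoring step mentioned above: each hypothesis keeps its ASR score and receives an external LM score, and an interpolation weight (value assumed here) selects the single best sentence.

```python
# Generic sketch of N-best rescoring with an external language model.
from dataclasses import dataclass
from typing import List

@dataclass
class Hypothesis:
    text: str
    asr_score: float   # log-probability from the ASR decoder
    lm_score: float    # log-probability from the external LM

def rescore(nbest: List[Hypothesis], lm_weight: float = 0.3) -> str:
    return max(nbest, key=lambda h: h.asr_score + lm_weight * h.lm_score).text

nbest = [
    Hypothesis("buenas noches a todos", asr_score=-12.1, lm_score=-20.4),
    Hypothesis("buenas noche a todos", asr_score=-11.9, lm_score=-25.8),
]
print(rescore(nbest))   # the LM term favours the well-formed first sentence
```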
A.5 17:20 – 19:20 |
Intelligent Voice Speaker Recognition and Diarization System for IberSpeech 2022 Albayzin Evaluations Speaker Diarization and Identity Assignment Challenge (abs This paper describes the system developed by Intelligent Voice for the IberSpeech 2022 Albayzin Evaluations Speaker Diarization and Identity Assignment Challenge (SDIAC). The presented Variational Bayes x-vector Voice Print Extraction (VBxVPE) system is capable of capturing vocal variations using multiple x-vector representations with two-stage clustering and outlier detection refinement, and implements a Deep-Encoder Convolutional Autoencoder Denoiser (DE-CADE) network for denoising segments with noise and music for robust speaker recognition and diarization. When evaluated against the Radiotelevisión Española (RTVE) 2022 evaluation dataset, the system was able to obtain a Diarization Error Rate (DER) of ..% and an Error Rate of ..%. ) |
Roman Shrestha, Cornelius Glackin, Julie Wall and Nigel Cannings | |
A.6 17:20 – 19:20 |
ViVoLAB System Description for the S2TC IberSPEECH-RTVE 2022 challenge (abs In this paper we describe the ViVoLAB system for the IberSPEECH-RTVE 2022 Speech to Text Transcription Challenge. The system is a combination of several subsystems designed to perform a full subtitle edition process from the raw audio to the creation of aligned subtitle transcribed partitions. The subsystems include a phonetic recognizer, a phonetic subword recognizer, a speaker-aware subtitle partitioner, a sequence-to-sequence translation model working with orthographic tokens to produce the desired transcription, and an optional diarization step with the previously estimated segments. Additionally, we use recurrent network based language models to improve results for steps that involve search algorithms, like the subword decoder and the sequence-to-sequence model. The technologies involved include unsupervised models like WavLM to deal with the raw waveform, as well as convolutional, recurrent, and transformer layers. As a general design pattern, we allow all the systems to access previous outputs or inner information, but the choice of successful communication mechanisms has been a difficult process due to the size of the datasets and long training times. The best solution found will be described and evaluated on some reference tests of the 2018 and 2020 IberSPEECH-RTVE S2TC evaluations. ) |
Antonio Miguel, Alfonso Ortega and Eduardo Lleida | |
A.7 17:20 – 19:20 |
GTTS Systems for the Albayzin 2022 Speech and Text Alignment Challenge (abs This paper describes the most relevant features of the alignment approach used by our research group (GTTS) for the Albayzin 2022 Text and Speech Alignment Challenge: Alignment of re-spoken subtitles (TaSAC-ST). It also presents and analyzes the results obtained by our primary and contrastive systems, focusing on the variability observed in the RTVE broadcasts used for this evaluation. The task is to provide hypothesized start and end times for each subtitle to be aligned. To that end, our systems decode the audio at the phonetic level using acoustic models trained on external (non-RTVE) data, then align the recognized sequence of phones with the phonetic transcription of the corresponding text and transfer the timestamps of the recognized phones to the aligned text. The alignment error for each subtitle is computed as the sum of the absolute values of the start and end alignment errors (with regard to a manually supervised ground truth). The median of the alignment errors (MAE) for each broadcast is reported to compare system performance. Our primary system yielded MAEs between 0.20 and 0.36 seconds on the development set, and between 0.22 and 1.30 seconds on the test set, with average MAEs of 0.295 and 0.395, respectively. ) |
Germán Bordel, Luis Javier Rodriguez-Fuentes, Mikel Peñagarikano and Amparo Varona |
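The evaluation metric described in the abstract can be made concrete with a short sketch: the per-subtitle alignment error is the sum of the absolute start and end errors, and each broadcast is summarised by the median of these errors (MAE); the subtitle times below are toy values, not challenge data.

```python
# Sketch of the metric above: per-subtitle error = |start error| + |end error| (seconds);
# each broadcast is summarised by the median of these per-subtitle errors (MAE).
import statistics

def subtitle_error(hyp, ref):
    """hyp, ref: (start, end) times in seconds for one subtitle."""
    return abs(hyp[0] - ref[0]) + abs(hyp[1] - ref[1])

def broadcast_mae(hyps, refs):
    """hyps, refs: lists of (start, end) pairs for the subtitles of one broadcast."""
    return statistics.median(subtitle_error(h, r) for h, r in zip(hyps, refs))

# Toy broadcast with three subtitles.
hyps = [(1.02, 3.51), (4.10, 6.00), (7.25, 9.80)]
refs = [(1.00, 3.40), (4.00, 6.10), (7.00, 9.75)]
print(round(broadcast_mae(hyps, refs), 3))   # 0.2
```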