
Keynote Lectures

Speech Technologies: Reaching Maturity?
Isabel Trancoso, L2F INESC-ID/IST, Portugal

Re-identification: State of the Art and Current Trends
Vittorio Murino, Department of Computer Science, University of Verona, Italy

The Next Grand Challenge in Computer Vision: From Gesture Recognition to Sign Language Recognition
Lale Akarun, Computer Engineering, Bogazici University, Turkey

Learning to See by Hearing
Antonio Torralba, Massachusetts Institute of Technology, United States


Speech Technologies: Reaching Maturity?

Isabel Trancoso
L2F INESC-ID/IST
Portugal

Brief Bio
Isabel Trancoso received the Licenciado, Mestre, Doutor and Agregado degrees in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, Portugal, in 1979, 1984, 1987 and 2002, respectively. She has been a lecturer at this university since 1979, having coordinated the EEC course for six years. She is currently a Full Professor, teaching speech processing courses, and is the President of the Electrical and Computer Engineering Department. She is also a senior researcher at INESC-ID Lisbon, where she launched the speech processing group, now restructured as L2F, in 1990. Her first research topic was medium-to-low bit rate speech coding, a topic she worked on at AT&T Bell Laboratories, Murray Hill, New Jersey, from October 1984 through June 1985. After her PhD, her research focus shifted to speech synthesis and recognition, with a special emphasis on tools and resources for the Portuguese language. Her current research scope is much broader, encompassing many areas in spoken language processing. Her recent PhD advising activities cover microblog translation, semi-supervised machine learning for statistical machine translation, privacy-preserving speech mining, lexical and prosodic entrainment in spoken dialogues, and disfluency detection in spontaneous speech. She was a member of the ISCA (International Speech Communication Association) Board (1993-1998), the IEEE Speech Technical Committee (since 1999) and the Permanent Council for the Organization of the International Conferences on Spoken Language Processing (since 1998). She was elected Editor-in-Chief of the IEEE Transactions on Speech and Audio Processing (2003-2005), Member-at-Large of the IEEE Signal Processing Society Board of Governors (2006-2008), Vice-President of ISCA (2005-2007) and President of ISCA (2007-2011). She chaired the Organizing Committee of the INTERSPEECH 2005 Conference, which took place in September 2005 in Lisbon, and chaired the IEEE James Flanagan Award Committee (2013-2014). She currently serves on the ISCA Advisory Council, the ISCA Distinguished Lecturer Selection Committee (as Chair), the ELRA Board (as Vice-President), the IEEE Fellows Committee, and the IEEE Publication Services and Products Board Strategic Planning Committee. She received the 2009 IEEE Signal Processing Society Meritorious Service Award, was elevated to IEEE Fellow in 2011, and to ISCA Fellow in 2014.


Abstract
After many decades of unfulfilled promises, speech technologies have reached a performance level that enables their use in a very wide range of applications. Recent progress stems from increased processing power and huge amounts of training data, which together make it possible to use advanced machine learning techniques such as deep learning. But are we there yet?
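
As a rough illustration of this maturity, a pretrained deep model now yields a usable English transcription in a few lines. The sketch below is a minimal example, not part of the talk: it assumes torchaudio with its bundled wav2vec 2.0 ASR pipeline, and "utterance.wav" is a hypothetical input file.

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 ASR model bundled with torchaudio (trained on LibriSpeech).
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")  # hypothetical input file
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)  # (batch, frames, characters) logits

# Greedy CTC decoding: best character per frame, collapse repeats, drop blanks.
labels = bundle.get_labels()  # ('-', '|', 'E', 'T', ...); '-' is the CTC blank
indices = emissions[0].argmax(dim=-1).tolist()
chars, prev = [], None
for i in indices:
    if i != prev and labels[i] != "-":
        chars.append(labels[i])
    prev = i
print("".join(chars).replace("|", " "))  # '|' marks word boundaries
```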

Rather than reviewing the limitations and promises of current spoken language technologies in general, we concentrate on two application areas in which the potential of these technologies remains, in our opinion, largely unexplored: eHealth and eLearning.


Re-identification: State of the Art and Current Trends

Vittorio Murino
Department of Computer Science, University of Verona
Italy
https://www.vittoriomurino.com/

Brief Bio
Vittorio Murino is a full professor at the University of Verona, Italy, and also holds a double appointment with the University of Genova. He received the Laurea degree in Electronic Engineering in 1989 and the Ph.D. in Electronic Engineering and Computer Science in 1993 from the University of Genova, Italy. From 2009 to 2019, he worked at the Istituto Italiano di Tecnologia in Genova, Italy, as founder and director of the PAVIS (Pattern Analysis and Computer Vision) department, with which he still collaborates as a visiting scientist. From 2019 to 2021, he worked as Senior Video Intelligence Expert at the Ireland Research Centre of Huawei Technologies (Ireland) Co., Ltd. in Dublin. His main research interests include computer vision and machine learning, currently focusing on deep learning approaches, domain adaptation and generalization, and multimodal learning for (human) behavior analysis and related applications, such as video surveillance and biomedical imaging. Prof. Murino is co-author of more than 400 papers published in refereed journals and international conferences, a member of the technical committees of important conferences (CVPR, ICCV, ECCV, ICPR, ICIP, etc.), and guest co-editor of special issues in relevant scientific journals. He is also a member of the editorial boards of the Computer Vision and Image Understanding and Machine Vision & Applications journals. Finally, Prof. Murino is an IEEE Fellow, IAPR Fellow, and ELLIS Fellow.


Abstract
Re-identification (re-id) is a distinctive topic in video surveillance that also finds important applications in other domains. It consists in the (re-)identification of a person across several cameras with non-overlapping fields of view, possibly installed in very different locations. Since the recognition process is based mainly on the appearance of the subjects, it is implicitly assumed that clothing remains the same; the process does not involve face recognition or other (hard) biometric cues. This topic emerged around the beginning of the century and has received a boost in recent years, as video monitoring and surveillance applications have become more and more important in our lives.
Since then, a large number of re-id methods have been proposed, broadly based either on the design of ad hoc person signatures able to finely characterise each subject, or on the learning of suitable metrics to compare appearance features extracted from each person. In addition, a number of public benchmark datasets have been released so that the scientific community can compete under common scenarios and operating conditions. Despite its simple definition, re-id is a very challenging problem, especially because human beings are among the most challenging deformable, part-based "objects" to characterise uniquely: they can be located in any (indoor or outdoor) environment and can be occluded by other people or objects. These issues make re-id an interesting topic that is far from being effectively solved in real environments and actual applications.
In this talk, I will define the problem and describe the main methods and scenarios devised to tackle it, from the basic techniques up to the more recent ones. These include direct methods, based on signature definition and matching, and (metric) learning methods, based on the learning of feature transformations, so that person signatures can be projected into suitable spaces with increased discriminative power. More recent approaches, such as those based on deep learning or using other sensing modalities, will also be covered. The ultimate aim is to provide a complete overview of the problem, the approaches, and the future perspectives of this topic, which is interesting and challenging in both scientific and practical terms.
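
To make these two families concrete, here is a minimal sketch, not any specific published method: a colour-histogram signature for direct matching, plus a learned linear projection standing in for metric learning. The OpenCV calls are standard; the projection matrix L and the person crops are assumptions for illustration.

```python
import numpy as np
import cv2  # OpenCV

def signature(crop_bgr, bins=16):
    """Direct-method style signature: a normalised HSV colour histogram
    computed over a person crop (clothing appearance, no biometric cues)."""
    hsv = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def learned_distance(x, y, L):
    """Metric-learning style comparison: Euclidean distance after projecting
    the feature difference with a matrix L learned offline (e.g. by a
    KISSME/XQDA-like method); L here is an assumed input."""
    d = L @ (x - y)
    return float(np.sqrt(d @ d))

# Ranking a probe against a gallery of crops (variables are hypothetical):
# gallery_sigs = [signature(c) for c in gallery_crops]
# probe_sig = signature(probe_crop)
# ranking = sorted(range(len(gallery_sigs)),
#                  key=lambda i: learned_distance(probe_sig, gallery_sigs[i], L))
```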



The Next Grand Challenge in Computer Vision: From Gesture Recognition to Sign Language Recognition

Lale Akarun
Computer Engineering, Bogazici University
Turkey
https://www.cmpe.boun.edu.tr/~akarun/doku.php

Brief Bio
Lale Akarun received the PhD degree in Electrical Engineering from the Polytechnic School of Engineering of NYU in 1992. She has been a faculty member of Bogazici University, Istanbul, since 1993, serving in the Electrical-Electronic Engineering and Computer Engineering Departments, and became a full professor of Computer Engineering in 2002. She served as Department Head of Computer Engineering (2010-2012) and as Vice Rector for Research (2012-2016); in the latter role, her responsibilities included the university's sponsored research projects, technology transfer, incubation centers, and technoparks. Her research areas are image processing, computer vision, and computer graphics. She has supervised 50 graduate theses and published more than 200 scholarly papers in scientific journals and refereed conferences. She has conducted research projects in biometrics, face recognition, hand gesture recognition, human-computer interaction, and sign language recognition. She was involved in organizing SIU 1992-2016, NSIP 1999, ICASSP 2005, eNTERFACE 2007, ICPR 2010, and ICMI 2014.


Abstract
Gesture recognition has attracted the interest of researchers for decades: it was envisioned as an attractive alternative to the mouse. In-the-air gestures are now commonly used in many applications. Two factors have accelerated this development: sensors and more powerful machine learning algorithms. RGBD sensors make it possible to extract the human body from the background, and powerful machine learning methods can estimate the pose of the body. It is now possible to extract features from the articulated skeleton and recognize gestures. This has helped bring to life many applications using gestures of the human body.
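
As a hedged sketch of this skeleton-based pipeline (the joint indices, the normalisation, and the nearest-template classifier are illustrative assumptions, not a specific published system), one can normalise the joints for translation and scale and then match sequences with dynamic time warping:

```python
import numpy as np

def normalize_skeleton(joints, hip=0, spine=1):
    """joints: (T, J, 3) array of 3D joint positions from an RGBD sensor.
    The hip/spine indices are assumptions for a Kinect-style skeleton."""
    centered = joints - joints[:, hip:hip + 1, :]           # translation invariance
    scale = np.linalg.norm(centered[:, spine, :], axis=-1)  # torso length per frame
    return centered / (scale[:, None, None] + 1e-8)         # scale invariance

def dtw_distance(a, b):
    """Dynamic time warping between feature sequences of shape (T1, D), (T2, D)."""
    t1, t2 = len(a), len(b)
    D = np.full((t1 + 1, t2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[t1, t2]

def classify_gesture(query_joints, templates):
    """Nearest-template recognition; templates: list of (label, (T, D) features)."""
    q = normalize_skeleton(query_joints).reshape(len(query_joints), -1)
    label, _ = min(templates, key=lambda t: dtw_distance(q, t[1]))
    return label
```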

The ultimate grand challenge in gesture recognition is, of course, sign language recognition. In sign language, body gestures, hand shapes, and facial expressions all convey meaning. Sign language is the native means of communication of the Deaf. Each Deaf community has its own sign language, so there are as many sign languages as there are Deaf communities. American Sign Language is the most studied sign language in computer vision, but recent developments in RGBD sensors and deep learning methods have accelerated work on other languages, such as Chinese Sign Language, German Sign Language, British Sign Language, and Turkish Sign Language, among others. In this talk, I will give an overview of recent work and discuss some unsolved challenges.



Learning to See by Hearing

Antonio Torralba
Massachusetts Institute of Technology
United States

Brief Bio
Antonio Torralba received the degree in telecommunications engineering from Telecom BCN, Spain, in 1994 and the Ph.D. degree in signal, image, and speech processing from the Institut National Polytechnique de Grenoble, France, in 2000. From 2000 to 2005, he was a postdoctoral researcher at the Brain and Cognitive Sciences Department and the Computer Science and Artificial Intelligence Laboratory, MIT. He is now a Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT). Prof. Torralba is an Associate Editor of the International Journal of Computer Vision and served as program chair of the Computer Vision and Pattern Recognition conference in 2015. He received the 2008 National Science Foundation (NSF) CAREER Award, the best student paper award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2009, and the 2010 J. K. Aggarwal Prize from the International Association for Pattern Recognition (IAPR).


Abstract
It is an exciting time for computer vision. With the success of new computational architectures for visual processing, such as deep neural networks (e.g., convNets), and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. Computer vision is now present in many commercial products, such as digital cameras, web applications, and security applications.

The performance achieved by convNets is remarkable and constitutes the state of the art on many recognition tasks. But why do they work so well? What is the nature of the internal representation learned by the network? I will show that the internal representation can be interpretable; in particular, object detectors emerge inside a network trained on a scene classification task. Then, I will show that an ambient audio signal can be used as a supervisory signal for learning visual representations. We do this by taking advantage of the fact that vision and hearing often tell us about similar structures in the world, such as when we see an object and simultaneously hear it make a sound. We train a convNet to predict ambient sound from video frames, and we show that, through this process, the model learns a visual representation that conveys significant information about objects and scenes.
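
A minimal sketch of this idea follows, with assumed sizes and a toy network rather than the actual architecture and clustering used in the work: ambient-sound features are clustered offline (e.g. with k-means), and a convNet is then trained to predict each frame's sound cluster, so the audio, not human annotation, supervises the visual features.

```python
import torch
import torch.nn as nn

N_SOUND_CLUSTERS = 30  # assumed number of k-means clusters over audio features

class FrameToSound(nn.Module):
    """Toy convNet trunk plus a classifier over sound clusters."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, N_SOUND_CLUSTERS)

    def forward(self, frames):            # frames: (B, 3, H, W)
        return self.head(self.features(frames).flatten(1))

model = FrameToSound()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

frames = torch.randn(8, 3, 128, 128)                     # a batch of video frames
sound_labels = torch.randint(0, N_SOUND_CLUSTERS, (8,))  # precomputed cluster ids

loss = loss_fn(model(frames), sound_labels)              # no human labels involved
loss.backward()
opt.step()
# After training, model.features is the learned visual representation.
```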


