ICPRAM 2025 Abstracts


Area 1 - Theory and Methods

Full Papers
Paper Nr: 30
Title:

Computation of 2D Discrete Geometric Moments on Quadtrees

Authors:

Paola Magillo and Lidija Čomić

Abstract: We address the problem of computing discrete geometric moments on 2D binary images encoded in the quadtree data structure. We do this by precomputing central moments of the squares of side length 2^k, and using the connection between ordinary and central moments. Compared with the state of the art for images encoded as quadtrees, our method considerably improves the efficiency of moment computation.
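
Illustrative sketch (not the authors' code) of the connection the abstract exploits, in Python under my own naming: the central moments of an s × s block depend only on s = 2^k and can be tabulated once per quadtree level; the ordinary moment of any black leaf then follows from a binomial shift to the leaf's centroid, and the image moment is the sum over black leaves.

```python
from math import comb

def central_moments_square(s, max_order=3):
    # Central moments mu[i][j] of an s x s block of unit pixels about its
    # centroid. They depend only on s = 2**k, so one table per level suffices
    # (and they vanish whenever i or j is odd, by symmetry).
    c = (s - 1) / 2.0
    one_d = [sum((t - c) ** i for t in range(s)) for i in range(max_order + 1)]
    return [[one_d[i] * one_d[j] for j in range(max_order + 1)]
            for i in range(max_order + 1)]

def block_moment(p, q, x0, y0, s, mu):
    # Ordinary moment m_pq of a full s x s black block with lowest pixel
    # (x0, y0), via the binomial shift about the block centroid (cx, cy):
    # m_pq = sum_ij C(p,i) C(q,j) cx^(p-i) cy^(q-j) mu_ij.
    cx, cy = x0 + (s - 1) / 2.0, y0 + (s - 1) / 2.0
    return sum(comb(p, i) * comb(q, j)
               * cx ** (p - i) * cy ** (q - j) * mu[i][j]
               for i in range(p + 1) for j in range(q + 1))
```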

Paper Nr: 31
Title:

Hyperspectral Image Compression Using Implicit Neural Representation and Meta-Learned Based Network

Authors:

Faisal Z. Qureshi and Shima Rezasoltani

Abstract: Hyperspectral images capture the electromagnetic spectrum for each pixel in a scene. These images often store hundreds of channels per pixel, providing significantly more information than a comparably sized RGB color image. As the cost of obtaining hyperspectral images decreases, there is a need to create effective ways for storing, transferring, and interpreting hyperspectral data. In this paper, we develop a neural compression method for hyperspectral images. Our methodology relies on transforming hyperspectral images into implicit neural representations, specifically neural functions that establish a correspondence between coordinates (such as pixel locations) and features (such as pixel spectra). Instead of explicitly saving the weights of the implicit neural representation, we record modulations that are applied to a base network that has been “meta-learned.” These modulations serve as a compressed coding for the hyperspectral image. We conducted an assessment of our approach using four benchmarks—Indian Pines, Jasper Ridge, Pavia University, and Cuprite—and our findings demonstrate that the suggested method achieves significantly faster compression times when compared to existing schemes for hyperspectral image compression.

Paper Nr: 41
Title:

LAST: Utilizing Synthetic Image Style Transfer to Tackle Domain Shift in Aerial Image Segmentation

Authors:

Yubo Wang, Ruijia Wen, Hiroyuki Ishii and Jun Ohya

Abstract: Recent deep learning models often struggle with performance degradation due to domain shifts. Addressing domain adaptation in aerial image segmentation is challenging due to the limited availability of training data. To tackle this, we utilized the Unreal Engine to construct a synthetic dataset featuring images captured under diverse conditions such as fog, snow, and nighttime settings. We then proposed a latent space style transfer model that generates alternate domain versions based on the real aerial dataset. This approach eliminates the need for additional annotations on shifted domain data. We benchmarked nine different state-of-the-art segmentation methods on the ISPRS Vaihingen and Potsdam datasets, and their shifted foggy domains. Extensive experiments reveal that domain shift leads to significant performance drops, with an average decrease of -3.46% mIoU on Vaihingen and -5.22% mIoU on Potsdam. Finally, we adapted the model to perform well in the shifted domain, achieving improvements of +2.97% mIoU on Vaihingen and +3.97% mIoU on Potsdam, while maintaining its effectiveness in the original domain.

Paper Nr: 53
Title:

Distribution Controlled Clustering of Time Series Segments by Reduced Embeddings

Authors:

Gábor Szűcs, Marcell Balázs Tóth and Marcell Németh

Abstract: This paper introduces a novel framework for clustering time series segments, addressing challenges like temporal misalignment, varying segment lengths, and computational inefficiencies. The method combines the Kolmogorov–Smirnov (KS) test for statistical segment comparison and adapted COP-KMeans for clustering with temporal constraints. To enhance scalability, we propose a basepoint selection strategy for embedding the time series segments that reduces the computational complexity from O(n^2) to O(n · b) by limiting comparisons to representative basepoints. The approach is evaluated on diverse time series datasets from domains such as motion tracking and medical signals. Results show improved runtime performance over traditional methods, particularly for large datasets. In addition, we introduce a confidence score to quantify the reliability of cluster assignments, with higher accuracy achieved by filtering low-confidence segments. We evaluated clustering performance using the Rand Index (RI), Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI). Our results demonstrate advantageous properties of the method in handling noise and different time series data, making it suitable for large scale applications.
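
A minimal sketch of the O(n · b) embedding idea, under my own assumptions (random basepoints, raw segment values fed to the two-sample KS test, which also tolerates unequal segment lengths); the paper's basepoint selection strategy is more deliberate.

```python
import numpy as np
from scipy.stats import ks_2samp

def embed_segments(segments, n_basepoints=16, seed=0):
    # Represent each segment by its vector of KS statistics to b basepoint
    # segments: n * b comparisons instead of the n^2 of all-pairs distances.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(segments), size=n_basepoints, replace=False)
    basepoints = [segments[i] for i in idx]
    return np.array([[ks_2samp(s, b).statistic for b in basepoints]
                     for s in segments])
```

The reduced embeddings can then be clustered with a constrained k-means such as COP-KMeans, as the abstract describes.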

Paper Nr: 54
Title:

Partition Tree Ensembles for Improving Multi-Class Classification

Authors:

Miran Özdogan, Alan Jeffares and Sean Holden

Abstract: We propose the Partition Tree Ensemble (PTE), a novel tree-based ensemble method for classification problems. This differs from previous approaches in that it combines ideas from reduction methods—that decompose multi-class classification problems into binary classification problems—with the creation of specialised base learners that are trained on a subset of the input space. By exploiting multi-class reduction, PTEs adapt concepts from the Trees of Predictors (ToP) method to successfully tackle multi-class classification problems. Each inner node of a PTE splits either the feature space or the label space into subproblems. For each node our method then selects the most appropriate base learner from the provided set of learning algorithms. One of its key advantages is the ability to optimise arbitrary loss functions. Through an extensive experimental evaluation, we demonstrate that our approach achieves significant performance gains over the baseline ToP and AdaBoost methods, across various datasets and loss functions, and outperforms the Random Forest method when the label space exhibits clusters where some classes are more similar to each other than to others.

Paper Nr: 56
Title:

DNN Layers Features Reduction for Out-of-Distribution Detection

Authors:

Mikhaël Presley Kibinda-Moukengue, Alexandre Baussard and Pierre Beauseroy

Abstract: Decision-making in a number of industries, including environmental management, transportation, and public health, is greatly aided by artificial intelligence systems. Nonetheless, to perform well, these systems must operate under certain usage conditions. For instance, the data fed into a classification neural network must come from the same distribution as the training data to maintain the performance measured during testing. In practice, however, this condition is not always met and is not easy to guarantee. In particular, for image recognition, it is possible to submit images that do not contain any learned classes and still receive a firm response from the network. This paper presents an approach to out-of-distribution observation detection applied to deep neural networks (DNNs) for image classification, called DNN Layers Features Reduction for Out-Of-Distribution Detection (DROOD). The principle of DROOD is to construct a decision statistic by successively synthesizing information from the features of all the intermediate layers of the classification network. The method is adaptable to any DNN architecture, and experiments show results that outperform reference methods.

Paper Nr: 63
Title:

A Re-Ranking Method Using K-Nearest Weighted Fusion for Person Re-Identification

Authors:

Quang-Huy Che, Le-Chuong Nguyen, Gia-Nghia Tran, Dinh-Duy Phan and Vinh-Tiep Nguyen

Abstract: In person re-identification, re-ranking is a crucial step to enhance the overall accuracy by refining the initial ranking of retrieved results. Previous studies have mainly focused on features from single-view images, which can cause view bias and issues like pose variation, viewpoint changes, and occlusions. Using multi-view features to represent a person can help reduce view bias. In this work, we present an efficient re-ranking method that generates multi-view features by aggregating neighbors’ features using the K-nearest Weighted Fusion (KWF) method. Specifically, we hypothesize that features extracted from re-identification models are highly similar when representing the same identity. Thus, we select K neighboring features in an unsupervised manner to generate multi-view features. Additionally, this study explores weight selection strategies during feature aggregation, allowing us to identify an effective strategy. Our re-ranking approach does not require model fine-tuning or extra annotations, making it applicable to large-scale datasets. We evaluate our method on the person re-identification datasets Market1501, MSMT17, and Occluded-DukeMTMC. The results show that our method significantly improves Rank@1 and mAP when re-ranking the top M candidates from the initial ranking results. Specifically, compared to the initial results, our re-ranking method achieves improvements of 9.8%/22.0% in Rank@1 on the challenging MSMT17 and Occluded-DukeMTMC datasets, respectively. Furthermore, our approach demonstrates substantial enhancements in computational efficiency compared to other re-ranking methods.
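
A hypothetical sketch of K-nearest-neighbour weighted fusion with cosine similarities as weights; the paper studies the weighting strategy in detail, so treat this as one plausible instance rather than the authors' final choice.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def kwf_fuse(x, ref, k=6):
    # Replace each feature by a similarity-weighted average of its k nearest
    # neighbours in `ref`: an unsupervised "multi-view" feature.
    x, ref = l2norm(x), l2norm(ref)
    sim = x @ ref.T
    nn = np.argsort(-sim, axis=1)[:, :k]          # k nearest neighbours
    w = np.take_along_axis(sim, nn, axis=1)       # cosine sims as weights
    w = w / w.sum(axis=1, keepdims=True)
    return l2norm(np.einsum('nk,nkd->nd', w, ref[nn]))

# Re-ranking then scores fused queries against fused gallery features:
# scores = kwf_fuse(q_feats, g_feats) @ kwf_fuse(g_feats, g_feats).T
```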

Paper Nr: 64
Title:

Comparison of CNN and Transformer Architectures for Robust Cattle Segmentation in Complex Farm Environments

Authors:

Alessandra Lumini, Guilherme Botazzo Rozendo, Maichol Dadi and Annalisa Franco

Abstract: In recent years, computer vision and deep learning have become increasingly important in the livestock industry, offering innovative animal monitoring and farm management solutions. This paper focuses on the critical task of cattle segmentation, an essential application for weight estimation, body condition scoring, and behavior analysis. Despite advances in segmentation techniques, accurately identifying and isolating cattle in complex farm environments remains challenging due to varying lighting conditions and overlapping objects. This study evaluates state-of-the-art segmentation models based on convolutional neural networks and transformers, which leverage self-attention mechanisms to capture long-range image dependencies. By testing these models across multiple publicly available datasets, we assess their performance and generalization capabilities, providing insights into the most effective methods for accurate cattle segmentation in real-world farm conditions. We also explore ensemble techniques, selecting pairs of segmenters with maximum diversity. The results are promising, as an ensemble of only two models improves performance over all stand-alone methods. The findings contribute to improving computer vision-based solutions for livestock management, enhancing their accuracy and reliability in practical applications.

Paper Nr: 100
Title:

CGLearn: Consistent Gradient-Based Learning for Out-of-Distribution Generalization

Authors:

Jawad Chowdhury and Gabriel Terejanu

Abstract: Improving generalization and achieving highly predictive, robust machine learning models necessitates learning the underlying causal structure of the variables of interest. A prominent and effective method for this is learning invariant predictors across multiple environments. In this work, we introduce a simple yet powerful approach, CGLearn, which relies on the agreement of gradients across various environments. This agreement serves as a powerful indication of reliable features, while disagreement suggests less reliability due to potential differences in underlying causal mechanisms. Our proposed method demonstrates superior performance compared to state-of-the-art methods in both linear and nonlinear settings across various regression and classification tasks. CGLearn shows robust applicability even in the absence of separate environments by exploiting invariance across different subsamples of observational data. Comprehensive experiments on both synthetic and real-world datasets highlight its effectiveness in diverse scenarios. Our findings underscore the importance of leveraging gradient agreement for learning causal invariance, providing a significant step forward in the field of robust machine learning. The source code of the linear and nonlinear implementation of CGLearn is open-source and available at: https://github.com/hasanjawad001/CGLearn.
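
A minimal PyTorch sketch of the core idea as I read it: keep an averaged parameter gradient only where its sign agrees across environments (akin to AND-masking); the authors' exact agreement criterion may differ.

```python
import torch

def agreement_masked_grads(env_losses, params, tau=1.0):
    # One gradient per environment; zero out coordinates whose sign is not
    # consistent across at least a fraction tau of the environments.
    grads = [torch.autograd.grad(loss, params, retain_graph=True)
             for loss in env_losses]
    masked = []
    for per_param in zip(*grads):            # tuple of per-env tensors
        g = torch.stack(per_param)           # (n_envs, *param.shape)
        agree = torch.sign(g).mean(dim=0).abs() >= tau
        masked.append(g.mean(dim=0) * agree.float())
    return masked                            # apply instead of loss.backward()
```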

Paper Nr: 102
Title:

Zeroth Order Optimization for Pretraining Language Models

Authors:

Nathan Allaire, Mahsa Ghazvini Nejad, Sébastien Le Digabel and Vahid Partovi Nia

Abstract: The physical memory required for training Large Language Models (LLMs) grows with the model size and is limited by the GPU memory. In particular, back-propagation, which requires the computation of first-order derivatives, adds to this memory overhead. Training extremely large language models with memory-efficient algorithms is still a challenge with theoretical and practical implications. Back-propagation-free training algorithms, also known as zeroth-order methods, have recently been examined to address this challenge. Their usefulness has been proven in the fine-tuning of language models. However, so far, there has been no study of language model pretraining using zeroth-order optimization, where the memory constraint is manifested more severely. We build the connection between the second-order, first-order, and zeroth-order methods theoretically. Then, we apply zeroth-order optimization to pre-training light-weight language models, and discuss why such methods cannot be readily applied. We show in particular that the curse of dimensionality is the main obstacle, and pave the way towards modifications of zeroth-order methods for pre-training such models.
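
For concreteness, a two-point (SPSA-style) zeroth-order step of the kind this line of work studies, using only forward passes and hence no activation memory for back-propagation; a sketch with an assumed loss_fn(model, batch) interface, not the paper's implementation.

```python
import torch

def zo_step(model, loss_fn, batch, lr=1e-4, eps=1e-3):
    # Estimate the directional derivative along a random direction z and
    # step: g ~ [L(theta + eps*z) - L(theta - eps*z)] / (2*eps), update -lr*g*z.
    params = [p for p in model.parameters() if p.requires_grad]
    z = [torch.randn_like(p) for p in params]
    with torch.no_grad():
        for p, zi in zip(params, z): p.add_(eps * zi)
        loss_plus = loss_fn(model, batch)
        for p, zi in zip(params, z): p.sub_(2 * eps * zi)
        loss_minus = loss_fn(model, batch)
        for p, zi in zip(params, z): p.add_(eps * zi)   # restore theta
        g = (loss_plus - loss_minus) / (2 * eps)        # scalar estimate
        for p, zi in zip(params, z): p.sub_(lr * g * zi)
```

The variance of this estimator grows with the parameter dimension, which is one concrete face of the curse of dimensionality the abstract points to.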

Paper Nr: 116
Title:

SSS: Similarity-Based Scheduled Sampling for Video Prediction

Authors:

Ryosuke Hata and Yoshihisa Shinozawa

Abstract: In video prediction tasks, numerous RNN-based models have demonstrated significant advancements. A well-established approach to enhancing these models during training is scheduled sampling. However, the adjustment of the probability parameter ε (scheduling) has not been adequately addressed, and current configurations are suboptimal for video prediction tasks. This issue arises because prior scheduling strategies depend solely on two factors, a function type and the total number of iterations, without considering motion-induced changes, one of the most crucial features in videos. To address this gap, we propose similarity-based scheduled sampling, which accounts for motion-induced changes. Specifically, we create difference frames between a given frame at a specific time step and another frame at a different time step, for both the model’s predicted output and the ground truth. We then assess the similarity of these difference frames at each iteration to determine whether motion-induced changes are properly learned, and incorporate this into the scheduling. Evaluation experiments show that the proposed method outperforms previous methods. Furthermore, an ablation study confirms the effectiveness of leveraging difference frames and demonstrates the significance of considering motion-induced changes.

Paper Nr: 125
Title:

Online Importance Sampling for Stochastic Gradient Optimization

Authors:

Corentin Salaün, Xingchang Huang, Iliyan Georgiev, Niloy Mitra and Gurprit Singh

Abstract: Machine learning optimization often depends on stochastic gradient descent, where the precision of gradient estimation is vital for model performance. Gradients are calculated from mini-batches formed by uniformly selecting data samples from the training dataset. However, not all data samples contribute equally to gradient estimation. To address this, various importance sampling strategies have been developed to prioritize more significant samples. Despite these advancements, all current importance sampling methods encounter challenges related to computational efficiency and seamless integration into practical machine learning pipelines. In this work, we propose a practical algorithm that efficiently computes data importance on-the-fly during training, eliminating the need for dataset preprocessing. We also introduce a novel metric based on the derivative of the loss w.r.t. the network output, designed for mini-batch importance sampling. Our metric prioritizes influential data points, thereby enhancing gradient estimation accuracy. We demonstrate the effectiveness of our approach across various applications. We first perform classification and regression tasks to demonstrate improvements in accuracy. Then, we show how our approach can also be used for online data pruning by identifying and discarding data samples that contribute minimally towards the training loss. This significantly reduces training time with negligible loss in the accuracy of the model.
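
For the common case of softmax cross-entropy, the derivative of the loss w.r.t. the network output is softmax(z) - onehot(y), so a per-sample importance of this kind is available from the forward pass alone. A minimal sketch of that reading (details may differ from the paper's metric):

```python
import torch

def output_grad_importance(logits, targets):
    # Per-sample norm of dLoss/dOutput; for softmax cross-entropy this is
    # ||softmax(z) - onehot(y)||, i.e. no extra backward pass is needed.
    g = torch.softmax(logits.detach(), dim=1)
    g[torch.arange(len(targets)), targets] -= 1.0
    imp = g.norm(dim=1)
    return imp / imp.sum()      # sampling probabilities over the pool
```

To keep the gradient estimate unbiased, a sample drawn with probability p_i is reweighted by 1/(N p_i) in the mini-batch average.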

Paper Nr: 134
Title:

Automated Handwriting Pattern Recognition for Multi-Level Personality Classification Using Transformer OCR (TrOCR)

Authors:

Marzieh Adeli Shamsabad and Ching Yee Suen

Abstract: Automated personality trait assessment from handwriting analysis offers applications in psychology, human-computer interaction, and personal profiling. However, accurately classifying different levels of personality traits remains challenging due to class imbalances in real-world datasets. This study addresses the issue by comparing multi-class and multi-label binary classification methods to predict levels of the Big Five personality traits: Extraversion, Neuroticism, Agreeableness, Conscientiousness, and Openness, each categorized as low, average, or high, in an imbalanced dataset of 873 French handwriting samples. A new approach is introduced by adapting the TrOCR pre-trained model for feature extraction, modifying its encoder to capture local and global handwriting features relevant to personality classification. This model is compared with three other pre-trained models: ResNet50 and Vision Transformer base 16 with input resolutions of 224 and 384. Results demonstrate that multi-label binary classification, which treats each trait level as an independent binary task, effectively addresses data imbalance, enhancing accuracy and generalization. The proposed TrOCR model achieves the highest performance, with an accuracy of 84.26%, an F1-score of 83.26%, and an AUROC of 91% on the test set. These findings emphasize the effectiveness of the presented framework for automated multi-level personality trait classification from handwriting in imbalanced datasets.

Paper Nr: 143
Title:

TokenOCR: An Attention Based Foundational Model for Intelligent Optical Character Recognition

Authors:

Charith Gunasekara, Zachary Hamel, Feng Du and Connor Baillie

Abstract: Optical Character Recognition (OCR) plays a pivotal role in digitizing and analyzing text from physical documents. Despite advancements in OCR technologies, challenges persist in handling diverse text layouts, poor-quality images, and complex fonts. In this paper, we present TokenOCR, an attention-based foundational model designed to overcome these limitations by integrating convolutional neural networks and transformer-based architectures. Unlike traditional OCR models that predict individual characters, TokenOCR predicts tokens, significantly enhancing recognition accuracy and efficiency. The model employs a ResNet50 feature extractor, an encoder with adaptive 2D positional embeddings, and a decoder utilizing multi-headed attention mechanisms for robust text recognition. To train TokenOCR, we used a dataset of six million images incorporating various real-world applications. Our training strategy integrates curriculum learning and adaptive learning rate scheduling to ensure efficient model convergence and generalization. Comprehensive evaluations using Word Error Rate (WER) and Character Error Rate (CER) demonstrate that TokenOCR consistently outperforms state-of-the-art models, including Tesseract and TrOCR, in both clean and degraded image conditions. These findings underscore TokenOCR’s potential to set new standards in OCR technology, offering a scalable, efficient, and highly accurate solution for diverse text recognition applications.

Paper Nr: 147
Title:

Rethinking Post-Training Quantization: Introducing a Statistical Pre-Calibration Approach

Authors:

Alireza Ghaffari, Sharareh Younesian, Boxing Chen, Vahid Partovi Nia and Masoud Asgharian

Abstract: As Large Language Models (LLMs) become increasingly computationally complex, developing efficient deployment strategies, such as quantization, becomes crucial. State-of-the-art Post-training Quantization (PTQ) techniques often rely on calibration processes to maintain the accuracy of these models. However, while these calibration techniques can enhance performance in certain domains, they may not be as effective in others. This paper aims to draw attention to robust statistical approaches that can mitigate such issues. We propose a weight-adaptive PTQ method that can be considered a precursor to calibration-based PTQ methods, guiding the quantization process to preserve the distribution of weights by minimizing the Kullback-Leibler divergence between the quantized weights and the originally trained weights. This minimization ensures that the quantized model retains the Shannon information content of the original model to a great extent, guaranteeing robust and efficient deployment across many tasks. As such, our proposed approach can perform on par with most common calibration-based PTQ methods, establishing a new pre-calibration step for further adjusting the quantized weights with calibration. We show that our pre-calibration results achieve the same accuracy as some existing calibration-based PTQ methods on various LLMs.
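
A statistical pre-calibration sketch in the spirit of the abstract (not the paper's algorithm): per weight tensor, pick the symmetric quantization scale whose quantize-dequantize reconstruction stays closest to the original weight histogram in Kullback-Leibler divergence.

```python
import numpy as np
from scipy.stats import entropy   # entropy(p, q) = KL(p || q)

def kl_precalibrated_scale(w, bits=4, n_grid=64, n_bins=128):
    # Grid-search the clipping range; score each candidate scale by the KL
    # divergence between histograms of original and dequantized weights.
    qmax = 2 ** (bits - 1) - 1
    bins = np.linspace(w.min(), w.max(), n_bins + 1)
    p, _ = np.histogram(w, bins=bins)
    best_kl, best_s = np.inf, None
    for clip in np.linspace(0.3, 1.0, n_grid) * np.abs(w).max():
        s = clip / qmax
        wq = np.clip(np.round(w / s), -qmax - 1, qmax) * s
        q, _ = np.histogram(wq, bins=bins)
        kl = entropy(p + 1e-6, q + 1e-6)       # smoothed KL(p || q)
        if kl < best_kl:
            best_kl, best_s = kl, s
    return best_s
```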

Paper Nr: 157
Title:

HI²: Sparse-View 3D Object Reconstruction with a Hybrid Implicit Initialization

Authors:

Pragati Jaiswal and Didier Stricker

Abstract: Accurate 3D object reconstruction is essential for various applications, including mixed reality and medicine. Recent advancements in deep learning-based methods and implicit 3D modelling have significantly enhanced the accuracy of 3D object reconstruction. Traditional methods enable reconstruction from a limited number of images, while implicit 3D modelling is proficient at capturing fine details and complex topologies. In this paper, we present a novel pipeline for 3D object reconstruction that combines the strengths of both approaches. Firstly, we use a 3D occupancy grid to generate a coarse 3D object from a few images. Secondly, we implement a novel and effective sampling strategy to transform the coarse reconstruction into an implicit representation, which is optimized to reduce computational cost and training time. This sampling strategy also keeps the reconstruction true to scale, given the actual camera intrinsic and extrinsic parameters. Finally, we refine the implicit representation and extract the 3D object mesh under a differentiable rendering scheme. Experiments on several datasets demonstrate that our proposed approach can reconstruct accurate 3D objects and outperforms state-of-the-art methods in terms of the Chamfer distance and Peak Signal-to-Noise Ratio metrics.

Paper Nr: 158
Title:

Poly-MgNet: Polynomial Building Blocks in Multigrid-Inspired ResNets

Authors:

Antonia van Betteray, Matthias Rottmann and Karsten Kahl

Abstract: The structural analogies between ResNets and Multigrid (MG) methods, such as common building blocks like convolutions and poolings, were already pointed out by He et al. in 2016. Multigrid methods are used in the context of scientific computing for solving large sparse linear systems arising from partial differential equations. MG methods particularly rely on two main concepts: smoothing and residual restriction / coarsening. Exploiting these analogies, He and Xu developed the MgNet framework, which integrates MG schemes into the design of ResNets. In this work, we introduce a novel neural network building block inspired by polynomial smoothers from MG theory. From an MG perspective, our polynomial block naturally extends the MgNet framework to Poly-MgNet and at the same time reduces the number of weights in MgNet. We present a comprehensive study of our polynomial block, analyzing the choice of initial coefficients, the polynomial degree, and the placement of activation functions as well as of batch normalizations. Our results demonstrate that constructing (quadratic) polynomial building blocks based on real and imaginary polynomial roots enhances Poly-MgNet’s capacity in terms of accuracy. Furthermore, our approach achieves an improved trade-off between model accuracy and number of weights compared to ResNet as well as to specific configurations of MgNet.

Paper Nr: 160
Title:

Euclidean Distance to Convex Polyhedra and Application to Class Representation in Spectral Images

Authors:

Antoine Bottenmuller, Florent Magaud, Arnaud Demortière, Etienne Decencière and Petr Dokladal

Abstract: With the aim of estimating the abundance map from observations only, linear unmixing approaches are not always suitable to spectral images, especially when the number of bands is too small or when the spectra of the observed data are too correlated. To address this issue in the general case, we present a novel approach which provides an adapted spatial density function based on any arbitrary linear classifier. A robust mathematical formulation for computing the Euclidean distance to polyhedral sets is presented, along with an efficient algorithm that provides the exact minimum-norm point in a polyhedron. An empirical evaluation on the widely-used Samson hyperspectral dataset demonstrates that the proposed method surpasses state-of-the-art approaches in reconstructing abundance maps. Furthermore, its application to spectral images of a Lithium-ion battery, incompatible with linear unmixing models, validates the method’s generality and effectiveness.
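
The paper derives an exact minimum-norm-point algorithm; as a reference baseline, the same quantity for a polyhedron {x : Ax ≤ b} can be checked with a small quadratic program (a sketch using SciPy's generic solver, not the authors' method):

```python
import numpy as np
from scipy.optimize import minimize

def min_norm_point(A, b):
    # Euclidean projection of the origin onto the polyhedron {x : A x <= b}.
    n = A.shape[1]
    res = minimize(lambda x: 0.5 * x @ x, np.zeros(n), jac=lambda x: x,
                   constraints=[{"type": "ineq",               # b - Ax >= 0
                                 "fun": lambda x: b - A @ x,
                                 "jac": lambda x: -A}])
    return res.x

def distance_to_polyhedron(A, b, p):
    # Shift coordinates so the query point p becomes the origin.
    return np.linalg.norm(min_norm_point(A, b - A @ p))
```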

Short Papers
Paper Nr: 16
Title:

Document Analysis with LLMs: Assessing Performance, Bias, and Nondeterminism in Decision Making

Authors:

Stephen Price and Danielle L. Cote

Abstract: In recent years, large language models (LLMs) have demonstrated their ability to perform complex tasks such as data summarization, translation, document analysis, and content generation. However, their reliability and efficacy in real-world scenarios must be studied. This work presents an experimental evaluation of an LLM for document analysis and candidate recommendation using a set of resumes. Llama3.1, a state-of-the-art open-source model, was tested with 30 questions using data from five resumes. On tasks with a direct answer, Llama3.1 achieved an accuracy of 99.56%. However, on more open-ended and ambiguous questions, performance and reliability decreased, revealing limitations such as bias toward particular experiences, primacy bias, nondeterminism, and sensitivity to question phrasing.

Paper Nr: 24
Title:

Rethinking Model Selection Beyond ImageNet Accuracy for Waste Classification

Authors:

Nermeen Abou Baker and Uwe Handmann

Abstract: Waste streams are growing rapidly due to higher consumption rates, and they present repeating patterns that can be classified with high accuracy due to advances in computer vision. However, collecting and annotating large datasets is time-consuming, but transfer learning can overcome this problem. Selecting the most appropriate pretrained model is critical to maximizing the benefits of transfer learning. Transferability metrics provide an efficient way to evaluate pretrained models without extensive retraining or brute-force methods. This study evaluates six transferability metrics for model selection in waste classification: Negative Conditional Entropy (NCE), Log Expected Empirical Prediction (LEEP), Logarithm of Maximum Evidence (LogME), TransRate, Gaussian Bhattacharyya Coefficient (GBC), and ImageNet accuracy. We evaluate these metrics on five waste classification datasets using 11 pretrained ImageNet models, comparing their performance for finetuning and head-training approaches. Results show that LogME correlates best with transfer accuracy for larger datasets, while ImageNet accuracy and TransRate are more effective for smaller datasets. Our method achieves up to 364x speed-up over brute-force selection, which demonstrates significant efficiency in practical applications.
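
Among the listed metrics, LEEP has a particularly compact definition (Nguyen et al., 2020): score a pretrained model by the log-likelihood of target labels under the empirical source-to-target label map built from its predictions. A sketch with my own variable names:

```python
import numpy as np

def leep(source_probs, target_labels, n_target_classes):
    # source_probs: (n, n_source) softmax outputs of the pretrained model
    # on the target set; target_labels: (n,) integer labels.
    n = len(target_labels)
    joint = np.zeros((n_target_classes, source_probs.shape[1]))
    for theta, y in zip(source_probs, target_labels):
        joint[y] += theta / n                       # empirical P(y, z)
    p_y_given_z = joint / joint.sum(axis=0, keepdims=True)
    eep = source_probs @ p_y_given_z.T              # (n, n_target)
    return np.mean(np.log(eep[np.arange(n), target_labels]))
```

Higher LEEP predicts better transfer, so model selection reduces to ranking candidate checkpoints by this score.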

Paper Nr: 58
Title:

Lifelong Learning Needs Sleep: Few-Shot Incremental Learning Enhanced by Sleep

Authors:

Yuma Kishimoto and Koichiro Yamauchi

Abstract: Catastrophic forgetting due to incremental learning in neural networks is a serious problem. We demonstrate that introducing a sleep period can address this issue from two perspectives. First, it provides a learning period for re-learning old memories. Second, it allows for time to process new learning. We applied a VAE, enhanced by an adapter, for incremental learning of new samples and generating valid samples from a few learning examples. These generated samples are used for the neural network’s re-learning, which also contributes to improving its generalization ability. The experimental results suggest that this approach effectively realizes few-shot incremental learning.

Paper Nr: 61
Title:

Enhanced Generative Data Augmentation for Semantic Segmentation via Stronger Guidance

Authors:

Quang-Huy Che, Duc-Tri Le, Bich-Nga Pham, Duc-Khai Lam and Vinh-Tiep Nguyen

Abstract: Data augmentation is crucial for pixel-wise annotation tasks like semantic segmentation, where labeling requires significant effort and intensive labor. Traditional methods, involving simple transformations such as rotations and flips, create new images but often lack diversity along key semantic dimensions and fail to alter high-level semantic properties. To address this issue, generative models have emerged as an effective solution for augmenting data by generating synthetic images. Controllable generative models offer data augmentation methods for semantic segmentation tasks by using prompts and visual references from the original image. However, these models face challenges in generating synthetic images that accurately reflect the content and structure of the original image, due to difficulties in creating effective prompts and visual references. In this work, we introduce an effective data augmentation pipeline for semantic segmentation using a controllable diffusion model. Our proposed method includes efficient prompt generation using Class-Prompt Appending and Visual Prior Blending to enhance attention to labeled classes in real images, allowing the pipeline to generate a precise number of augmented images while preserving the structure of segmentation-labeled classes. In addition, we implement a class balancing algorithm to ensure a balanced training dataset when merging the synthetic and original images. Evaluated on the PASCAL VOC datasets, our pipeline demonstrates its effectiveness in generating high-quality synthetic images for semantic segmentation. Our code is available at this https URL.

Paper Nr: 62
Title:

Improving Enjoyment of Cultural Heritage Through Recommender Systems, Virtual Tour, and Digital Storytelling

Authors:

M. Casillo, F. Colace, A. Lorusso, D. Santaniello and C. Valentino

Abstract: The integration of Information and Communication Technologies (ICT) into the world of cultural heritage serves as added value for its enhancement. In particular, improving the enjoyment of cultural Points of Interest by suggesting personalized routes allows for better interaction between users and the cultural site. To this end, this paper introduces an architecture that, by employing Recommendation Systems integrated with the Situation Awareness paradigm, identifies personalized paths for users from data acquired through smart sensors. The data are processed through the proposed approach, defined as a Multilevel Graph (MuG) approach, which filters them through contextual and ontological layers before processing them with a Bayesian network, whose structure is identified through structural learning algorithms integrated with the domain’s semantic knowledge. The architecture also incorporates physical and virtual experiences, exploiting the advantages of virtual tours and involving users more deeply by employing digital storytelling techniques. Testing of the proposed architecture based on the MuG approach took place through an offline experiment aimed at evaluating the accuracy of the approach and an online experiment to test the validity of the designed architecture.

Paper Nr: 68
Title:

A Comparative and Explainable Study of Machine Learning Models for Early Detection of Parkinson's Disease Using Spectrograms

Authors:

Hadjer Zebidi, Zeineb BenMessaoud and Mondher Frikha

Abstract: Parkinson's disease (PD) is a progressive neurodegenerative disorder that primarily affects the motor system. Therefore, early diagnosis is essential for effective intervention. Classic diagnostic approaches rely heavily on clinical observations and manual feature extraction, limiting the detection of subtle early vocal impairments. This research examines machine learning (ML) techniques, namely Support Vector Machines (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost), for early identification of PD through the analysis of spectrogram images derived from voice recordings. Mel-Frequency Cepstral Coefficients (MFCC), Short-Time Fourier Transform (STFT), and Mel-Spectrograms were extracted. Model performance was improved using the Synthetic Minority Over-sampling Technique (SMOTE) and hyperparameter tuning with GridSearchCV (Grid Search with Cross-Validation). Implementing the above methods resulted in significant performance improvements, with XGBoost achieving an accuracy of 95 ± 0.02 on the PC-GITA dataset and SVM attaining 90.74 ± 0.04 on the Neurovoz dataset. Local Interpretable Model-agnostic Explanations (LIME) enhanced model transparency by identifying the regions in spectrograms that most influence predictions. This analysis illustrates the efficacy of ML models utilizing SMOTE and GridSearchCV, particularly when augmented by LIME for interpretability, in improving early detection of PD, thereby presenting a feasible approach for clinical implementation.

Paper Nr: 69
Title:

ASTELCO: An Augmented Sparse Time Series Dataset with Generative Models

Authors:

Manuel Sánchez-Laguardia, Gastón García González, Emilio Martinez, Sergio Martinez, Alicia Fernández and Gabriel Gómez

Abstract: In recent years, there has been significant growth in the application of deep learning methods for classification, anomaly detection, and forecasting of time series. However, only some studies address problems involving sparse or intermittent demand time series, since the availability of sparse databases is scarce. This work compares the performance of three data augmentation approaches based on generative models and provides the code used to generate synthetic sparse and non-sparse time series. The experiments are carried out using a newly created sparse time series database, ASTELCO, which is generated from real e-commerce data (STELCO) supplied by a mobile Internet Service Provider. For the sake of reproducibility and as an additional contribution to the community, we make both the STELCO and ASTELCO datasets publicly available, and openly release the implemented code.

Paper Nr: 70
Title:

Evaluating LIME and SHAP in Explaining Malnutrition Classification in Children Under Five

Authors:

Nuru Nabuuso

Abstract: Malnutrition in children under five is a significant public health issue in Uganda, with severe impacts on development and mortality. This paper explores machine learning (ML) models—Support Vector Machines (SVM), eXtreme Gradient Boosting (XGBoost), and Artificial Neural Networks (ANNs)—to predict malnutrition, and reports that XGBoost shows the highest predictive accuracy. While the XGBoost findings employed global model interpretation through permutation-based feature importance, we also introduce SHapley Additive exPlanations (SHAP) for both local and global interpretations. We then focus on SHAP summary plots and bar charts to evaluate feature importance globally. In addition, we report on the comparison between SHAP and Local Interpretable Model-agnostic Explanations (LIME) to analyze the consistency of the local explanations provided by both techniques. By contrasting LIME and SHAP, we advance the alignment between local and global interpretations in the context of XGBoost predictions. This comparison highlights the strengths and limitations of each method. Our findings aim to enhance the transparency of ML models and improve decision-making in child health interventions, providing significant insights into public health and ML interpretability.

Paper Nr: 72
Title:

StylePuncher: Encoding a Hidden QR Code into Images

Authors:

Farhad Shadmand, Luiz Schirmer and Nuno Gonçalves

Abstract: Recent advancements in steganography and deep learning have enabled the creation of security methods for the imperceptible embedding of data within images. However, many of these methods require substantial time and memory during the training and testing phases. This paper introduces a lighter steganography approach (also applicable to watermarking purposes), StylePuncher, designed for encoding and decoding 2D binary secret messages within images. The proposed network combines an encoder utilizing neural style transfer techniques with a decoder based on an image-to-image transfer network, offering an efficient and robust solution. The encoder takes a (512×512×3) image along with a high-capacity 2D binary message containing 4096 bits (e.g., a QR code or a simple grayscale logo) and “punches” the message into the cover image. The decoder, trained using multiple weighted loss functions and noise perturbations, then recovers the embedded message. In addition to demonstrating the success of StylePuncher, this paper provides a detailed analysis of the model’s robustness when exposed to various noise perturbations. Despite its lightweight and fast architecture, StylePuncher achieves notably high decoding accuracy under noisy conditions, outperforming several state-of-the-art steganography models.

Paper Nr: 86
Title:

Principal Direction 2-Gaussian Fit

Authors:

Nicola Greggio and Alexandre Bernardino

Abstract: In this work we address the problem of Gaussian Mixture Model estimation with model selection through coarse-to-fine component splitting. We describe a split rule, denoted Principal Direction 2-Gaussian Fit, that projects mixture components onto 1D subspaces and fits a two-component model to the projected data. Good split rules are important for coarse-to-fine Gaussian Mixture Estimation algorithms, which start from a single component covering the whole data and proceed with successive phases of component splitting followed by EM steps until a model selection criterion is optimized. These algorithms are typically faster than alternatives but depend critically on the component splitting method. The advantage of our approach with respect to other split rules is twofold: (1) it has a smaller number of parameters and (2) it is optimal in 1D projections of the data. Because our split rule provides a good initialization for the EM steps, it promotes faster convergence to a solution. We illustrate the validity of this algorithm through a series of experiments, showing better robustness to the choice of parameters and faster processing than state-of-the-art alternatives, while being competitive in terms of data fit metrics.
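
A minimal sketch of such a split rule under stated assumptions (principal direction from the covariance eigendecomposition, 1D two-component fit via scikit-learn); the paper's exact fitting procedure may differ:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def principal_direction_split(X):
    # Project the component's data onto its first principal direction, fit a
    # two-Gaussian 1D mixture there, and lift the two means back to seed the
    # EM steps that follow the split.
    mu = X.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov((X - mu).T))   # ascending eigenvalues
    d = vecs[:, -1]                                # principal direction
    t = ((X - mu) @ d).reshape(-1, 1)              # 1D projection
    gm = GaussianMixture(n_components=2).fit(t)
    return [mu + m * d for m in gm.means_.ravel()] # two seed means
```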

Paper Nr: 93
Title:

Transformer-Based Geometric Deep Learning for Skeleton Sequence Classification: An Improved Approach

Authors:

Mohamed Amine Mezghich, Mariem Jendoubi, Yassine Assiani and Slim Mhiri

Abstract: This paper presents an improved Geometric Deep Learning Transformer (GDT) architecture for non-Euclidean data classification, with a focus on skeleton-based action recognition. Our model leverages Transformer-based modules to capture both spatial and temporal dynamics in skeleton sequences accurately. Unlike prior approaches, we integrate a manifold learning layer to enhance the understanding of complex geometric patterns in the data, leading to more accurate classification of human actions. Experimental results show that our approach surpasses several state-of-the-art models in 3D human action recognition, as evaluated on benchmark datasets such as NTU RGB+D and NTU RGB+D 120.

Paper Nr: 94
Title:

Data-Centric Optimization of Enrollment Selection in Speaker Identification

Authors:

Long-Quoc Le and Minh-Nhut Ngo

Abstract: In this paper, we introduce a novel method for optimizing enrollment selection in speaker identification systems, with a particular focus on low-resource languages. Unlike traditional approaches that rely on random enrollment samples, our method systematically analyzes pair-wise similarities between enrollment utterances to eliminate poor-quality samples often impacted by noise or adverse environments. By retaining only high-quality and representative utterances, we ensure a more robust speaker profile. This approach, applied to the Vietnam-Celeb dataset using the state-of-the-art ECAPA-TDNN model, delivers substantial performance improvements. Our method boosts accuracy from 73.38% in adverse scenarios to 93.62% and increases the F1-score from 72.91% to 95.48%, demonstrating the effectiveness of quality-driven enrollment selection even in low-resource contexts.
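
A minimal sketch of pair-wise-similarity pruning on speaker embeddings (e.g., ECAPA-TDNN outputs); the scoring and cutoff here are my assumptions, not the paper's exact rule:

```python
import numpy as np

def select_enrollment(embs, keep=5):
    # Score each enrollment utterance by its mean cosine similarity to the
    # speaker's other utterances; drop low scorers (likely noisy outliers).
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = e @ e.T
    np.fill_diagonal(sim, 0.0)
    score = sim.sum(axis=1) / (len(e) - 1)
    return np.argsort(-score)[:keep]     # indices of retained utterances
```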

Paper Nr: 97
Title:

Reconstruction of 3D Brain Structures from Clinical 2D MRI Data

Authors:

Rui Shi, Tsukasa Koike, Tetsuro Sekine, Akio Morita and Tetsuya Sakai

Abstract: As the population ages worldwide, the number of dementia patients increases. Brain MRI is expected to play a crucial role in the prediction of dementia at an early stage. Current 3D brain structure reconstruction methods have strict requirements and need a large number of slice images. Routine clinical MRI files contain far fewer slices, but diagnosis relies heavily on information obtained from MRI scans. In this paper, we propose a method that reconstructs the 3D brain structure from 2D DICOM MRI images within the clinical routine budget, by applying trilinear interpolation. The generated images and structures are evaluated with PSNR and SSIM. The results show that although the details in the generated 2D slices are not ideal, our method is able to reconstruct 3D structures that are highly similar to the original brain structures using only one-fifth of the image slices.
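
A sketch of the core interpolation step as described (linear along the sparse slice axis, i.e., trilinear on the voxel grid), with an assumed array layout:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def densify_volume(slices, z_positions, z_new):
    # Rebuild a dense volume from a sparse stack of 2D slices by (tri)linear
    # interpolation along the slice axis; slices: list of (ny, nx) arrays.
    vol = np.stack(slices, axis=0)                       # (nz, ny, nx)
    _, ny, nx = vol.shape
    interp = RegularGridInterpolator(
        (np.asarray(z_positions), np.arange(ny), np.arange(nx)), vol)
    zz, yy, xx = np.meshgrid(z_new, np.arange(ny), np.arange(nx),
                             indexing="ij")
    return interp(np.stack([zz, yy, xx], axis=-1))       # dense volume
```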

Paper Nr: 109
Title:

Union and Intersection K-Fold Feature Selection

Authors:

Artur J. Ferreira and Mário A. T. Figueiredo

Abstract: Feature selection (FS) is a vast research topic with many techniques proposed over the years. FS techniques may bring many benefits to machine learning algorithms. The combination of FS techniques usually improves the results as compared to the use of one single technique. Recently, the concepts of explainability and interpretability have been proposed in the explainable artificial intelligence (XAI) framework. The recently proposed k-fold feature selection (KFFS) algorithm provides dimensionality reduction and simultaneously yields an output suitable for explainability purposes. In this paper, we extend the KFFS algorithm by performing union and intersection of the individual feature subspaces of two and three feature selection filters. Our experiments performed on 20 datasets show that the union of the feature subsets typically attains better results than the use of individual filters. The intersection also attains adequate results, yielding human manageable (e.g., small) subsets of features, allowing for explainability and interpretability on medical domain data.
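
A sketch of the union/intersection extension as I read it, with a hypothetical filter interface (each filter returns the indices it selects on one fold; KFFS-style voting across folds per filter, then set combination across filters):

```python
from functools import reduce

def combined_kffs(filter_fns, X, y, folds, mode="union", frac=0.5):
    # Per filter: vote over the k folds and keep features selected in at
    # least a fraction `frac` of them; then combine the filters' subsets.
    per_filter = []
    for select in filter_fns:
        votes = {}
        for train_idx, _ in folds:
            for j in select(X[train_idx], y[train_idx]):
                votes[j] = votes.get(j, 0) + 1
        per_filter.append({j for j, v in votes.items()
                           if v >= frac * len(folds)})
    op = set.union if mode == "union" else set.intersection
    return sorted(reduce(op, per_filter))
```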

Paper Nr: 124
Title:

1D-DiffNPHR: 1D Diffusion Neural Parametric Head Reconstruction Using a Single Image

Authors:

Pragati Jaiswal, Tewodros Amberbir Habtegebrial and Didier Stricker

Abstract: In the field of 3D reconstruction, recent developments, especially in face reconstruction, have shown considerable promise. Despite these achievements, many of these techniques depend heavily on a large number of input views and are inefficient, limiting their practicality. This paper proposes a solution to these challenges by focusing on single-view, full 3D head reconstruction. Our approach leverages a 1D diffusion model in combination with RGB image features and a neural parametric latent representation. Specifically, we train a system to learn latent codes conditioned on features extracted from a single input image. At inference, the model directly processes the input image to generate latent codes, which are then decoded into a 3D mesh. Our method achieves high-fidelity reconstructions that outperform state-of-the-art approaches such as 3D Morphable Models, Neural Parametric Head Models, and existing methods for head reconstruction.

Paper Nr: 126
Title:

Multiple Importance Sampling for Stochastic Gradient Estimation

Authors:

Corentin Salaün, Xingchang Huang, Iliyan Georgiev, Niloy Mitra and Gurprit Singh

Abstract: We introduce a theoretical and practical framework for efficient importance sampling of mini-batch samples for gradient estimation from single and multiple probability distributions. To handle noisy gradients, our framework dynamically evolves the importance distribution during training by utilizing a self-adaptive metric. Our framework combines multiple, diverse sampling distributions, each tailored to specific parameter gradients. This approach facilitates the importance sampling of vector-valued gradient estimation. Rather than naively combining multiple distributions, our framework optimally weights data contributions across the multiple distributions. This adaptive combination of multiple importance distributions yields superior gradient estimates, leading to faster training convergence. We demonstrate the effectiveness of our approach through empirical evaluations across a range of optimization tasks like classification and regression on both image and point cloud datasets.
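
For context, the classic fixed alternative to the learned weighting described here is the balance heuristic from multiple importance sampling; a sketch with an assumed distribution interface (.sample, .pdf), shown as a baseline rather than the paper's adaptive scheme:

```python
import numpy as np

def mis_estimate(f, dists, n):
    # Balance-heuristic MIS estimate of the integral of f, drawing n samples
    # from each proposal p_k and weighting by w_k = p_k / sum_j p_j.
    total = 0.0
    for k, dk in enumerate(dists):
        x = dk.sample(n)                              # x ~ p_k
        pdfs = np.stack([d.pdf(x) for d in dists])    # (K, n)
        w = pdfs[k] / pdfs.sum(axis=0)
        total += np.mean(w * f(x) / pdfs[k])
    return total
```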

Paper Nr: 142
Title:

Morphing Between Monotonic Spinner Planar Curves Through Radial-Sign Descriptors

Authors:

Emna Ghorbel and Faouzi Ghorbel

Abstract: This paper introduces an innovative morphing method leveraging Radial-Sign descriptors for monotonic spinner shapes, offering a robust, efficient, and computationally refined solution to shape blending challenges. The method encodes the two shapes using radial distances and angular sign variations relative to their centroids, producing complete, stable, and invertible descriptions. By applying weighted interpolation directly to these two descriptors and reconstructing in-between shapes through an inverse formula, the approach ensures smooth, morphologically coherent transitions while preserving essential geometric properties. Unlike conventional curvature- or registration-based techniques, which often require intensive post-processing or face limitations with significantly different shapes, the proposed method adeptly blends both similar and dissimilar shapes, including those with differing turning numbers, by introducing additional turns in simpler shapes to ensure continuity and coherence.
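
The radial-interpolation core of such a method can be sketched compactly; this omits the angular sign component and the turning-number handling that the descriptor adds, so it is an illustration only:

```python
import numpy as np

def morph_radial(r0, r1, t):
    # r0, r1: radial profiles of the two shapes sampled at the same N polar
    # angles about their centroids; 0 <= t <= 1 blends between them.
    theta = np.linspace(0.0, 2.0 * np.pi, len(r0), endpoint=False)
    r = (1.0 - t) * r0 + t * r1                 # weighted interpolation
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
```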

Paper Nr: 154
Title:

A Framework for Identifying Underspecification in Image Classification Pipeline Using Post-Hoc Analyzer

Authors:

Prabhat Parajuli and Teeradaj Racharak

Abstract: Underspecification — a failure mode where multiple models perform well during development but fail to generalize to new, unseen data due to reliance on insufficient or spurious features — remains a critical challenge in machine learning (ML) and deep learning (DL). In this paper, we focus on a specific aspect of underspecification: the inconsistency in feature learning. We hypothesize that models with similar performance should exhibit consistent behavior in terms of feature reliance. However, in practice — especially in deep learning — this consistency is often lacking due to various factors influencing the learning process. To uncover where this inconsistency occurs, we propose a framework leveraging XAI techniques (specifically LIME and SHAP) to identify underspecification by analyzing inconsistencies across ML pipeline components, including feature extractors, optimizers, and weight initialization. Experiments on MNIST, Imagenette, and Cats vs Dogs reveal significant variability in feature reliance, particularly due to the choice of feature extractor. This variability highlights how different factors contribute to the learning of varied features, ultimately leading to potential underspecification. While this study focuses on the impact of specific pipeline components, our framework can be extended to analyze other factors contributing to underspecification in machine learning systems.

Paper Nr: 155
Title:

Pre-Training Deep Q-Networks Eliminates the Need for Target Networks: An Empirical Study

Authors:

Alexander Lindström, Arunselvan Ramaswamy and Karl-Johan Grinnemo

Abstract: Deep Q-Learning is an important algorithm in the field of Reinforcement Learning for automated sequential decision-making problems. It trains a neural network called the Deep Q Network (DQN) to find an optimal policy. Training is highly unstable, with high variance. A target network is used to mitigate these problems, but it leads to longer training times, high training data demands, and very large memory requirements. In this paper, we present a two-phase pre-trained online training procedure that eliminates the need for a target network. In the first (offline) phase, the DQN is trained using expert actions. Unlike previous literature that tries to maximize the probability of picking the expert actions, we train to minimize the usual squared Bellman loss. Then, in the second (online) phase, it continues to train while interacting with an environment (simulator). We show, empirically, that the target network is eliminated; training variance is reduced; training is more stable; when the duration of pre-training is carefully chosen, the rate of convergence (to an optimal policy) during the online training phase is faster; and the quality of the final policy found is at least as good as that of policies found using traditional methods.
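
A minimal PyTorch sketch of the offline phase as described: minimize the squared Bellman loss on expert transitions while bootstrapping from the same network, with no target network (the batch format is an assumption):

```python
import torch
import torch.nn.functional as F

def offline_bellman_pretrain(qnet, expert_batches, gamma=0.99, lr=1e-3):
    # Each batch: (state, expert_action, reward, next_state, done).
    opt = torch.optim.Adam(qnet.parameters(), lr=lr)
    for s, a, r, s2, done in expert_batches:
        q = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():                   # bootstrap from qnet itself
            target = r + gamma * (1 - done) * qnet(s2).max(dim=1).values
        loss = F.mse_loss(q, target)            # squared Bellman loss
        opt.zero_grad(); loss.backward(); opt.step()
```

The online phase then continues the same update while the agent interacts with the simulator.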

Paper Nr: 159
Title:

Exploring Communication in Multi-Agent Reinforcement Learning Under Agent Malfunction

Authors:

Rafael Pina and Varuna De Silva

Abstract: Multi-Agent Reinforcement Learning (MARL) has grown into one of the most popular methods to tackle complex problems within multi-agent systems. Cooperative multi-agent systems denote scenarios where a team of agents must work together to achieve a certain common objective. Among several challenges studied in MARL, one problem is that agents might unexpectedly start acting abnormally, i.e., they can malfunction. Naturally, this malfunctioning agent will affect the behaviour of the team as a whole. In this paper, we investigate this problem and use the concepts of communication within the MARL literature to analyse how agents are affected when a malfunction happens within the team. We leverage popular MARL methods and build on them a communication module to allow the agents to broadcast messages to each other. Our results show that, while communication can boost learning in normal conditions, it can become redundant when malfunctions occur. We look into the team performances when malfunctions happen and we analyse in detail the patterns in the messages that the agents generate to communicate with each other. We observe that these can be strongly affected by malfunctions and we highlight the need to build appropriate architectures that can still leverage the power of communication in MARL when unexpected malfunctions happen.

Paper Nr: 161
Title:

Extracting and Modeling Tabular Data from Marine Geology Publications into a Heterogeneous Information Network

Authors:

Muhammad Asif Suryani, Ewa Burwicz-Galerne, Brigitte Mathiak, Klaus Wallmann and Matthias Renz

Abstract: Scientific publications serve as a source of disseminating information across research communities, often containing diverse data elements such as plain text, tables, and figures. Tables in particular offer a structured presentation of essential research data, enabling efficient information access. Automatic extraction of tabular data alongside contextual information from scientific publications can significantly enhance research workflows and integrate more research data into the scholarly research cycle, particularly supporting Research Data Management (RDM). In marine geology, researchers conduct expeditions at oceanographic locations and accumulate substantial amounts of valuable data, such as Sedimentation Rate (SR) and Mass Accumulation Rate (MAR), alongside relevant contextual information, often enriched with spatio-temporal context, in the tables of publications. These expeditions are costly and time-intensive, emphasizing the value of making such data more accessible and reusable. This paper introduces an end-to-end approach to extract and model heterogeneous tabular data from marine geology publications. Our approach extracts metadata and tabular content from publications, modeling them into a Heterogeneous Information Network (HIN). The network uncovers hidden relationships and patterns across multiple documents, offering new insights and facilitating enhanced data referencing. Experimental results and exploration on marine geology datasets demonstrate the effectiveness of our approach, showcasing its potential to support research data management and data-driven scientific exploration.

Paper Nr: 17
Title:

Orientation of a Multicopter in Space Based on Recognition of Images of the Earth's Surface

Authors:

Ramin Rzayev, Azer Kerimov and Vagif Aliyev

Abstract: Existing UAV counteraction systems and electronic warfare systems pose a great threat not only to the autopilot of a multicopter, but also to its remote control, which in most cases leads to its loss. One way to prevent such a loss is the ability to return the multicopter to the starting point along the reverse trajectory, based on sequential recognition of images of the Earth's surface recorded by a camera in the form of pictures during the multicopter's flight. Recognition methods based on Wavelet and Fourier transforms of signals, as well as their additive convolution, are used as technical vision tools for the multicopter. The choice of the optimal recognition method is justified by comparing the methods on families of one-dimensional signals artificially created by shifting relative to a base signal, formed after extracting recognition features, segmenting, and linearizing a given image of the Earth's surface.

Paper Nr: 34
Title:

Stance Detection in Twitter Conversations Using Reply Support Classification

Authors:

Parul Khandelwal, Preety Singh, Rajbir Kaur and Roshni Chakraborty

Abstract: During crises, social media platforms like Twitter play a crucial role in disseminating information and offering emotional support. Understanding the conversations among people is essential for evaluating the overall impact of the crisis on the public. In this paper, we focus on classifying the replies to tweets during the “Fall of Kabul” event into three classes: supporting, unbiased, and opposing. To achieve this goal, we proposed two frameworks. We used LSTM layers for sentence/word-level feature extraction for classification. We also employed a BERT-based approach where the texts of both the tweet and the reply are concatenated. Our evaluation on real-world crisis data showed that the BERT-based architecture outperformed the LSTM models. It produced an F1-score of 0.726 for the opposing class, 0.738 for the unbiased class, and 0.729 for the supportive class. These results highlight the robustness of contextualized embeddings in accurately identifying the stance of replies within Twitter conversations through tweet-reply pairs.

Paper Nr: 87
Title:

Boosting Language Models for Real-Word Error Detection

Authors:

Corina Masanti, Hans-Friedrich Witschel and Kaspar Riesen

Abstract: With the introduction of transformer-based language models, research in error detection in text documents has significantly advanced. However, some significant research challenges remain. In the present paper, we aim to address the specific challenge of detecting real-word errors, i.e., words that are syntactically correct but semantically incorrect given the sentence context. In particular, we research three categories of frequent real-word errors in German, viz. verb conjugation errors, case errors, and capitalization errors. To address the scarcity of training data, especially for languages other than English, we propose to systematically incorporate synthetic data into the training process. To this end, we employ ensemble learning methods for language models. In particular, we propose to adapt the boosting technique to language model learning. Our experimental evaluation reveals that incorporating synthetic data in a non-systematic way enhances recall but lowers precision. In contrast, the proposed boosting approach improves the recall of the language model while maintaining its high precision.
Download
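
For intuition, boosting reweights training examples the current ensemble misclassifies. The hedged sketch below applies classical AdaBoost over TF-IDF features to toy verb-conjugation errors; the paper adapts boosting to transformer language models rather than the tree-based learners used here, and the German examples are invented.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy German sentences: label 1 = contains a real-word error, 0 = correct.
# (Illustrative examples, not from the paper's corpus.)
texts = ["Er geht nach Hause.", "Er gehen nach Hause.",
         "Sie liest ein Buch.", "Sie lesen ein Buch."] * 10
labels = [0, 1, 0, 1] * 10

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    AdaBoostClassifier(n_estimators=50))
clf.fit(texts, labels)
print(clf.predict(["Er gehen nach Hause."]))  # -> [1]
```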

Paper Nr: 92
Title:

Machine Learning-Driven Classification of Polyethylene (HDPE, LDPE) via Raman Spectroscopy

Authors:

Evangelos Stergiou, Fotios K. Konstantinidis, C. Stefani, G. Arvanitakis, Georgios Tsimiklis and Angelos Amditis

Abstract: Polymer industries are currently focusing on developing new methods for identifying Polyethylene (PE) categories through rapid and non-destructive characterization techniques (NDT) to improve their production processes or recycling process control. However, NDT-based classification is challenging for PE categories due to their identical chemical structures. This work presents a data-driven method for classifying PE into its two main categories, Low-Density Polyethylene (LDPE) and High-Density Polyethylene (HDPE). The method uses Raman spectroscopy, with the spectra processed to select the features that are decisive for the classification of the different types of PE. PE samples in the form of granules are subjected to spectroscopic measurements, followed by data pre-processing to enhance the signals. The selected spectral features were used to train and validate a Gradient Boosting model, which achieved an accuracy of 97%, indicating the potential of the proposed method for rapid and accurate separation of LDPE and HDPE. This performance is not limited to PE granules but extends to other plastic forms (e.g., film, bottles). This approach offers a rapid method to classify polyethylene types, making it suitable for industrial use.
Download
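
A plausible skeleton of the classification stage, assuming preprocessed spectra arrive as fixed-length feature vectors: scikit-learn's GradientBoostingClassifier is trained on synthetic stand-in spectra, and its feature importances point to the discriminative bands. Labels, dimensions, and the injected band are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for preprocessed Raman spectra (the paper's real data
# comes from PE granules); each row is one spectrum, columns are wavenumber bins.
rng = np.random.default_rng(42)
n, bins = 200, 500
X = rng.standard_normal((n, bins))
y = rng.integers(0, 2, n)          # 0 = LDPE, 1 = HDPE (label coding assumed)
X[y == 1, 240:260] += 1.5          # fake class-discriminative band

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)
model.fit(Xtr, ytr)
print("accuracy:", model.score(Xte, yte))
print("top spectral features:", np.argsort(model.feature_importances_)[-5:])
```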

Paper Nr: 117
Title:

Domain Generalization Using Category Information Independent of Domain Differences

Authors:

Reiji Saito and Kazuhiro Hotta

Abstract: Domain generalization is a technique aimed at enabling models to maintain high accuracy when applied to new environments or datasets (unseen domains) that differ from the datasets used in training. Generally, the accuracy of models trained on a specific dataset (source domain) often decreases significantly when evaluated on different datasets (target domain). This issue arises due to differences in domains caused by varying environmental conditions such as imaging equipment and staining methods. Therefore, we undertook two initiatives to perform segmentation that does not depend on domain differences. We propose a method that separates category information independent of domain differences from the information specific to the source domain. By using information independent of domain differences, our method enables learning the segmentation targets (e.g., blood vessels and cell nuclei). Although we extract information independent of domain differences, this cannot completely bridge the domain gap between training and test data. Therefore, we absorb the domain gap using the quantized vectors in the Stochastically Quantized Variational AutoEncoder (SQ-VAE). In experiments, we evaluated our method on datasets for vascular segmentation and cell nucleus segmentation. Our method improved the accuracy compared to conventional methods.
Download

Paper Nr: 122
Title:

A New Cluster Validation Index Based on Stability Analysis

Authors:

Adane Nega Tarekegn, Bjørnar Tessem and Fazle Rabbi

Abstract: Clustering is a frequently employed technique across various domains, including anomaly detection, recommender systems, video analysis, and natural language processing. Despite its broad application, validating clustering results has become one of the main challenges in cluster analysis. This can be due to factors such as the subjective nature of clustering evaluation, the lack of ground truth in many real-world datasets, and the sensitivity of evaluation metrics to different cluster shapes and algorithms. While there is an extensive literature in this area, developing an evaluation method that is both objective and quantitative is still a challenging task requiring more effort. In this study, we propose a new Clustering Stability Assessment Index (CSAI) that provides a unified and quantitative approach to measure the quality and consistency of clustering solutions. The proposed CSAI validation index leverages a data resampling approach and prediction analysis to assess clustering stability by using multiple features associated with clusters, rather than the traditional centroid-based method. This approach enables reproducibility in data clustering and operates independently of the clustering algorithms used, which makes it adaptable to various methods and applications. To evaluate the effectiveness and generality of the CSAI, we carried out an extensive experimental analysis using various clustering algorithms and benchmark datasets. The obtained results show that CSAI demonstrates competitive performance compared to existing cluster validation indices and effectively measures the quality and robustness of clustering results across multiple samples.
Download
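
A minimal sketch of resampling-based stability in the spirit of the abstract: recluster bootstrap samples and compare the assignments on the resampled points with ARI. CSAI itself goes further, using prediction analysis over multiple cluster-associated features, so this shows only the general principle.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Reference clustering on the full data, then stability over bootstrap samples.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
ref = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
scores = []
for _ in range(20):
    idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap resample
    labels = KMeans(n_clusters=4, n_init=10).fit_predict(X[idx])
    scores.append(adjusted_rand_score(ref[idx], labels))  # agreement on shared points
print("mean stability (ARI):", np.mean(scores))
```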

Paper Nr: 123
Title:

A Concept for Requirements-Driven Identification and Mitigation of Dataset Gaps for Perception Tasks in Automated Driving

Authors:

Mohamed Sabry Moustafa, Maarten Bieshaar, Andreas Albrecht and Bernhard Sick

Abstract: The development of reliable perception machine learning (ML) models is critical for the safe operation of automated vehicles. However, acquiring sufficient real-world data for testing and training these models is not only time-consuming and dependent on chance, but also presents significant risks in safety-critical situations. To address these challenges, we propose a novel requirements-driven, data-driven methodology leveraging state-of-the-art synthetic data generation techniques in combination with tailoring real-world datasets towards task-specific needs. Our approach involves creating synthetic scenarios that are challenging or impossible to capture in real-world environments. These synthetic datasets are designed to enhance existing real-world datasets by addressing coverage gaps and improving model performance in the cases represented by such gaps in the real world. Through a rigorous analysis based on predefined safety requirements, we systematically differentiate between gaps arising from insufficient knowledge about the system's operational design domain (e.g., underrepresented scenarios) and those inherent to the data. This iterative process enables identifying and mitigating coverage gaps, particularly in safety-critical and underrepresented scenarios, leading to local improvements in model performance. By incorporating synthetic data into the training process, our approach effectively mitigates model limitations and contributes to increased system reliability, in alignment with safety standards such as ISO 21448 (SOTIF).
Download

Paper Nr: 151
Title:

Toward Optimized Predictive Maintenance for Vehicle Systems: Deep Learning-Based Anomaly Detection Using CAN Traffic

Authors:

Bournane Abbache, Mawloud Omar and Siham Bouchelaghem

Abstract: This paper introduces a deep learning-based framework for predictive maintenance in vehicle systems using Controller Area Network (CAN) traffic data. Modern vehicles rely heavily on electronic components, making early fault detection crucial for ensuring safety and reliability. We propose an LSTM-based anomaly detection model that identifies irregularities in dynamic vehicle parameters, including speed, engine RPM, steering wheel angle, vehicle suspension height, and headlight position. CAN bus traffic data was meticulously extracted and preprocessed from a real vehicle prototype to train the model, which autonomously detects anomalies and potential failures. Our experimental results demonstrate the model’s effectiveness in capturing temporal dependencies within CAN data, enabling precise anomaly detection to support intelligent predictive maintenance strategies. This proactive approach minimizes downtime, enhances system reliability, and improves vehicle safety. To foster further research and collaboration, we make the generated dataset publicly available, advancing innovation in vehicle diagnostics and anomaly detection.
Download
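
A compact sketch of the forecasting-based detection idea: an LSTM predicts the next multivariate CAN sample, and a large prediction error flags an anomaly. Signal count, window length, and the threshold are assumptions, and the model below is untrained.

```python
import torch
import torch.nn as nn

# Minimal sketch: an LSTM forecasts the next multivariate CAN sample
# (speed, RPM, steering angle, ...); a large prediction error flags an
# anomaly. Dimensions and threshold are illustrative, not the paper's.
class CanForecaster(nn.Module):
    def __init__(self, n_signals=5, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_signals, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_signals)

    def forward(self, x):                 # x: (batch, time, n_signals)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # predict the next sample

model = CanForecaster()
window = torch.randn(1, 50, 5)            # 50 past samples of 5 signals
next_sample = torch.randn(1, 5)
error = torch.mean((model(window) - next_sample) ** 2)
print("anomaly" if error.item() > 0.5 else "normal")  # threshold assumed
```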

Area 2 - Applications

Full Papers
Paper Nr: 22
Title:

Assessment of Training Progression on a Surgical Simulator Using Machine Learning and Explainable Artificial Intelligence Techniques

Authors:

Constantinos Loukas and Konstantina Prevezanou

Abstract: Surgical training on VR simulators provides an efficient education paradigm in laparoscopic surgery. Most methods for skills assessment focus on the analysis of video and kinematic data for self-proclaimed skill classification and technical score prediction. In this paper we evaluate a machine learning (ML) framework for classifying the trainee’s performance with respect to the phase of training progression (beginning vs. end of training and beginning vs. middle vs. end of training). In addition, we leverage techniques from the field of Explainable Artificial Intelligence (XAI) to obtain interpretations on the employed black-box ML classifiers. Three surgical training tasks with significant educational value were selected from a training curriculum followed by 23 medical students. Five machine learning algorithms and two model-agnostic XAI methods were evaluated using performance metrics generated by the simulator during task performance. For all surgical tasks, the accuracy was >84% and >86% in the 2- and 3-class classification experiments, respectively. The XAI methods seem to agree on the relative impact of each performance metric. Features related to hand-eye coordination and bimanual dexterity (e.g. economy of movements, instrument pathlength and number of movements), play the most important role in explaining the classification results.
Download
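
As a stand-in for the model-agnostic XAI analysis, the sketch below computes permutation importance (one common model-agnostic attribution method) on synthetic data; the feature names echo the metrics mentioned above, but the data, model, and resulting importances are invented.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Permutation importance: shuffle one feature at a time and measure the
# drop in score; larger drops mean more important features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
names = ["economy_of_movements", "instrument_pathlength", "n_movements",
         "task_time", "errors"]                 # simulator-like names, invented
clf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=20, random_state=0)
for name, imp in sorted(zip(names, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```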

Paper Nr: 59
Title:

Improving Floating Wind Turbine Stability with Evolutionary Computation for TMD Optimization

Authors:

Thayza Melo, Luciana Faletti Almeida and Juan G. Lazo Lazo

Abstract: Wind turbines convert the kinetic energy of wind into electrical energy, but their installation on land is becoming increasingly complicated due to factors such as wind speed, lower energy generation, environmental, acoustic and visual concerns, and land use. Offshore wind generation, in contrast, offers advantages such as stronger and more constant winds, lower visual and acoustic impact, greater generation capacity, and the possibility of development close to large cities. Offshore wind turbines have great potential to transform the global energy matrix, especially with the use of floating platforms that enable energy generation in deep waters. However, these systems face significant challenges, such as pendular loads and movements induced by winds and waves that cause fatigue in the structure. This work proposes the use of evolutionary computation, through genetic algorithms, to optimize a passive structural control with tuned mass dampers (TMDs) installed in the nacelle of Barge-type Floating Offshore Wind Turbines (FOWTs), aiming to mitigate these pendular effects. The TMDs are configured to act in the fore-aft and side-to-side directions, and the optimization uses the standard deviation of the tower fatigue as the fitness function, in addition to including stroke limits to fit the nacelle dimensions. The optimization was performed under the free-decay condition, i.e., simplified conditions with initial inclinations applied to the platform. The simulations, conducted in the FAST-SC (Fatigue, Aerodynamics, Structures, and Turbulence – Structural Control) software, demonstrated a reduction of more than 36% in the structural fatigue of the tower compared to systems without structural control, and an improvement of more than 11% compared to systems with a unidirectional TMD. The results reinforce the effectiveness of passive structural control with bidirectional TMDs in mitigating vibrations and increasing the reliability of floating offshore turbines, offering an efficient approach to improving the structural reliability of the system.
Download
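
A toy genetic algorithm over three TMD parameters, with a made-up quadratic surrogate standing in for the expensive FAST-SC free-decay simulation that the study actually uses as its fitness evaluation; bounds, operators, and population sizes are illustrative.

```python
import numpy as np

# Toy GA over TMD parameters (mass, stiffness, damping). The surrogate
# fitness below is invented; the real study scores candidates by the
# standard deviation of tower fatigue from FAST-SC simulations.
rng = np.random.default_rng(1)
BOUNDS = np.array([[1e3, 2e4], [1e3, 1e5], [1e2, 1e4]])  # assumed ranges

def fitness(p):  # lower = less tower fatigue (surrogate only)
    target = np.array([8e3, 4e4, 2e3])
    return np.sum(((p - target) / target) ** 2)

pop = rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1], size=(40, 3))
for _ in range(100):
    scores = np.array([fitness(p) for p in pop])
    parents = pop[np.argsort(scores)[:20]]              # truncation selection
    children = (parents[rng.integers(0, 20, 20)] +
                parents[rng.integers(0, 20, 20)]) / 2   # arithmetic crossover
    children += rng.normal(0, 0.02, children.shape) * (BOUNDS[:, 1] - BOUNDS[:, 0])
    pop = np.clip(np.vstack([parents, children]), BOUNDS[:, 0], BOUNDS[:, 1])
print("best TMD parameters:", pop[np.argmin([fitness(p) for p in pop])])
```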

Paper Nr: 74
Title:

An Optimized and Accelerated Object Instance Segmentation Model for Low-Power Edge Devices

Authors:

Diego Bellani, Valerio Venanzi, Shadi Andishmand, Luigi Cinque and Marco Raoul Marini

Abstract: Deep learning, for sustainable applications or in cases of energy scarcity, requires using available, cost-effective, and energy-efficient accelerators together with efficient models. We explore using the Yolact model for instance segmentation, running on a low-power device (e.g., the Intel Neural Compute Stick 2 (NCS2)), to detect and segment specific objects. We modified the Feature Pyramid Network (FPN) and applied pruning techniques to make the model usable for this application. The final model achieves a noticeable result in Frames Per Second (FPS) on the edge device while maintaining a consistent mean Average Precision (mAP).
Download
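
One generic pruning step of the kind applied here, sketched with PyTorch's built-in magnitude pruning; the layer shape and pruning amount are assumptions, not the paper's configuration.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Magnitude (L1) pruning of a convolution layer: the smallest-magnitude
# weights are zeroed, shrinking effective compute for the edge device.
conv = nn.Conv2d(256, 256, kernel_size=3, padding=1)
prune.l1_unstructured(conv, name="weight", amount=0.5)  # zero 50% of weights
prune.remove(conv, "weight")                            # make pruning permanent
sparsity = (conv.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")
```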

Paper Nr: 85
Title:

Domain-Incremental Semantic Segmentation for Autonomous Driving Under Adverse Driving Conditions

Authors:

Shishir Muralidhara, René Schuster and Didier Stricker

Abstract: Semantic segmentation for autonomous driving is an even more challenging task when faced with adverse driving conditions. Standard models trained on data recorded under ideal conditions show a deteriorated performance in unfavorable weather or illumination conditions. Fine-tuning on the new task or condition would lead to overwriting the previously learned information resulting in catastrophic forgetting. Adapting to the new conditions through traditional domain adaptation methods improves the performance on the target domain at the expense of the source domain. Addressing these issues, we propose an architecture-based domain-incremental learning approach called Progressive Semantic Segmentation (PSS). PSS is a task-agnostic, dynamically growing collection of domain-specific segmentation models. The task of inferring the domain and subsequently selecting the appropriate module for segmentation is carried out using a collection of convolutional autoencoders. We extensively evaluate our proposed approach using several datasets at varying levels of granularity in the categorization of adverse driving conditions. Furthermore, we demonstrate the generalization of the proposed approach to similar and unseen domains.
Download

Paper Nr: 89
Title:

Towards Secure Biometric Solutions: Enhancing Facial Recognition While Protecting User Data

Authors:

Jose Silva, Aniana Cruz, Bruno Sousa and Nuno Gonçalves

Abstract: This paper presents a novel approach to the storage of facial images in databases designed for biometric authentication, with a primary focus on user privacy. Biometric template protection encompasses a variety of techniques aimed at safeguarding users’ biometric information. Generally, these methods involve the application of transformations and distortions to sensitive data. However, such alterations can frequently result in diminished accuracy within recognition systems. We propose a deformation process to generate temporary codes that facilitate the verification of registered biometric features. Subsequently, facial recognition is performed on these registered features in conjunction with new samples. The primary advantage of this approach is the elimination of the need to store facial images within application databases, thereby enhancing user privacy while maintaining high recognition accuracy. Evaluations conducted using several benchmark datasets (AgeDB-30, CALFW, CPLFW, LFW, RFW, and XQLFW) demonstrate that our proposed approach preserves the accuracy of the biometric system. Furthermore, it mitigates the necessity for applications to retain any biometric data, images, or sensitive information that could jeopardize users’ identities in the event of a data breach. The solution code, benchmark execution, and demo are available at: https://bc1607.github.io/FRS-ProtectingData.
Download

Paper Nr: 120
Title:

Object-Centric 2D Gaussian Splatting: Background Removal and Occlusion-Aware Pruning for Compact Object Models

Authors:

Marcel Rogge and Didier Stricker

Abstract: Current Gaussian Splatting approaches are effective for reconstructing entire scenes but lack the option to target specific objects, making them computationally expensive and unsuitable for object-specific applications. We propose a novel approach that leverages object masks to enable targeted reconstruction, resulting in object-centric models. Additionally, we introduce an occlusion-aware pruning strategy to minimize the number of Gaussians without compromising quality. Our method reconstructs compact object models, yielding object-centric Gaussian and mesh representations that are up to 96% smaller and up to 71% faster to train compared to the baseline while retaining competitive quality. These representations are immediately usable for downstream applications such as appearance editing and physics simulation without additional processing.
Download

Paper Nr: 148
Title:

Deep Neural Network Architectures for Advanced Hiking Map Generation

Authors:

Olivier Schirm, Maxime Devanne, Jonathan Weber, Arnaud Lecus, Germain Forestier and Cédric Wemmert

Abstract: The automation of hiking map generation using deep learning represents a pivotal advancement in geospatial analysis. This study investigates the application of neural network architectures to derive accurate hiking maps from GPS trajectory data, exclusively collected via the Visorando mobile application. By exploring the utility of 17 distinct raster features derived from geospatial data, we identify the heatmap as the most effective input for mapping intricate trail networks, achieving superior performance across accuracy, segmentation, and connectivity metrics. Among various architectures evaluated, HRNet emerged as the most efficient model, demonstrating exceptional results when combined with optimal input features, significantly outperforming state-of-the-art approaches in intersection detection and trail segmentation. This research introduces a novel framework for converting vector-based GPS traces into rasterized data suitable for convolutional neural networks, overcoming challenges like noisy inputs and terrain variability. The findings establish a new benchmark for efficiency and accuracy in hiking map generation, offering valuable insights for the broader field of automated map inference reliant on GPS data. Future work will explore direct spatio-temporal processing of vector data, eliminating raster conversion for enhanced scalability and precision.
Download

Paper Nr: 156
Title:

Self-Supervised Transformers for Long-Term Prediction of Landsat NDVI Time Series

Authors:

Ido Faran, Nathan S. Netanyahu, Elena Roitberg and Maxim Shoshany

Abstract: Long-term satellite image time-series (SITS) analysis presents significant challenges in remote sensing, especially for heterogeneous Mediterranean landscapes, due to complex temporal dependencies, pronounced seasonality, and overarching global trends. We propose Self-Supervised Transformers for Long-Term Prediction (SST-LTP), a novel framework that combines self-supervised learning, temporal embeddings, and a Transformer-based architecture to analyze multi-decade Landsat data. Our approach leverages a self-supervised pretext task to train Transformers on unlabeled data, incorporating temporal embeddings to capture both long-term trends and seasonal variations. This architecture effectively models intricate temporal patterns, enabling accurate predictions of the Normalized Difference Vegetation Index (NDVI) across diverse temporal horizons. Using Landsat data spanning 1984–2024, SST-LTP achieves a Mean Absolute Error (MAE) of 0.0338 and an R2 value of 0.8337, outperforming traditional methods and other neural network architectures. These results highlight SST-LTP as a robust tool for long-term environmental monitoring and analysis.
Download

Short Papers
Paper Nr: 11
Title:

SMVLift: Lifting Semantic Segmentation to 3D on XR Devices

Authors:

Marcus Valtonen Örnhag, Puren Güler, Anastasia Grebenyuk, Hiba Alqaysi and Tobias Widmark

Abstract: Creating an immersive mixed-reality experience, where virtual objects blend seamlessly into physical environments, requires a careful integration of 3D environmental understanding with the underlying contextual semantics. State-of-the-art methods in this field often rely on large and dense 3D point clouds, which are not feasible for real-time performance in standalone XR headsets. We introduce Sparse Multi-View Lifting (SMVLift), a lightweight 3D instance segmentation method capable of running on constrained hardware, which demonstrates performance on par with or superior to a state-of-the-art method while being significantly less computationally demanding. Lastly, we use the framework in downstream XR applications with satisfactory performance on real hardware.
Download

Paper Nr: 20
Title:

A U-Net-Based Temperature Bias Correction Method for the REMO2015 Regional Climate Model in CORDEX-EA

Authors:

Shibin Zheng, Chenwei Shen and Bin Li

Abstract: Regional climate models suffer from insufficient resolution and deficiencies in their dynamic processes, leading to systematic biases in surface air temperature simulations that require correction. In this research, a deep learning bias correction model, CE-MS-Unet, is proposed. This model incorporates multi-scale residual blocks and calendar month data to improve surface air temperature simulations of the REMO2015 regional climate model during the second phase of the Coordinated Regional Downscaling Experiment East Asia (CORDEX-EA-II) over mainland China. Experimental results indicate that, compared to Linear Scaling, Quantile Delta Mapping, and the deep learning model CU-net, CE-MS-Unet performs better in correcting climate averages and seasonal cycles, resulting in corrected data with greater overall agreement and improved spatial correlation. It effectively reduces biases and provides more accurate climate predictions. This study offers new insights and methods to improve the bias correction of temperature in regional climate models.
Download

Paper Nr: 23
Title:

An Easy-to-Use System for Tracking Robotic Platforms Using Time-of-Flight Sensors in Lab Environments

Authors:

André Kirsch and Jan Rexilius

Abstract: The acquisition of accurate tracking data is a common problem in scientific research. When developing new algorithms and AI networks for the localization and navigation of mobile robots and MAVs, they need to be evaluated against true observations. Off-the-shelf systems for capturing ground truth data often come at a high cost, since they typically include multiple expensive sensors and require a special setup. We see the need for a simpler solution and propose an easy-to-use system for small scale tracking data acquisition in research environments using a single or multiple sensors with possibly already available hardware. The system is able to track mobile robots moving on the ground, as well as MAVs that are flying through the room. Our solution works with point clouds and allows the use of Time-of-Flight based sensors like LiDAR. The results show that the accuracy of our system is sufficient to use as ground truth data with a low-centimeter mean error.
Download

Paper Nr: 33
Title:

Multi-View Skeleton Analysis for Human Action Segmentation Tasks

Authors:

Laura Romeo, Cosimo Patruno, Grazia Cicirelli and Tiziana D’Orazio

Abstract: Human Action Recognition and Segmentation have been attracting considerable attention from the scientific community in the last decades. In the literature, various types of data are used for human monitoring, each with its advantages and challenges, such as RGB, IR, RGBD, and Skeleton data. Skeleton data abstracts away detailed appearance information, focusing instead on the spatial configuration of body joints and their temporal dynamics. Moreover, Skeleton representation can be robust to changes in appearance and viewpoint, making it useful for action segmentation. In this paper, we focus on the use of Skeleton data for human action segmentation in a manufacturing context by using a multi-camera system composed of two Azure Kinect cameras. This work aims to investigate action segmentation performance by using projected skeletons or “synthetic” ones. When one of the cameras fails to provide skeleton data due to occlusion or being out of range, the information coming from the other view is used to fill the missing skeletons. Furthermore, synthetic skeletons are generated from the combination of the two skeletons by considering also the reliability of each joint. Experiments on the HARMA dataset demonstrate the effects of the skeleton combinations on human action segmentation.
Download
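
A minimal sketch of reliability-weighted joint fusion, assuming both skeletons have already been projected into a common reference frame; joint count, coordinates, and confidences are random placeholders.

```python
import numpy as np

# Reliability-weighted fusion of the same joint seen from two cameras.
joints_cam1 = np.random.rand(32, 3)       # 32 joints, xyz, camera 1
joints_cam2 = np.random.rand(32, 3)       # same joints from camera 2
conf1 = np.random.rand(32)                # per-joint reliability, camera 1
conf2 = np.random.rand(32)                # per-joint reliability, camera 2

w1 = conf1 / (conf1 + conf2 + 1e-8)
synthetic = w1[:, None] * joints_cam1 + (1 - w1)[:, None] * joints_cam2
print(synthetic.shape)                    # fused "synthetic" skeleton, (32, 3)

# If one camera loses a joint entirely (confidence 0), the other view
# fills the gap automatically through the weighting.
```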

Paper Nr: 44
Title:

Action Recognition in Law Enforcement: A Novel Dataset from Body Worn Cameras

Authors:

Sameer Hans, Jean-Luc Dugelay, Mohd Rizal Mohd Isa and Mohammad Adib Khairuddin

Abstract: Over the past decade, there has been a notable increase in the integration of body worn cameras (BWCs) in many professional settings, particularly in law enforcement. BWCs serve as valuable tools for enhancing transparency, accountability, and security by providing real-time, first-person perspective recordings of interactions and events. These devices capture vast amounts of video data, which can offer critical insights into the behaviors and actions of individuals in diverse scenarios. This paper aims to explore the intersection of BWCs and action recognition methodologies. We introduce FALEBaction: a multimodal dataset for action recognition using body worn cameras, with actions relevant to BWCs and law enforcement usage. We investigate the methodologies employed in extracting meaningful patterns from BWC footage, the effectiveness of deep learning models in recognizing similar actions, and the potential applications and implications of these advancements. By focusing on actions relevant to law enforcement scenarios, we ensure that our dataset meets the practical needs of the authorities and researchers aiming to enhance public safety through advanced video analysis technologies. The entire dataset can be obtained upon request from the authors to facilitate further research in this domain.
Download

Paper Nr: 52
Title:

Game State and Spatio-Temporal Action Detection in Soccer Using Graph Neural Networks and 3D Convolutional Networks

Authors:

Jérémie Ochin, Guillaume Devineau, Bogdan Stanciulescu and Sotiris Manitsaris

Abstract: Soccer analytics rely on two data sources: the player positions on the pitch and the sequences of events they perform. With around 2000 ball events per game, their precise and exhaustive annotation based on a monocular video stream remains a tedious and costly manual task. While state-of-the-art spatio-temporal action detection methods show promise for automating this task, they lack contextual understanding of the game. Assuming professional players’ behaviors are interdependent, we hypothesize that incorporating surrounding players’ information such as positions, velocity and team membership can enhance purely visual predictions. We propose a spatio-temporal action detection approach that combines visual and game state information via Graph Neural Networks trained end-to-end with state-of-the-art 3D CNNs, demonstrating improved metrics through game state integration.
Download

Paper Nr: 55
Title:

FFAD: Fixed-Position Few-Shot Anomaly Detection for Wire Harness Utilizing Vision-Language Models

Authors:

Powei Liao, Pei-Chun Chien, Hiroki Tsukida, Yoichi Kato and Jun Ohya

Abstract: Anomaly detection in wire harness assembly for automobiles is a challenging task due to the deformable nature of cables and the diverse assembly environments. Traditional deep learning methods require large datasets, which are difficult to obtain in manufacturing settings. To address these challenges, we propose Fixed-Position Few-Shot Anomaly Detection (FFAD), a method that leverages pre-trained vision-language models, specifically CLIP, to perform anomaly detection with minimal data. By capturing images from fixed positions and using position-based learnable prompts and visual augmentation, FFAD can detect anomalies in complex wire harness situations without the need for extensive data collection. Our experiments demonstrated that FFAD achieves over 90% accuracy with fewer than 16 shots per class, outperforming existing few-shot learning methods.
Download
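
A zero-shot flavor of the underlying mechanism, assuming the public openai/clip-vit-base-patch32 checkpoint: an image is scored against normal/defective text prompts via CLIP similarity. FFAD additionally learns position-specific prompts from a few shots, so the hand-written prompts below are only stand-ins.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))      # placeholder for a fixed-position photo
prompts = ["a photo of a correctly assembled wire harness",
           "a photo of a defective wire harness"]
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print("P(normal), P(anomalous):", probs.squeeze().tolist())
```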

Paper Nr: 71
Title:

Speaker Verification Enhancement via Speaking Rate Dynamics in Persian Speechprints

Authors:

Nina Hosseini-Kivanani, Homa Asadi and Christoph Schommer

Abstract: This paper investigates the impact of speaking rate variation on speaker verification using a hybrid feature approach that combines Mel-Frequency Cepstral Coefficients (MFCCs), their dynamic derivatives (delta and delta-delta), and vowel formants. To enhance system robustness, we also applied data augmentation techniques such as time-stretching, pitch-shifting, and noise addition. The dataset comprises recordings of Persian speakers at three distinct speaking rates: slow, normal, and fast. Our results show that the combined model integrating MFCCs, delta-delta features, and formant frequencies significantly outperforms individual feature sets, achieving an accuracy of 75% with augmentation, compared to 70% without augmentation. This highlights the benefit of leveraging both spectral and temporal features for speaker verification under varying speaking conditions. Furthermore, data augmentation improved the generalization of all models, particularly for the combined feature set, where precision, recall, and F1-score metrics showed substantial gains. These findings underscore the importance of feature fusion and augmentation in developing robust speaker verification systems. Our study contributes to advancing speaker identification methodologies, particularly in real-world applications where variability in speaking rate and environmental conditions presents a challenge.
Download
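
A sketch of the spectral-temporal part of the feature set with librosa: MFCCs stacked with their delta and delta-delta derivatives. Formant extraction, which the paper combines with these features, is omitted, and the bundled example clip stands in for a Persian utterance.

```python
import librosa
import numpy as np

# MFCCs plus first (delta) and second (delta-delta) derivatives.
y, sr = librosa.load(librosa.ex("trumpet"))   # stand-in for a speech recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2])   # (39, n_frames) feature matrix
print(features.shape)
```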

Paper Nr: 77
Title:

A Hierarchical Classification for Automatic Assessment of the Reception Quality Using Videos of Volleyball and Deep Learning

Authors:

Shota Nako, Hiroyuki Ogata, Taiji Matsui, Itsuki Hamada and Jun Ohya

Abstract: To automate the assessment of the reception quality in volleyball games, this paper proposes a hierarchical classification method that uses deep learning methods that are trained using single view videos acquired in actual matches and the data recorded manually using Data Volley. The hierarchical classification consists of the three steps: the first step for judging whether the player is in front of (Front) or behind (Back) the net in the court, the second step for discriminating the best quality pass (A-pass) and second best pass (B-pass) vs. the third best pass (C-pass), and the third step for discriminating A-pass vs B-pass. Experiments that compare six class classification with the proposed hierarchical classification were conducted, where the former classifies the six classes: Front A-pass, Front B-pass, Front C-pass and Back A-pass, Back B-pass, Back C-pass. Two TimeSformer models were used as video classification models: TimeSformer-L and TimeSformer-HR. Also, two data sets of different lengths were used. Dataset1 is longer than Dataset2. Different sampling rates were set for each combination of dataset and model. Experimental results demonstrate that the proposed hierarchical classification outperforms the six class classification, clarifying the best combinations of TimeSformer model, Dataset and sampling rate.
Download

Paper Nr: 79
Title:

Surface Tracking in Coherence Scanning Interferometry by B-Spline Model and Teager-Kaiser Operator

Authors:

Fabien Salzenstein, Hassan Mortada, Vincent Mazet and Manuel Flury

Abstract: This work addresses the challenge of surface extraction using a combination of Teager-Kaiser operators and B-splines in the context of coherence scanning interferometry (also known as white-light scanning interferometry, WLSI). Our approach defines a B-spline regularization model along surface profiles, extracting their features by means of parameters that locally describe the fringe signals along the optical axis, whereas most studies are limited to one-dimensional signal extraction. In doing so, we take into account four characteristic parameters under a Gaussian hypothesis. The interest of the proposed strategy lies in processing the layers present in a material, in the context of surfaces with soft roughness. The efficiency of our unsupervised method is illustrated on synthetic as well as on real data.
Download
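
The discrete Teager-Kaiser energy operator is psi[n] = x[n]^2 - x[n-1]*x[n+1]; for an amplitude-modulated fringe a(n)*cos(w*n) it tracks approximately a(n)^2 * sin(w)^2, which is why it localizes the fringe envelope along the optical axis. A minimal sketch with an assumed Gaussian envelope:

```python
import numpy as np

def teager_kaiser(x: np.ndarray) -> np.ndarray:
    """Discrete Teager-Kaiser energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

n = np.arange(1000)
envelope = np.exp(-0.5 * ((n - 500) / 80) ** 2)   # Gaussian fringe envelope (assumed)
signal = envelope * np.cos(0.6 * n)               # fringe signal along the optical axis
energy = teager_kaiser(signal)                    # ~ envelope^2 * sin(0.6)^2
print("envelope peak at n =", 1 + np.argmax(energy))  # near 500
```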

Paper Nr: 82
Title:

A Two-Stage Approach for Wire Harness Cable Description Using 3D Point Clouds for Robotic Manufacturing

Authors:

Takumi Okuyama, Pei-Chun Chien, Hiroki Tsukida, Yoichi Kato and Jun Ohya

Abstract: This paper proposes a two-stage methodology for accurately describing three-dimensional cable positions, aiming to enable robot-based automated cable placement systems to work correctly in wire harness manufacturing. The first stage extracts 3D points on cables from a point cloud acquired with a 3D stereo camera. To extract only the points on cables, we compare two point clouds: one taken before laying the cables on an assembly board (which holds the cables temporarily for taping and additional work) and one taken after laying the cables. We propose a new method to eliminate unnecessary points, such as points on the assembly board and noise, so that only points on the cables remain. In the second stage, cable positions are approximated by a mathematical function, a B-Spline or Bézier curve, interpolating the extracted 3D points. We use smoothness, curve length, and Chamfer Distance as evaluation criteria to assess the fitting quality with respect to the original cable geometry. Experimental results indicate that the B-Spline provides a smooth approximation, while the Bézier curve can represent curves with rapid transitions such as sharp bends. The measured Chamfer Distance is at most a few times the cable radius, demonstrating high fitting accuracy. This approach offers a practical solution for cable recognition in automation contexts, with potential applications in the automotive and manufacturing industries.
Download
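
A hedged sketch of the second stage, fitting a smoothing B-spline to extracted 3D cable points with SciPy and checking a one-sided, chamfer-style distance; the helix point cloud and smoothing factor are stand-ins for real sensor data.

```python
import numpy as np
from scipy.interpolate import splev, splprep

# Fit a smoothing B-spline through extracted 3D cable points, then sample
# it densely; the synthetic helix below stands in for real cable points.
t = np.linspace(0, 4 * np.pi, 120)
points = np.vstack([np.cos(t), np.sin(t), 0.1 * t])      # (3, N) cable points
points += np.random.default_rng(0).normal(0, 0.01, points.shape)  # sensor noise

tck, u = splprep(points, s=0.05)          # s controls smoothing strength
u_fine = np.linspace(0, 1, 500)
curve = np.array(splev(u_fine, tck))      # (3, 500) fitted cable centerline

# One-sided, chamfer-style check: distance of each input point to the curve.
d = np.linalg.norm(points.T[:, None, :] - curve.T[None, :, :], axis=2).min(axis=1)
print("mean fit distance:", d.mean())
```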

Paper Nr: 91
Title:

A Group Activity Based Method for Early Recognition of Surgical Processes Using the Camera Observing Surgeries in an Operating Room and Spatio-Temporal Graph Based Deep Learning Model

Authors:

Keishi Nishikawa and Jun Ohya

Abstract: Towards the realization of the scrub-nurse robot, this paper proposes a group-activity-based method for the early recognition of surgical processes using an early part of the video acquired by a camera observing surgeries. Our proposed method consists of two steps. In the first step, we construct spatio-temporal graphs that represent the group activity in the operating room. Each graph node contains (a) the visual features of the participants and (b) their positions. In the second step, the generated graphs are input to our model for classification. Since each node contains both visual features and position information, we treat the graph as a point cloud in spatio-temporal space. Therefore, the Point Transformer Layer from (Zhao et al., 2021) is used as the building block. Experiments are conducted on a public dataset, (Özsoy et al., 2022)'s mock surgery of knee replacement. The results show that our method performs early recognition with accuracies of 68.2%–90.0% when observing only an early portion, namely 17.1%–34.1% of the entire duration from the beginning, on this dataset. Furthermore, a comparison with the state-of-the-art method (Zhai et al., 2023) in early recognition of group activity is also conducted; ours outperforms (Zhai et al., 2023) significantly.
Download

Paper Nr: 95
Title:

Visualizing Medical Coding Practices Using Transformer Models

Authors:

Tanner Hobson and Jian Huang

Abstract: In the United States, diagnostic codes are a key component of medical records that document the process of patient care. It has long been a common belief that there are inherent orders to the sequences of diagnosis codes in medical records. However, because of the complexities in medical records, there have been few tools that can automatically distill and make sense of the implicit ordering characteristics of the diagnostic codes within medical records. With the advent and fast advancement of the Transformer architecture, in this work we develop and demonstrate a transformer-based model named DgPg. DgPg can automatically learn the patterns in the ordering of diagnostic codes in any given corpus of medical records, for example, those obtained from the same hospital or those from different hospitals but collected and organized around particular clinical scenarios. Using DgPg, we can flexibly visualize the coding patterns and context around any particular diagnostic code. Our results from DgPg further demonstrate that the model learned from one dataset can be unique to that dataset and, in this respect, confirm that medical coding practices have unique dependencies on the provider or the clinical scenarios. Our work uses three well-known datasets: MIT’s MIMIC-IV dataset, and CDC’s NHDS and NHCS datasets. Our DgPg transformer models are only 2.5 MB in size. Such a compact footprint enables flexibility in how the models can be deployed for real-world use.
Download

Paper Nr: 98
Title:

Oil Spill Segmentation Using Deep Encoder-Decoder Models

Authors:

Abhishek Ramanathapura Satyanarayana and Maruf A. Dhali

Abstract: Crude oil is an integral component of the world economy and transportation sectors. With the growing demand for crude oil due to its widespread applications, accidental oil spills are unfortunate yet unavoidable. Even though oil spills are difficult to clean up, the first and foremost challenge is to detect them. In this research, the authors test the feasibility of deep encoder-decoder models that can be trained effectively to detect oil spills remotely. The work examines and compares the results from several segmentation models on high-dimensional satellite Synthetic Aperture Radar (SAR) image data to pave the way for further in-depth research. Multiple combinations of models are used to run the experiments. The best-performing model is the one with the ResNet-50 encoder and DeepLabV3+ decoder. It achieves a mean Intersection over Union (IoU) of 64.868% and an improved class IoU of 61.549% for the “oil spill” class when compared with the previous benchmark model, which achieved a mean IoU of 65.05% and a class IoU of 53.38% for the “oil spill” class.
Download
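
The best-performing combination reported above, sketched with the segmentation_models_pytorch package (an assumption; the paper does not name its implementation). Channel and class counts are also assumptions, loosely following common SAR oil-spill labels such as sea surface, oil spill, look-alike, ship, and land.

```python
import torch
import segmentation_models_pytorch as smp

# ResNet-50 encoder with a DeepLabV3+ decoder, the reported best pairing.
model = smp.DeepLabV3Plus(
    encoder_name="resnet50",
    encoder_weights="imagenet",
    in_channels=3,     # assumed input format for SAR-derived imagery
    classes=5,         # assumed label set
)
logits = model(torch.randn(1, 3, 256, 256))   # per-pixel class logits
print(logits.shape)                           # torch.Size([1, 5, 256, 256])
```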

Paper Nr: 104
Title:

Anomaly Detection Methods for Maritime Search and Rescue

Authors:

Ryan Sime and Rohan Loveland

Abstract: Anomaly detection methods are employed to find swimmers and boats in open water in drone imagery from the SeaDronesSee dataset. The anomaly detection methods include variational autoencoder-based reconstruction loss, isolation forests, and the Farpoint algorithm. These methods are used with both the original feature space of the data and the encoded latent space representation produced by the variational autoencoder. We select six images from the dataset and break them into small tiles, which are ranked by anomalousness by the various methods. Performance is evaluated based on how many tiles must be queried until the first positive tile is found, compared to a random selection method. We find that the reduction in the number of tiles that must be queried can reach factors in the thousands.
Download
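
One of the three methods, sketched end to end: flatten each tile to a feature vector, fit an isolation forest, and query tiles in order of anomalousness. Tile size and the injected bright tile are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Rank image tiles by anomalousness so the most anomalous (e.g., a swimmer
# on otherwise uniform open water) are queried first.
rng = np.random.default_rng(0)
tiles = rng.normal(0.4, 0.05, size=(1000, 16 * 16 * 3))  # mostly uniform sea
tiles[7] += 0.8                                          # one "swimmer" tile

forest = IsolationForest(random_state=0).fit(tiles)
scores = forest.score_samples(tiles)          # lower = more anomalous
ranking = np.argsort(scores)                  # query order
print("first tile to query:", ranking[0])     # -> 7
```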

Paper Nr: 105
Title:

FlowAct: A Proactive Multimodal Human-Robot Interaction System with Continuous Flow of Perception and Modular Action Sub-Systems

Authors:

Timothée Dhaussy, Bassam Jabaian and Fabrice Lefèvre

Abstract: The evolution of autonomous systems in the context of human-robot interaction requires a synergy between the continuous perception of the environment and the potential actions to navigate or interact with it. In this paper we present FlowAct, a proactive multimodal human-robot interaction architecture, working as an asynchronous endless loop from robot sensors to actuators, organized by two controllers, the Environment State Tracking (EST) and the Action Planner. Through a series of real-world experiments, we exhibit the efficacy of the system in maintaining a continuous perception-action loop, substantially enhancing the responsiveness and adaptability of autonomous proactive agents. The modular architecture of the action subsystems facilitates easy extensibility and adaptability to a broad spectrum of tasks and scenarios. The experiments demonstrate the ability of a Pepper robot governed by FlowAct to intervene proactively in laboratory tests and in the field in a hospital waiting room to offer participants various services (appointment management, information, entertainment, etc.).
Download

Paper Nr: 113
Title:

Classification of Oral Cancer and Leukoplakia Using Oral Images and Deep Learning with Multi-Scale Random Crop Self-Training

Authors:

Itsuki Hamada, Takaaki Ohkawauchi, Chisa Shibayama, Kitaro Yoshimitsu, Nobuyuki Kaibuchi, Katsuhisa Sakaguchi, Toshihiro Okamoto and Jun Ohya

Abstract: This paper proposes Multi-Scale Random Crop Self-Training (MSRCST) for classifying oral cancers and leukoplakia using oral images acquired by our dermoscope. MSRCST comprises the following three key modules: (1) Multi-Scale Random Crop, which extracts image patches at various scales from high-resolution images, preserving both local details and global contextual information essential for accurate classification, (2) Selection based on Confidence, which employs a teacher model to assign confidence scores to each cropped patch, selecting only those with high confidence for further training and ensuring that the model focuses on diagnostically relevant features, (3) Iteration of Self-training, which iteratively retrains the model using the selected high-confidence, pseudo-labeled data, progressively enhancing accuracy. In our experiments, we applied MSRCST to classify images of oral cancer and leukoplakia. When combined with MixUp data augmentation, MSRCST achieved an average classification accuracy of 71.71%, outperforming traditional resizing and random cropping methods. Additionally, it effectively reduced misclassification rates, as demonstrated by improved confusion matrices, thereby enhancing diagnostic reliability.
Download

Paper Nr: 121
Title:

Class-Specific Dataset Splitting for YOLOv8: Improving Real-Time Performance in NVIDIA Jetson Nano for Faster Autonomous Forklifts

Authors:

Chaouki Tadjine, Abdelkrim Ouafi, Abdelmalik Taleb-Ahmed and Yassin El Hillali

Abstract: This research examines a class-specific YOLOv8 model setup for real-time object detection using the Logistics Objects in Context dataset, specifically looking at how it can be used in high-speed autonomous forklifts to enhance obstacle detection. The dataset contains five common object classes in logistics warehouses, divided into transporting tools (forklift and pallet truck) and goods-carrying tools (pallet, small load carrier, and stillage) to meet specific task needs. Two YOLOv8 models were individually trained and deployed on the NVIDIA Jetson Nano, each optimized for one tool category. This class-specific approach resulted in a 30.6% decrease in inference time compared to training a single YOLOv8 model on all classes. Task-specific detection saw a 74.4% improvement in inference time for transporting tools and a 56.2% improvement for goods-carrying tools. Furthermore, the technique decreased the hypothetical distance traveled during inference from 45.14 cm to 31.32 cm, and even as low as 11.55 cm for transporting-tool detection, while still preserving detection accuracy with a minor drop of 1.25% in mean average precision. The integration of these models on the NVIDIA Jetson Nano makes this approach suitable for future autonomous forklifts and showcases the potential of the technique to improve industrial automation. This study demonstrates a useful and effective method for real-time object detection in intricate warehouse settings by matching detection tasks with practical needs.
Download

Paper Nr: 135
Title:

Towards Improving Translation Ability of Large Language Models on Low Resource Languages

Authors:

Amulya Ratna Dash and Yashvardhan Sharma

Abstract: With advancements in Natural Language Processing (NLP) and Large Language Models (LLMs), there is a growing need to understand their capabilities with low resource languages. This study focuses on benchmarking and improving the machine translation ability of LLMs for low resource Indic languages. We analyze the impact of training dataset sizes and overfitting by training for additional epochs on translation quality. We use LLaMA-3 as the base model and propose a simple resource efficient model finetuning approach which improves the zero-shot translation performance consistently across eight translation directions.
Download

Paper Nr: 136
Title:

Language-Aware and Language-Agnostic Multilingual Speech Recognition with a Single Model

Authors:

Karol Nowakowski and Michal Ptaszynski

Abstract: In recent years, there has been increasing interest in multilingual speech recognition systems, where a single model can transcribe speech in multiple languages. An additional benefit of multilingual learning is that it allows for cross-lingual transfer, often leading to better performance, especially in low-resource languages. On the other hand, multilingual models suffer from errors caused by confusion between languages. This problem can be mitigated by providing the information about language identity as an additional input to the model. In this research, we carry out experiments using a modern state-of-the-art ASR system architecture based on a pretrained multilingual wav2vec 2.0 model and adapter modules trained for the downstream task, and confirm that multilingual supervised learning with language identifiers is a viable method for improving the system’s overall performance. Furthermore, we find that training with language identifiers still yields a model with better average performance than the model trained without such information, even if language identity is unknown at inference time.
Download

Paper Nr: 42
Title:

A Comparative Analysis of Hyperparameter Effects on CNN Architectures for Facial Emotion Recognition

Authors:

Benjamin Grillo, Maria Kontorinaki and Fiona Sammut

Abstract: This study investigates facial emotion recognition, an area of computer vision that involves identifying human emotions from facial expressions. It approaches facial emotion recognition as a classification task using labelled images. More specifically, we use the FER2013 dataset and employ Convolutional Neural Networks due to their capacity to efficiently process and extract hierarchical features from image data. This research utilises custom network architectures to compare the impact of various hyperparameters - such as the number of convolutional layers, regularisation parameters, and learning rates - on model performance. Hyperparameters are systematically tuned to determine their effects on accuracy and overall performance. According to various studies, the best-performing models on the FER2013 dataset surpass human-level performance, which is between 65% and 68%. While our models did not achieve the best-reported accuracy in literature, the findings still provide valuable insights into hyperparameter optimisation for facial emotion recognition, demonstrating the impact of different configurations on model performance and contributing to ongoing research in this area.
Download

Paper Nr: 43
Title:

Non Contact Stress Assessment Based on Deep Tabular Method

Authors:

Urmila and Avantika Singh

Abstract: In today’s competitive world, stress is a major factor that influences human health negatively. In the long term, stress can lead to serious health problems such as diabetes, depression, anxiety, and various heart diseases. Thus, timely stress recognition is important for efficient stress management. Currently, various wearable devices are used to capture physiological signals for stress assessment. However, these devices, although accurate, are cost-sensitive and require direct physical contact, which may cause discomfort in the long run. In this work we introduce a tabular deep learning architecture for detecting stress by analyzing physiological features. The architecture extracts physiological features from remote photoplethysmography (rPPG) signals computed from facial videos. The proposed architecture is validated on the publicly available UBFC-Phys dataset for two sets of experiments: (i) stress task classification and (ii) multi-level stress classification. For both sets of experiments the proposed methodology outperforms the current state-of-the-art method. The code is available at https://github.com/Heeya2205/Deep_tabular_methods.
Download

Paper Nr: 45
Title:

Anomalous Event Detection in Traffic Audio

Authors:

Minakshee Shukla, Renu M. Rameshan and Shikha Gupta

Abstract: This work focuses on detecting anomalous sound events in traffic audio. The audio is recorded from the microphone associated with a surveillance camera. We defined six anomaly classes and generated synthetic data using real background audio corresponding to Indian traffic sound obtained from a surveillance camera microphone. Using a teacher-student training strategy, we obtained an F1 score of 96.12% and an error rate of 0.06. We also show that even when the event occurs farther away from the microphone, the performance remains impressive, with an F1 score of 92.55% and an error rate of 0.12.
Download

Paper Nr: 46
Title:

Silhouette Segmentation for Near-Fall Detection Through Analysis of Human Movements in Surveillance Videos

Authors:

Imane Bouraoui and Jean Meunier

Abstract: The detection of near-fall incidents is crucial in surveillance systems to improve safety, prevent future, more serious falls and ensure rapid intervention. The main objective of this paper is the detection of movement anomalies in a series of video sequences through silhouette segmentation. First, we isolate the person from the background, keeping only the person's silhouette. This is achieved through two methods: the first is based on the median pixel, while the second uses an algorithm based on a pre-trained Mask Regional Convolutional Neural Network (Mask R-CNN) model. The second step involves movement calculation and noise-effect minimization. Finally, we classify the normal and abnormal movement signals obtained using two different classifiers: a Support Vector Machine (SVM) and an Autoencoder (AE). We then compare the results to determine the most efficient and rapid system for detecting near-falls. The experimental results demonstrate the effectiveness of the proposed approach in detecting near-fall incidents. Specifically, the Mask R-CNN approach outperformed the median-pixel method in silhouette extraction, enhancing anomaly detection accuracy. The AE surpassed the SVM in accuracy and performance, making it suitable for real-time near-fall detection in surveillance applications.
Download

Paper Nr: 47
Title:

Single Hyperspectral Image Super-Resolution Utilizing Implicit Neural Representations

Authors:

Bohdan Perederei and Faisal Z. Qureshi

Abstract: Hyperspectral image super-resolution is a crucial task in computer vision, aiming to enhance the spatial resolution of hyperspectral data while maintaining spectral fidelity. In this paper, we present the highlights and outcomes of our research, in which we developed, explored, and evaluated different techniques and methods based on Implicit Neural Representations (INRs) for Single Hyperspectral Image Super-Resolution. Despite the potential of INRs, their application to hyperspectral image super-resolution remains underexplored, with significant room for further investigation. Our primary goal was to adapt strategies and techniques from models originally developed for multispectral image super-resolution, especially SIREN-based INRs and the Dual Interactive Implicit Neural Network architecture. We also explored feature extraction from hyperspectral images using a convolutional neural network autoencoder, which allowed us to capture spatial-spectral patterns for further enhancement. Furthermore, as part of the research, we validated and compared different functions, such as MSE, RMSE, MAE, PSNR, SAD, SAM, and SSIM, to evaluate their effectiveness as loss functions for training INRs.
Download
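
A minimal SIREN-style building block, using the initialization and the usual omega_0 = 30 default from the SIREN literature, wired as an INR that maps 2D coordinates to a per-pixel spectrum; the band count and layer widths are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

# SIREN-style layer: a linear map followed by a sine activation, with the
# initialization proposed for SIRENs.
class SineLayer(nn.Module):
    def __init__(self, in_f, out_f, omega_0=30.0, first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_f, out_f)
        with torch.no_grad():
            bound = 1 / in_f if first else np.sqrt(6 / in_f) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# INR for a hyperspectral cube: (x, y) coordinates -> full spectrum per pixel
# (200 bands here is an assumption).
net = nn.Sequential(SineLayer(2, 256, first=True),
                    SineLayer(256, 256),
                    nn.Linear(256, 200))
coords = torch.rand(1024, 2) * 2 - 1          # coordinates normalized to [-1, 1]
print(net(coords).shape)                      # (1024, 200) predicted spectra
```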

Paper Nr: 57
Title:

Occlusion Detection for Face Image Quality Assessment

Authors:

Jacob Carnap, Alexander Kurz, Olaf Henniger and Arjan Kuijper

Abstract: The accuracy of 2D-image-based face recognition systems depends on the quality of the compared face images. One factor that affects the recognition accuracy is the occlusion of face regions, e.g., by opaque sunglasses or medical face masks. Being able to assess the quality of captured face images can be useful in various scenarios, e.g., in a border entry/exit system. This paper discusses a method for detecting face occlusions and for measuring the percentage of occlusion of a face using face segmentation and face landmark estimation techniques. The method is applicable to arbitrary face images, not only to frontal or nearly frontal face images. The method was evaluated by applying it to publicly available face image data sets and analyzing the results obtained. The evaluation shows that the proposed method enables the effective detection of face occlusions.
Download
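
A tiny sketch of the measurement itself, assuming a face-region mask and an occluder mask are already available from segmentation and landmark estimation; the rectangles below are toy stand-ins.

```python
import numpy as np

# Occlusion percentage from two binary masks: the face region and the
# occluder (sunglasses, medical mask, etc.).
face_mask = np.zeros((256, 256), dtype=bool)
face_mask[60:200, 70:190] = True              # toy face region
occluder = np.zeros_like(face_mask)
occluder[60:110, 70:190] = True               # toy sunglasses band

occluded = np.logical_and(face_mask, occluder).sum()
percentage = 100.0 * occluded / face_mask.sum()
print(f"occlusion: {percentage:.1f}% of the face region")
```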

Paper Nr: 84
Title:

Improving Classification in Skin Lesion Analysis Through Segmentation

Authors:

Mirco Gallazzi, Anwar Ur Rehman, Silvia Corchs and Ignazio Gallo

Abstract: Deep Learning plays a vital role in medical imaging, especially in classification and segmentation tasks essential for diagnosing diseases from images. However, current methods often struggle to differentiate visually similar classes and accurately delineate lesion boundaries. This study builds on prior findings of classification limitations, investigating whether segmentation can improve classification performance for skin lesion analysis with Transformer-based models. We benchmarked the segmentation capabilities of the Swin Transformer, YOLOv8, and DeepLabV3 architectures on the HAM dataset, which contains 10,015 images across seven skin lesion classes. Swin outperformed others in segmentation, achieving an intersection over union of 82.75%, while YOLOv8 achieved 77.0%. However, classification experiments using classification datasets after segmenting and cropping the lesion of interest did not produce the expected improvements, with classification accuracy showing slight drops in the segmented data. For example, on the original HAM dataset, the model achieved a Test Accuracy (TA) of 84.64%, while Swin trained on segmented data showed a slight decline to a TA of 84.13%. These findings suggest that segmentation alone may not effectively support classification. Based on this, we propose future research into a sequential transfer learning approach, where segmentation knowledge could be progressively transferred to improve classification.
Download

Paper Nr: 88
Title:

Ultrasonic Large Scenario Model (ULSM): Vector Embedding System for Ultrasonic Echo Wave Characteristics

Authors:

Shafait Azam, Mashnunul Huq and Andreas Pech

Abstract: Ultrasonic sensors emitting ultrasound waves can be effectively used in Human Computer Interaction (HCI) to assist visually disabled humans. With the embedding of the sensor echoes into assistive tools, real-time spatial awareness for mobility is enhanced. Moreover, material identification aids object recognition by detecting different materials through their echo signatures. In this article, we study the use of ultrasonic sensors in HCI systems, focusing on their ability to detect materials by analysing ultrasonic wave characteristics. These services aim to improve the autonomy and security of people with visual impairments, offering a complete assistive solution for daily navigation and interaction. We plan to create a vector database for storing the embeddings generated from the reflected waves of various materials and objects. In this work, we propose a precise vector-embedding generation framework for ultrasonic systems using the ResNet50 convolutional neural network. In the future, Generative AI will use these embeddings to serve a range of applications for greater autonomy and safety, providing an assistive travel and interaction solution for the visually impaired.
Download
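
As a minimal sketch of how such embeddings could be produced, assuming the echoes are first rendered as image-like spectrograms (a representational assumption; the paper's exact pipeline may differ), a ResNet50 with its classification head removed yields a 2048-dimensional vector per echo:

import torch
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the ImageNet classifier head
backbone.eval()

@torch.no_grad()
def embed_echo(spectrogram: torch.Tensor) -> torch.Tensor:
    """spectrogram: (3, H, W) image-like representation of an echo signal
    (hypothetical input format). Returns a (2048,) embedding vector suitable
    for insertion into a vector database."""
    return backbone(spectrogram.unsqueeze(0)).squeeze(0)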

Paper Nr: 96
Title:

Predicting Photovoltaic Power Output Using LSTM: A Comparative Study Using both Historical and Climate Data

Authors:

Fereshteh Jafari, Joseph Moerschell and Kaspar Riesen

Abstract: Accurate photovoltaic (PV) power output prediction is important for efficient energy management in solar power systems. This study explores the benefits and limitations of Long Short-Term Memory (LSTM) networks in predicting PV power using three distinct approaches, namely using historical PV power data, climate data, and a combination of both, all with timestamps. The performance of these methods is evaluated across different prediction horizons of 10, 30, and 50 minutes ahead. Additionally, the impact of the sliding window size, representing the amount of past data used for training, is analyzed. The models are trained and tested on a dataset collected over three months from a rooftop PV system in Sion, Switzerland, with a maximum power of 22.2 kW. The Root Mean Square Error and R² metrics are provided to assess the accuracy of each method. The results demonstrate that both the choice of the actual input data and the sliding window size significantly influence the prediction accuracy. In particular, the results presented here show the potential of combining different data sources to improve the accuracy of PV power prediction using LSTM models.
Download
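
A minimal sketch of the sliding-window setup such a study implies (window size, horizon, and model width below are illustrative assumptions, not the paper's settings):

import numpy as np
import torch.nn as nn

def make_windows(series: np.ndarray, window: int, horizon: int):
    """Turn a 1-D power series into (past window, future value) pairs,
    predicting `horizon` steps beyond the end of each window."""
    X, y = [], []
    for t in range(len(series) - window - horizon + 1):
        X.append(series[t:t + window])
        y.append(series[t + window + horizon - 1])
    return np.stack(X)[..., None], np.array(y)   # (N, window, 1), (N,)

class PVForecaster(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                         # x: (N, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1]).squeeze(-1)  # one value per window

Combining data sources would amount to increasing input_size and stacking the climate channels alongside the power channel in the window tensor.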

Paper Nr: 101
Title:

Deep Learning for Effective Classification and Information Extraction of Financial Documents

Authors:

Valentin-Adrian Serbanescu and Maruf Dhali

Abstract: The financial and accounting sectors are encountering increased demands to effectively manage large volumes of documents in today’s digital environment. Meeting this demand is crucial for accurate archiving, maintaining efficiency and competitiveness, and ensuring operational excellence in the industry. This study proposes and analyzes machine learning-based pipelines to effectively classify and extract information from scanned and photographed financial documents, such as invoices, receipts, and bank statements. It also addresses the challenges associated with financial document processing using deep learning techniques. This research explores several models, including LeNet5, VGG19, and MobileNetV2 for document classification and RoBERTa, LayoutLMv3, and GraphDoc for information extraction. The models are trained and tested on financial documents from previously available benchmark datasets and a new dataset with financial documents in Romanian. Results show MobileNetV2 excels in classification tasks (with accuracies of 99.24% with data augmentation and 93.33% without augmentation), while RoBERTa and LayoutLMv3 lead in extraction tasks (with F1-scores of 0.7761 and 0.7426, respectively). Despite the challenges posed by the imbalanced dataset and cross-language documents, the proposed pipeline shows potential for automating the processing of financial documents in the relevant sectors.
Download
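
A minimal sketch of the kind of MobileNetV2 fine-tuning the classification pipeline suggests (the number of document classes and the augmentations are illustrative assumptions):

import torch.nn as nn
import torchvision.models as models
from torchvision import transforms

NUM_CLASSES = 5   # assumption: e.g. invoice, receipt, bank statement, ...

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.last_channel, NUM_CLASSES)  # new head

# Example augmentation pipeline of the sort that could explain the gap
# between the augmented (99.24%) and non-augmented (93.33%) accuracies.
train_tf = transforms.Compose([
    transforms.RandomRotation(5),
    transforms.ColorJitter(brightness=0.2),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])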

Paper Nr: 103
Title:

LostPaw: Finding Lost Pets Using a Contrastive Learning-Based Transformer with Visual Input

Authors:

Andrei Voinea, Robin Kock and Maruf A. Dhali

Abstract: Losing pets can be highly distressing for pet owners, and finding a lost pet is often challenging and time-consuming. An artificial intelligence-based application can significantly improve the speed and accuracy of finding lost pets. To facilitate such an application, this study introduces a contrastive neural network model capable of accurately distinguishing between images of pets. The model was trained on a large dataset of dog images and evaluated through 3-fold cross-validation. Following 350 epochs of training, the model achieved a test accuracy of 90%. Furthermore, overfitting was avoided, as the test accuracy closely matched the training accuracy. Our findings suggest that contrastive neural network models hold promise as a tool for locating lost pets. This paper presents the foundational framework for a potential web application designed to assist users in locating their missing pets. The application will allow users to upload images of their lost pets and provide notifications when matching images are identified within its image database. This functionality aims to enhance the efficiency and accuracy with which pet owners can search for and reunite with their beloved animals.
Download
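
The paper's model is transformer-based; as a generic illustration of the contrastive objective behind such pet matchers (a sketch with hypothetical inputs, not the authors' exact loss), the classic pairwise contrastive loss pulls embeddings of the same pet together and pushes different pets at least a margin apart:

import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                     same_pet: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """emb_a, emb_b: (N, D) embeddings of two image batches.
    same_pet: (N,) float tensor, 1.0 if the pair shows the same pet."""
    d = F.pairwise_distance(emb_a, emb_b)
    pos = same_pet * d.pow(2)                           # same pet: shrink distance
    neg = (1.0 - same_pet) * F.relu(margin - d).pow(2)  # different: enforce margin
    return (pos + neg).mean()

At query time, a lost-pet photo would be matched against the database by nearest-neighbour search over these embeddings.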

Paper Nr: 131
Title:

Multilabel Classification of Otoscopy Images in Deep Learning for Detailed Assessment of Eardrum Condition

Authors:

Antoine Perry, Ilaria Renna, Florence Rossant and Nicolas Wallaert

Abstract: This study presents a ResNet50-based CNN framework for multi-label classification of eardrum images, focusing on a detailed diagnosis of otologic disorders. Unlike prior studies centered on common pathologies, our approach explores less common eardrum conditions using a dataset of 4836 images annotated by two audiologists. The model effectively identifies various pathologies and conditions that can coexist in clinical practice, with a Jaccard score of 0.84, indicating a high level of agreement with the annotations made by an expert. This score notably exceeds the interoperator agreement (0.69) between the two audiologists. This demonstrates not only the model’s accuracy but also its potential as a reliable tool for clinical diagnosis.
Download
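
For reference, the reported Jaccard score for multi-label predictions can be computed per image and averaged, e.g. with scikit-learn (the label matrices below are toy data, not from the paper):

import numpy as np
from sklearn.metrics import jaccard_score

# Rows are images, columns are eardrum conditions (1 = present).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# Mean over images of |labels agreed on| / |labels predicted or annotated|.
print(jaccard_score(y_true, y_pred, average="samples"))   # 0.75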

Paper Nr: 144
Title:

A Study on the Robustness of Object Detectors in Aqua-Farming

Authors:

Rajarshi Biswas, Om Khairate, Mohamed Salman and Dirk Werth

Abstract: In this paper, we study the robustness of state-of-the-art object detectors under transfer learning to detect live fish swimming inside a fish tank. To overcome data limitations, we perform experiments in which we train these detectors with small amounts of annotated data and observe their robustness on out-of-domain data while tracking performance on in-domain test data. We compare YOLOv8l, RTMDet, RT-DETR, SSD-MobileNet and Faster-RCNN for dense object detection on images of fish schools obtained from an aqua-farm and observe their robustness on out-of-domain data from the MS COCO, ImageNet, and Pascal VOC datasets. On the in-domain test set, we achieved the highest detection accuracy of 0.896 mAP with bounding boxes and 0.9214 mAP with instance masks using the YOLOv8l model. However, the same model exhibits a false positive rate of 55.77% on out-of-domain data from the MS COCO dataset. To mitigate false positive predictions, we studied two strategies: (1) re-training the models with out-of-domain data incorporated, and (2) re-training the models by updating only the biases. We found that incorporating out-of-domain data into training leads to the highest reduction in false positive detections; however, this does not guarantee steady, high performance on the in-domain test data.
Download
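
A minimal sketch of the second mitigation strategy, bias-only re-training, assuming a standard PyTorch detector (the abstract does not specify the training details):

import torch.nn as nn

def bias_only_parameters(model: nn.Module):
    """Freeze every weight and keep only bias terms trainable (this also
    includes normalization-layer biases); return the trainable set for
    the optimizer."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith(".bias")
    return [p for p in model.parameters() if p.requires_grad]

Because only a tiny fraction of parameters moves, this strategy adapts the detector to out-of-domain data while limiting drift on the in-domain task.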

Paper Nr: 145
Title:

An OGI Based Unattended Natural Gas Leak Detection System by Utilizing the Power of Machine Learning and Computer Vision

Authors:

Hritu Raj and Gargi Srivastava

Abstract: In a climate-constrained future, reducing natural gas emissions is essential to prevent undermining the environmental benefits of using natural gas over coal. Although optical gas imaging (OGI) is widely used for detecting natural gas leaks, it is often time-consuming and relies on human intervention for leak identification. This study presents an operator-less solution for automated leak detection using convolutional neural networks (CNNs). Our approach utilizes a dataset of natural gas leaks to train a CNN model for automated plume recognition. We begin by gathering 32 video clips labeled with gas leaks from the FLIR dataset, which covers a variety of leak sizes (50-1800 g/h) and video capture distances (4.2-18.3 m). Two background removal techniques were applied to isolate the gas plumes. A modified CNN model, trained with a combination of natural gas and smoke images from Kaggle, was then used to detect the plumes in the video frames. Our trained model was evaluated against other algorithms based on optical flow, showing impressive performance. Our CNN model achieved an accuracy of 99% in detecting medium/large leaks and 94% for small leaks. This approach offers a promising method for high-accuracy natural gas leak identification in real-world OGI assessments.
Download
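
A minimal sketch of one plausible background-removal step, using OpenCV's Gaussian-mixture subtractor (an assumption: the paper's two techniques are not named in the abstract; the file name and parameters are illustrative):

import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=120, detectShadows=False)

cap = cv2.VideoCapture("flir_leak_clip.mp4")   # hypothetical OGI video file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = subtractor.apply(frame)   # moving plume pixels become foreground
    fg = cv2.medianBlur(fg, 5)     # suppress single-pixel sensor noise
    # `fg` (or the frame masked by it) would then be passed to the CNN.
cap.release()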

Paper Nr: 152
Title:

Temporal Pattern Analysis of Baggage Impact on Flight Operations

Authors:

Necip Gozuacik, Adem Tekinbas, Engin Sag, Onur Adiguzel and Sibel Malkos

Abstract: Transportation hubs like airports are increasingly complex due to globalization, resulting in diverse operations and stakeholders with often conflicting objectives. Operational inefficiencies, exacerbated by unpredictable demand and external factors such as weather, can cause economic losses and reduce sustainability. This paper applies a holistic approach combining exploratory data analysis, dataset integration, advanced pre-processing techniques, and temporal pattern analysis. This is prerequisite work for resolving the inefficiencies by centralizing the monitoring and tracking of all operations. Additionally, given the large volume of structured and unstructured data, this study underscores the importance of big data processing. Utilizing NoSQL database technology, specifically Cassandra, enables scalable, high-performance handling of millions of rows of data. The conclusions offer insights into the importance of temporal pattern analysis for future AI developments.
Download
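
A minimal sketch of the kind of Cassandra schema such a workload suggests, with baggage events partitioned by date so that temporal pattern queries scan one partition per day (the keyspace, table, and columns are hypothetical):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("airport_ops")  # hypothetical keyspace
session.execute("""
    CREATE TABLE IF NOT EXISTS baggage_events (
        flight_date date,
        event_time  timestamp,
        flight_id   text,
        bag_tag     text,
        status      text,
        PRIMARY KEY ((flight_date), event_time, bag_tag)
    ) WITH CLUSTERING ORDER BY (event_time ASC)
""")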