Abstracts Track 2026


Area 1 - Theory and Methods

Nr: 93
Title:

Comparing MoP-TAN with Discrete and Conditional Linear Gaussian Bayesian Network Classifiers for Gene Duplication

Authors:

Ángel T. Sáez-Ruiz, Ana D. Maldonado, Lorenzo Carretero-Paulet, Aaron Gálvez-Salido and Rafel Rumi

Abstract: Modelling hybrid random variables with Bayesian networks (BNs) is a challenging task. The discretisation of continuous variables is widely used due to its simplicity. Conditional Linear Gaussian (CLG) models offer an alternative by assuming that continuous variables follow a multivariate Gaussian distribution, but they do not allow continuous variables to have discrete children in the graph. On the other hand, Mixtures of Polynomials (MoPs) allow discrete and continuous variables to coexist naturally and avoid the restrictive CLG assumption. In a previous work, we compared the same Conditional Probability Distribution (CPD) representations (CLG, Discrete and MoP) using Naïve Bayes (NB) and Tree-Augmented Naïve Bayes (TAN) classifiers, where structural learning was carried out after a discretisation process. In this new work, we introduce a new approach for classification problems involving hybrid domains. The proposed model, MoP-TAN, is a TAN classifier that uses the MoP CPD representation during the structural learning process, directly linking network dependencies to their conditional densities. The MoP-TAN structure is learned by computing the explicit formula of the mutual information (MI) between two predictors conditioned on the target variable, using the MoP CPD and requiring no distributional assumption, following the same philosophy as the original TAN. Parameter estimation is also performed using the MoP CPD formulation. To assess the performance of the proposed model, we compare MoP-TAN with its discrete and CLG counterparts, including their NB versions, in an empirical study on the problem of classifying duplicated genes in Arabidopsis thaliana. In this problem, the target variable, duplicability, has a frequency distribution of P(Duplicated) = 0.83 and P(Non-Duplicated) = 0.17.
For model validation, 10-fold cross-validation has been used along with common classification performance metrics: global accuracy, ROC curves, AUC and the F1-score of the minority class. For each combination of structure and CPD representation, four variable selection criteria are considered: global accuracy, precision for the minority class, global accuracy on a balanced training set obtained through under-sampling, and a non-selective approach. For the selective models, we adopt the following filter-wrapper strategy: predictors are ranked by their MI with the target variable (computed using the corresponding CPD representation) and incorporated sequentially, one by one. A predictor is retained in the model only if its inclusion improves model performance; otherwise, it is discarded. In total, 24 models are evaluated, comparing the effects of structure, CPD representation and selection criterion. For each combination of structure and CPD representation, the variable selection criterion has an impact on model performance. The lowest performance variability is observed in the MoP-TAN models. In general, models based on accuracy maximisation obtain the best results across all performance metrics considered. The results reveal that MoP-TAN outperforms CLG and Discrete models in most pairwise comparisons for the different performance metrics. Overall, the proposed MoP-TAN model constitutes a competitive alternative to traditional BN classifiers in classification problems and could be easily adapted to regression problems. However, its main drawback is the computational cost of MoP parameter estimation.
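The filter-wrapper selection strategy described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration: `mi_with_target` stands in for the CPD-specific mutual-information computation and `evaluate` for whichever performance criterion is being maximised (e.g. global accuracy); neither name comes from the original work.

```python
def filter_wrapper_select(predictors, mi_with_target, evaluate):
    """Rank predictors by MI with the target, then add them one by one,
    keeping each only if it improves the chosen performance metric."""
    ranked = sorted(predictors, key=mi_with_target, reverse=True)
    selected, best_score = [], evaluate([])
    for x in ranked:
        score = evaluate(selected + [x])
        if score > best_score:  # retain the predictor only on improvement
            selected.append(x)
            best_score = score
        # otherwise the predictor is discarded and the search continues
    return selected
```

In the abstract's setting, this loop would be run once per combination of structure, CPD representation and selection criterion.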

Nr: 96
Title:

tMoPs-Based Density Estimation for Unconstrained Hybrid Bayesian Networks

Authors:

Juan Carlos Luengo, Rafel Rumi and Dario Ramos Lopez

Abstract: Mixtures of Polynomials (MoPs) are frequently used to model continuous variables in hybrid Bayesian networks (BNs). A particular case is the Mixture of Polynomials with tails (tMoP). In this new approach, densities corresponding to continuous variables are fitted using a polynomial and, if necessary, two tails, which are constant polynomials. The use of tails allows a relatively simple approximation of many variables whose probability density near the boundaries of the domain is close to 0, while providing an accurate fit in central areas of higher probability. This approach also avoids regions of negative probability, typical of a standard polynomial fit. The fitting process begins by obtaining balanced kernel density estimation points, reducing the number of points in areas of lower density. Initially, least squares linear regression is used to fit these balanced kernel points, yielding a 10-degree polynomial. When this polynomial has roots within the domain, a constant tail is added from the root value to the endpoint of the domain, avoiding negative values. Once the polynomial and the tails are computed, the density is normalized, so that both non-negativity and normalization are guaranteed. This procedure is repeated with polynomials of decreasing degree, as long as the reductions produce an improvement according to the Bayesian Information Criterion (BIC). This method can also be used for learning conditional probability distributions. In previous tMoP-related works, the number of parents was limited to two (either discrete, continuous or both). A marginal density is estimated as described above for a specific region of the conditioning variables. These regions are obtained as follows: when one or both parents are continuous, a process of dynamic partitioning of the domains is carried out. First, the parents' domains are split into two halves.
Afterwards, the domain of the parents may be split again if the quality of fit increases. Further subdivisions are considered (covering at least 1/8 of the domain), splitting the domain of each parent while the rest remain unchanged. Among these subdivided domains, the best fit according to the BIC is chosen. To reduce complexity, no partition is made when it would create regions containing fewer than 5% of the observations. To test the validity of our proposal, numerous experiments have been carried out using real datasets with different characteristics and two well-known constrained BN structures, naive Bayes and tree-augmented naive Bayes, which have been used in classification and regression tasks. In this work, we extend this approach to unconstrained BNs, removing the limitation of two parents in conditional distributions. This new density estimation method allows us to define a structural learning procedure that creates hybrid BNs with a general structure, where the continuous variables are learnt using tMoPs. To perform this general BN structural learning, a greedy one-step-ahead search is carried out. It starts from an empty network and, in each iteration, evaluates all possible single-step modifications: adding a new edge, or removing or reversing an existing edge. The change that leads to the greatest increase in the global BIC of the network is applied. We present the newly developed learning algorithms and evaluate them with further experiments using hybrid datasets with different characteristics.
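The one-step-ahead structural search described above is a standard greedy hill-climb over edge operations. The sketch below is an illustrative skeleton only, under the assumption that `bic` is a callable returning the global BIC of a candidate edge set; the actual tMoP scoring and parameter fitting are not reproduced here.

```python
from itertools import permutations

def creates_cycle(edges, new_edge):
    """Return True if adding new_edge = (parent, child) closes a directed cycle."""
    adj = {}
    for p, c in list(edges) + [new_edge]:
        adj.setdefault(p, set()).add(c)
    # DFS from the child of new_edge; a path back to its parent means a cycle.
    stack, seen = [new_edge[1]], set()
    while stack:
        node = stack.pop()
        if node == new_edge[0]:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj.get(node, ()))
    return False

def hill_climb(nodes, bic):
    """Greedy one-step-ahead search: start from the empty network and, in each
    iteration, apply the add/remove/reverse move that most increases the BIC."""
    edges = set()
    while True:
        candidates = []
        for p, c in permutations(nodes, 2):
            e = (p, c)
            if e in edges:
                candidates.append(edges - {e})                  # remove edge
                if not creates_cycle(edges - {e}, (c, p)):
                    candidates.append((edges - {e}) | {(c, p)})  # reverse edge
            elif not creates_cycle(edges, e):
                candidates.append(edges | {e})                   # add edge
        best = max(candidates, key=bic, default=edges)
        if bic(best) <= bic(edges):  # stop when no move improves the score
            return edges
        edges = best
```

In the described method, `bic` would internally refit the affected tMoP conditional densities before scoring each candidate network.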

Area 2 - Applications

Nr: 117
Title:

A Research Dataset Recommender System Based on RAG

Authors:

Atsuhiro Takasu, Padipat Sitkrongwong and Pongsakorn Jirachanchaisiri

Abstract: In recent years, data-driven research has progressed in many fields, and the role of research data has become increasingly important. Various research data infrastructures are currently being constructed, and there is a growing need for technology that can search across them effectively for the necessary research datasets (RDSs). While metadata is typically used to search for datasets, such metadata is problematic in that it is not necessarily uniform, with the proportion of missing data varying from one infrastructure to another. Information about RDSs can also be obtained from the academic papers that use them. Academic papers contain multifaceted information about RDSs, such as summaries and usage methods, and the granularity and quality of their descriptions do not depend on the research data infrastructure, so stable and uniform information about RDSs can be expected. This abstract proposes a research dataset recommendation system that utilizes both the metadata of RDSs and the academic papers using them. Recent advances in large language models (LLMs) have been remarkable: LLMs are expected to improve recommendation performance and can be used to explain search results. Therefore, the system is based on the Retrieval-Augmented Generation (RAG) framework, consisting of a research dataset retrieval system and an LLM. The dataset retrieval system consists of an encoder for the metadata of research datasets and an encoder for academic papers summarizing research results using the datasets. A pre-trained Transformer is used for metadata-based representation learning, while academic paper-based representation learning uses a graph neural network that exploits the descriptions of academic papers and a graph structure based on the usage relationships between academic papers and datasets. These embedded representations are stored in a vector database, enabling high-speed retrieval.
Search results are reranked by the LLM, which also generates the reasons why each dataset is recommended. A prompt template is used for this purpose.
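The retrieve-then-rerank flow described above can be illustrated with a minimal sketch. The toy vectors, the `retrieve` helper and the `PROMPT` template are assumptions for illustration; in the actual system the embeddings come from the Transformer (metadata) and graph-neural-network (paper) encoders, and the candidates are passed to an LLM for reranking and explanation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(query_vec, index, k=3):
    """Return the k dataset ids whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda ds: cosine(query_vec, index[ds]), reverse=True)
    return ranked[:k]

# Hypothetical prompt template for the LLM reranking step, which also asks
# for a recommendation reason per candidate dataset.
PROMPT = (
    "Given the research question: {query}\n"
    "Rerank the candidate datasets below and explain why each is recommended:\n"
    "{candidates}"
)
```

A production system would replace the linear scan with a vector database (approximate nearest-neighbour search) for the high-speed retrieval mentioned in the abstract.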

Nr: 118
Title:

Contrastive Learning Network for Zero-Shot Automatic Scoring of Very Short Answers

Authors:

Nam Tuan Ly, Hung Tuan Nguyen and Masaki Nakagawa

Abstract: Most automatic scoring methods for handwritten answers are based on handwriting recognizers. However, handwriting recognizers typically rely on predefined dictionaries, which can produce false positives when handling out-of-vocabulary characters or wrong characters that resemble dictionary entries. This is particularly serious for beginners learning a language. To address this limitation, this paper proposes a Contrastive Learning Network-based automatic scoring method to improve the scoring of single- or few-character answers. The proposed method consists of two main components: a Contrastive Learning Network-based pattern similarity measure and a pattern-similarity-based automatic scoring algorithm. The proposed Contrastive Learning Network is based on the Siamese network architecture, which comprises two identical backbones to calculate the similarity between two input images. It is trained using a contrastive loss function on pairs of input answers and their similarity labels. The pattern-similarity-based automatic scoring algorithm scores an answer as correct, incorrect, or rejected according to its similarity to the expected answer. To train the Contrastive Learning Network, we propose a novel data-sampling method based on a dataset of handwritten answers. Extensive experiments on a collection of handwritten answers from elementary school students, comprising 98,547 Japanese and 15,896 English answers, demonstrate the superiority of the proposed method over previous handwriting-recognition-based methods and its effectiveness for zero-shot automatic scoring, making it useful in low-resource settings. We also conduct ablation studies to evaluate the effects of different backbones and loss functions on the performance of the proposed scorer.
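The training objective and three-way decision rule described above can be sketched as follows. This is a hedged illustration: the standard contrastive loss is shown with plain vectors standing in for the backbone outputs, and the margin and the `accept`/`reject` thresholds are assumptions, not values from the paper.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Standard contrastive loss: pull matching answer pairs together,
    push non-matching pairs at least `margin` apart."""
    d = euclidean(emb_a, emb_b)
    if same:                            # similar pair: penalise distance
        return d ** 2
    return max(0.0, margin - d) ** 2    # dissimilar pair: penalise closeness

def score(dist_to_expected, accept=0.3, reject=0.7):
    """Three-way decision on an answer's distance to the expected answer:
    close enough -> correct, far enough -> incorrect, in between -> rejected."""
    if dist_to_expected <= accept:
        return "correct"
    if dist_to_expected >= reject:
        return "incorrect"
    return "rejected"
```

Because the decision depends only on embedding distances, not on a recognition dictionary, such a scorer can handle characters unseen at training time, which is the zero-shot property highlighted in the abstract.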