| Abstract: |
Modelling hybrid random variables with Bayesian networks (BNs) is a challenging task. The discretisation of continuous variables is widely used due to its simplicity. Conditional Linear Gaussian (CLG) models offer an alternative by assuming that continuous variables follow a multivariate Gaussian distribution, but they do not allow continuous variables to have discrete children in the graph. Mixture of Polynomials (MoP) models, on the other hand, allow discrete and continuous variables to coexist naturally and avoid the restrictive CLG assumption. In previous work, we compared these Conditional Probability Distribution (CPD) representations (CLG, Discrete and MoP) using Naïve Bayes (NB) and Tree-Augmented Naïve Bayes (TAN) classifiers, where structural learning was carried out after a discretisation process.
In this work, we introduce a new approach for classification problems involving hybrid domains. The proposed model, MoP-TAN, is a TAN classifier that uses the MoP CPD representation during the structural learning process. This approach directly links network dependencies to their conditional densities. The MoP-TAN structure is learned following the same philosophy as the original TAN: the mutual information (MI) between two predictors conditioned on the target variable is computed from its explicit formula using MoP CPDs, which does not require any assumption. Parameter estimation is also performed using the MoP CPD formulation.
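For reference, the quantity that drives TAN structure learning can be written out as follows. This is the standard textbook definition of conditional mutual information for two continuous predictors $X_i$ and $X_j$ and a discrete class $C$, not the explicit MoP expression derived in this work; the symbols $X_i$, $X_j$, $C$ and $f$ are notation assumed here for illustration. For a discrete predictor, the corresponding integral would be replaced by a sum.

```latex
I(X_i; X_j \mid C)
  = \sum_{c} P(c) \iint f(x_i, x_j \mid c)\,
    \log \frac{f(x_i, x_j \mid c)}{f(x_i \mid c)\, f(x_j \mid c)}
    \, dx_i \, dx_j
```

In the original TAN procedure, these pairwise values weight the edges of a complete graph over the predictors, and a maximum-weight spanning tree determines the augmenting arcs.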
To assess the performance of the proposed model, we compare MoP-TAN with its discrete and CLG counterparts, including their NB versions, in an empirical study on the problem of classifying duplicated genes in Arabidopsis thaliana. In this problem, the target variable, duplicability, has a frequency distribution of P(Duplicated) = 0.83 and P(Non-Duplicated) = 0.17.
For model validation, 10-fold cross-validation is used along with common classification performance metrics: global accuracy, ROC curves, AUC and F1-score of the minority class. For each combination of structure and CPD representation, four variable selection criteria are considered: global accuracy, precision for the minority class, global accuracy on a balanced training set obtained through under-sampling, and a non-selective approach.
For the selective models, we adopt the following filter-wrapper strategy: predictors are ranked by their MI with the target variable (computed using the corresponding CPD representation) and incorporated sequentially, one at a time. A predictor is retained in the model only if its inclusion improves model performance; otherwise, it is discarded. In total, 24 models are evaluated (2 structures × 3 CPD representations × 4 selection criteria), comparing the effects of structure, CPD representation and selection criterion. For each combination of structure and CPD representation, the variable selection criterion has an impact on model performance.
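The filter-wrapper strategy described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the study: the function name `filter_wrapper_select`, the `mi_with_target` dictionary and the `evaluate` callback are hypothetical, and the sketch assumes that the first ranked predictor is always accepted because no baseline score exists for an empty model.

```python
from typing import Callable, Dict, List, Sequence


def filter_wrapper_select(
    mi_with_target: Dict[str, float],
    evaluate: Callable[[Sequence[str]], float],
) -> List[str]:
    """Greedy filter-wrapper variable selection (illustrative sketch).

    mi_with_target: hypothetical mapping from predictor name to its MI
        with the target variable (the filter step).
    evaluate: hypothetical callback returning the selection criterion
        (e.g. cross-validated accuracy) for a list of predictors
        (the wrapper step).
    """
    # Filter step: rank predictors by decreasing MI with the target.
    ranked = sorted(mi_with_target, key=mi_with_target.get, reverse=True)

    selected: List[str] = []
    # Assumption: no baseline for the empty model, so the first
    # predictor is always retained.
    best_score = float("-inf")

    # Wrapper step: add predictors one at a time, keeping each one
    # only if it strictly improves the criterion.
    for predictor in ranked:
        candidate = selected + [predictor]
        score = evaluate(candidate)
        if score > best_score:
            selected = candidate
            best_score = score
    return selected
```

Here `evaluate` would wrap the training and validation of one structure and CPD combination under a given selection criterion, so the same loop could serve each of the selective models.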
The lowest variability in performance is observed in MoP-TAN models. In general, models based on accuracy maximisation obtain the best results across all performance metrics considered. The results reveal that MoP-TAN outperforms CLG and Discrete models in most pairwise comparisons for the different performance metrics. Overall, the proposed MoP-TAN model is a competitive strategy for classification problems compared with traditional BN classifiers and could be easily adapted to regression problems. However, its main drawback is the computational cost of MoP parameter estimation. |