Statistical Data Science Research Group
We work on symbolic data analysis, imbalanced classification, computational statistics, statistical data science, and applied data science research. I am open to supervising students; please reach out via ahmadhakiim@upm.edu.my.
Symbolic Data Analysis
- Beranger (2023) proposed a Symbolic Data Analysis (SDA) approach that models the underlying classical data directly and treats symbolic random variables as summaries of those data, which addresses the challenges posed by large, complex datasets. Whereas Le-Rademacher and Billard (2011) and Brito and Silva (2012) relied on assumptions about statistics of the symbolic data, such as mid-points and log-ranges, Zhang, Beranger, and Sisson (2020) and Beranger (2023) developed a likelihood-based framework for fitting models at the classical data level, even when only symbolic data are observed. My group extends this work by focusing on optimal designs for symbolic data and by applying SDA to statistical methods, including mixture models, to enhance their practical utility.
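As a rough illustration of fitting at the classical data level when only interval summaries are available, the sketch below assumes each symbol is a (min, max, n) triple summarising n i.i.d. normal observations. It uses the standard order-statistics joint density of the minimum a and maximum b, n(n-1)[F(b)-F(a)]^(n-2) f(a) f(b), as each interval's likelihood contribution. The function names, the normal model, and the grid search are illustrative assumptions, not the implementation used in the cited papers.

```python
import math
import random


def normal_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def summarise(groups):
    """Reduce each group of classical observations to a (min, max, n) symbol."""
    return [(min(g), max(g), len(g)) for g in groups]


def interval_loglik(intervals, mu, sigma):
    """Log-likelihood of (min, max, n) symbols under an i.i.d. normal model.

    Assumes n >= 2 for every interval.
    """
    total = 0.0
    for a, b, n in intervals:
        mass = normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
        if mass <= 0.0:
            return -math.inf
        total += (math.log(n) + math.log(n - 1)
                  + (n - 2) * math.log(mass)
                  + normal_logpdf(a, mu, sigma)
                  + normal_logpdf(b, mu, sigma))
    return total


def fit_by_grid(intervals, mus, sigmas):
    """Crude maximum likelihood: pick the best (mu, sigma) pair on a grid."""
    candidates = ((interval_loglik(intervals, m, s), m, s)
                  for m in mus for s in sigmas)
    _, mu_hat, sigma_hat = max(candidates, key=lambda t: t[0])
    return mu_hat, sigma_hat


# Simulated classical data (hypothetical): 50 groups of 30 draws from N(2, 1.5^2).
rng = random.Random(1)
groups = [[rng.gauss(2.0, 1.5) for _ in range(30)] for _ in range(50)]
intervals = summarise(groups)  # only these symbols are used for fitting

mus = [0.05 * i for i in range(81)]           # 0.00 to 4.00
sigmas = [0.5 + 0.05 * i for i in range(51)]  # 0.50 to 3.00
mu_hat, sigma_hat = fit_by_grid(intervals, mus, sigmas)
print(mu_hat, sigma_hat)
```

The estimates target the parameters of the classical-level model even though the individual observations are discarded after summarising. In practice a numerical optimiser would replace the grid search.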
Imbalanced Classification
- Class imbalance occurs when the distribution of classes in a dataset is heavily skewed, so that standard machine learning algorithms favor the majority class and predict the minority class poorly, with severe consequences in domains such as medical diagnosis and fraud detection. Compounding this issue are intrinsic data difficulty factors such as class overlap, small disjuncts, and noise, which interact non-linearly and amplify learning challenges beyond what the imbalance ratio alone would suggest. Our research focuses on resampling methods (oversampling, undersampling, and hybrid techniques). We are developing a meta-learning framework that uses comprehensive dataset meta-features (e.g., complexity metrics) to recommend optimal, dataset-specific resampling strategies tailored to user-defined objectives such as F1-score or G-mean, removing the need for trial-and-error and improving adaptability in imbalanced classification.
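To make the resampling and evaluation terms above concrete, here is a minimal sketch of random oversampling together with the F1-score and G-mean computed from a binary confusion matrix. The helper names `random_oversample` and `f1_and_gmean` and the toy labels are hypothetical. This is not the group's meta-learning framework, only an illustration of one simple resampling strategy and the two objectives mentioned.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        indices = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - count):
            j = rng.choice(indices)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


def f1_and_gmean(y_true, y_pred, positive=1):
    """Return (F1-score, G-mean) for a binary problem with the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Toy data (hypothetical): 8 majority samples (label 0), 2 minority samples (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))

# A classifier that always predicts the majority class is 80% accurate here,
# yet both F1 and G-mean are zero because no minority case is detected.
print(f1_and_gmean(y, [0] * 10))
```

Both metrics penalise a model that ignores the minority class, which is why they are common objectives when comparing resampling strategies.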
Honours: 2025 - Nur Zafnazuhani Jailani
Computational Statistics
- Computational statistics involves developing and applying computational methods to analyze and interpret complex data, often using algorithms and simulations to solve statistical problems. It encompasses techniques such as Monte Carlo methods, machine learning, and data visualization to extract insights from large datasets. By leveraging high-performance computing, it enables efficient processing of statistical models for real-world applications. My group focuses on advancing these techniques, primarily using R and Python, to tackle sophisticated data analysis challenges.
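As a small example of the Monte Carlo methods mentioned above, the sketch below estimates an expectation by simulation and reports a standard error. The function name `monte_carlo_mean` and the choice of target, E[X^2] for a standard normal X (whose exact value is 1), are illustrative assumptions.

```python
import math
import random


def monte_carlo_mean(g, sampler, n, seed=0):
    """Estimate E[g(X)] from n simulated draws; return (estimate, standard error)."""
    rng = random.Random(seed)
    values = [g(sampler(rng)) for _ in range(n)]
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Estimate E[X^2] for X ~ N(0, 1); the exact value is 1.
estimate, std_error = monte_carlo_mean(
    g=lambda x: x * x,
    sampler=lambda rng: rng.gauss(0.0, 1.0),
    n=100_000,
)
print(estimate, std_error)
```

The standard error shrinks at rate 1/sqrt(n), so halving it requires roughly four times as many draws.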
