The open-source ecosystem is an important component of the modern software industry, and its evaluation has increasingly attracted attention from both academia and industry. However, current open-source evaluation methods face several issues, such as inconsistent evaluation standards, a lack of theoretical support for the evaluation process, and poor comparability of evaluation results. Guided by the foundational theories of evaluatology, this paper proposes a new interdisciplinary research field, Open Source Evaluatology, and constructs a theoretical framework and methodology for evaluating open-source ecosystems. The main contributions of this paper are: (1) based on the five axioms of evaluation theory, a theoretical system for Open Source Evaluatology is developed, and the basic concepts, evaluation dimensions, and evaluation standards for the open-source ecosystem are proposed; (2) an evaluation conditions (EC) framework is designed, encompassing five levels: problem definition, task instances, algorithm mechanisms, implementation examples, and supporting systems, and a combined evaluation model (EM) based on statistical metrics and network metrics is introduced; (3) experimental validation on a GitHub dataset shows that the proposed evaluation framework effectively assesses diverse features of open-source projects, developers, and communities, and it has been verified in multiple practical application scenarios. The research demonstrates that Open Source Evaluatology provides standardized theoretical guidance and methodological support for open-source ecosystem evaluation, can be widely applied in scenarios such as open-source project selection, developer evaluation, and community management, and plays a significant role in promoting the healthy and sustainable development of open-source ecosystems.
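As a rough illustration of how a combined evaluation model might blend statistical metrics with network metrics, the following minimal Python sketch scores toy projects by normalizing simple activity counts and combining them with PageRank centrality on a developer-project collaboration graph. The field names, equal weights, and toy data are illustrative assumptions, not the paper's actual EM.

```python
# Minimal, hypothetical sketch of a combined evaluation model (EM):
# statistical activity metrics blended with a network centrality metric.
# Field names, weights, and toy data are illustrative assumptions.
import networkx as nx

projects = {
    "proj_a": {"stars": 1200, "commits": 3400},
    "proj_b": {"stars": 300, "commits": 900},
}
# Hypothetical developer-project collaboration edges.
edges = [("dev_1", "proj_a"), ("dev_2", "proj_a"),
         ("dev_2", "proj_b"), ("dev_3", "proj_b")]

graph = nx.Graph(edges)
pagerank = nx.pagerank(graph)  # network metric: centrality in the collaboration graph

def normalize(values):
    """Min-max normalize a dict of raw metric values to [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in values.items()}

stars = normalize({p: m["stars"] for p, m in projects.items()})
commits = normalize({p: m["commits"] for p, m in projects.items()})
centrality = normalize({p: pagerank[p] for p in projects})

for p in projects:
    # Equal weights chosen arbitrarily for illustration.
    score = (stars[p] + commits[p] + centrality[p]) / 3
    print(f"{p}: combined score = {score:.3f}")
```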
COADBench: A benchmark for revealing the relationship between AI models and clinical outcomes
Alzheimer’s disease (AD), due to its irreversible nature and the severe social burden it causes, has garnered significant attention from AI researchers. Numerous auxiliary diagnostic models have been developed with the aim of improving AD diagnostic services and thereby reducing the social burden. However, because the clinical value of these models has not been validated, no AD diagnostic model has been widely accepted by clinicians or officially approved for use in enhancing AD diagnostic services. The clinical value of traditional medical devices is validated through rigorous randomized controlled trials that prove their impact on clinical outcomes. In contrast, current AD diagnostic models are validated only for their accuracy, and the relationship between these models and patient outcomes remains unknown. This gap has hindered the acceptance and clinical use of AD diagnostic models by healthcare professionals. To address this issue, we introduce COADBench, a benchmark centered on clinical outcomes for evaluating the clinical value of AD diagnostic models. COADBench curates subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database who have at least two cognitive score records (the most commonly used clinical endpoint in AD clinical trials) from different follow-up visits. To the best of our knowledge, it is the first benchmark to link subjects’ cognitive scores with model performance, using the cognitive scores as post-intervention clinical outcomes against which the models are evaluated. By benchmarking current mainstream AD diagnostic algorithms with COADBench, we find no significant correlation between subjects’ cognitive improvement and model performance, which indicates that the current performance evaluation criteria for mainstream AD diagnostic algorithms are not aligned with clinical value.
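To make the outcome-centered evaluation concrete, the sketch below shows the kind of analysis the abstract implies: per-subject cognitive change between two follow-up visits correlated with per-subject model correctness. The subject records, score fields, and the choice of Spearman correlation are illustrative assumptions, not the benchmark's actual protocol.

```python
# Hypothetical sketch: correlate per-subject cognitive improvement with
# per-subject diagnostic-model performance. Data and field names are assumed.
from scipy.stats import spearmanr

# Each record: cognitive scores at two follow-up visits plus the model's
# per-subject correctness (1 = correct diagnosis, 0 = incorrect).
subjects = [
    {"score_visit1": 24, "score_visit2": 27, "model_correct": 1},
    {"score_visit1": 22, "score_visit2": 21, "model_correct": 1},
    {"score_visit1": 26, "score_visit2": 26, "model_correct": 0},
    {"score_visit1": 20, "score_visit2": 25, "model_correct": 0},
]

improvement = [s["score_visit2"] - s["score_visit1"] for s in subjects]
correctness = [s["model_correct"] for s in subjects]

rho, p_value = spearmanr(improvement, correctness)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
# A non-significant rho would mirror the paper's finding that model accuracy
# is not linked to subjects' cognitive outcomes.
```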
Evaluating long-term usage patterns of open source datasets: A citation network approach
Jiaheng Peng, Fanyu Han, Wei Wang
Abstract
The evaluation of datasets serves as a fundamental basis for tasks in evaluatology. Evaluating the usage patterns of datasets has a significant impact on the selection of appropriate datasets. Many renowned Open Source datasets are well established and have not been updated for many years, yet they continue to be widely used by a large number of researchers. Because of this characteristic, conventional Open Source metrics (e.g., numbers of stars, issues, and activity) derived from the activity logs of their GitHub repositories are insufficient for evaluating their long-term usage patterns.
Researchers often encounter significant challenges in selecting appropriate datasets due to the lack of insight into how these datasets are being utilized. To address this challenge, this paper proposes establishing a connection between Open Source datasets and the citation networks of their corresponding academic papers. By mining the citation network of the corresponding academic paper, we can obtain rich graph-structured information, such as citation times, authors, and more. Utilizing this information, we can evaluate the long-term usage patterns of the associated Open Source dataset.
Furthermore, this paper conducts extensive experiments based on five major dataset categories (Texts, Images, Videos, Audio, Medical) to demonstrate that the proposed method effectively evaluates the long-term usage patterns of Open Source datasets. Additionally, the insights gained from the experimental results can serve as a valuable reference for future researchers in selecting appropriate datasets for their work.
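As an illustration of the kind of citation-network mining this approach relies on, the sketch below builds a small citation graph around a dataset's reference paper and tallies directly citing papers per year as a proxy for long-term usage. The paper identifiers, years, and edges are hypothetical; the actual pipeline and data sources are those described in the paper.

```python
# Hypothetical sketch: estimate a dataset's long-term usage from the citation
# network of its reference paper. Node names, years, and edges are assumed.
from collections import Counter
import networkx as nx

dataset_paper = "dataset_paper"
# Directed edges: citing_paper -> cited_paper.
citations = [
    ("paper_2016a", dataset_paper), ("paper_2016b", dataset_paper),
    ("paper_2020a", dataset_paper), ("paper_2023a", dataset_paper),
    ("paper_2023b", "paper_2020a"),  # second-order citation, not counted below
]
publication_year = {
    "paper_2016a": 2016, "paper_2016b": 2016,
    "paper_2020a": 2020, "paper_2023a": 2023, "paper_2023b": 2023,
}

graph = nx.DiGraph(citations)
# Papers that directly cite the dataset's reference paper.
direct_citers = list(graph.predecessors(dataset_paper))
usage_by_year = Counter(publication_year[p] for p in direct_citers)

for year in sorted(usage_by_year):
    print(f"{year}: {usage_by_year[year]} citing papers")
# A flat or growing curve over many years suggests sustained long-term usage
# even when the dataset's repository shows little recent activity.
```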
Advanced Deep Learning Models for Improving Movie Rating Predictions: A Benchmarking Study
Manisha Valera, Dr. Rahul Mehta
Abstract
Predicting movie ratings precisely has become a vital aspect of personalized recommendation systems, which requires robust and high-performing models. To evaluate effectiveness in predicting movie ratings, this study conducts a comprehensive performance analysis of various deep learning architectures, including BiLSTM, CNN + LSTM, CNN + GRU, CNN + Attention, CNN, VAE, Simple RNN, GRU + Attention, Transformer Encoder, FNN, and ResNet. Each model’s performance is evaluated on a movie-review dataset, enhanced with sentiment scores and user ratings, using a range of evaluation metrics: Mean Absolute Error (MAE), R² score, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Explained Variance. The results highlight distinct strengths and weaknesses among the models, with the VAE model consistently delivering superior accuracy, while attention-based models show notable improvements in interpretability and generalization. This analysis offers important insights into choosing models for movie recommendation systems and highlights the balance between prediction accuracy and computational efficiency. The findings from this study serve as a benchmark for future developments in movie rating prediction, supporting researchers and practitioners in improving recommendation system performance.
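The metrics named above are standard regression measures; as a small self-contained reference, the following sketch computes them with scikit-learn on placeholder predictions. The arrays are toy values, not results from the study.

```python
# Minimal sketch of the evaluation metrics used in the study, computed with
# scikit-learn on placeholder ratings (not the study's data or results).
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    explained_variance_score,
)

y_true = np.array([4.0, 3.5, 5.0, 2.0, 4.5])   # ground-truth ratings
y_pred = np.array([3.8, 3.9, 4.6, 2.4, 4.4])   # a model's predicted ratings

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
evs = explained_variance_score(y_true, y_pred)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}  ExplVar={evs:.3f}")
```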
AI-powered Mathematical Sentiment Model and graph theory for social media trends
M. VENKATACHALAM, R. VIKRAMA PRASAD
Abstract
Significant issues have arisen as a result of the global spread of monkeypox, such as the extensive transmission of false information, public fear, and stigmatization on social media. These problems frequently lead to increased fear, prejudice, stigmatization of minority groups, and opposition to public health initiatives. Furthermore, health authorities are unable to provide correct information and act promptly because efficient methods for analyzing the enormous amounts of unstructured social media data are lacking. This disparity weakens crisis management initiatives and increases public skepticism of health guidance. To address these issues, this study examines the sentiment around monkeypox on social media to pinpoint public concerns, counter false information, and enhance communication strategies. By fusing graph theory with AI-driven sentiment analysis, the study aims to improve public comprehension, offer practical insights, and help health authorities manage the outbreak. To facilitate semantic analysis of tweets through structured information extraction, graph theory is used to organize unstructured or semi-structured data by creating meaningful links between entities. Furthermore, opinions on monkeypox infection on social media are analyzed and user sentiments are detected using a reinforcement Markov decision process. According to the experimental results, the proposed model achieved 98% accuracy on the Monkeypox tweet dataset. These results help raise awareness of monkeypox among the general population and promote an informed and resilient social response.
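As a rough sketch of how graph theory can impose structure on unstructured tweets, the example below links entities (keywords and hashtags) that co-occur in the same tweet into a small weighted graph; the sentiment scoring and the reinforcement Markov decision process described above are beyond this toy example. The tweet texts and keyword list are hypothetical.

```python
# Hypothetical sketch: organize unstructured tweets into an entity graph so
# that related entities can be linked and analyzed. Tweets/keywords are assumed.
import itertools
import networkx as nx

tweets = [
    "Monkeypox cases rising, please follow #WHO guidance on vaccination",
    "Misinformation about monkeypox vaccination is spreading fast",
    "Stigma around monkeypox hurts public health communication",
]
keywords = {"monkeypox", "vaccination", "misinformation", "stigma", "#who"}

graph = nx.Graph()
for tweet in tweets:
    tokens = {t.strip(",.").lower() for t in tweet.split()}
    entities = sorted(tokens & keywords)
    # Connect entities that co-occur in the same tweet, accumulating weights.
    for a, b in itertools.combinations(entities, 2):
        weight = graph.get_edge_data(a, b, {"weight": 0})["weight"] + 1
        graph.add_edge(a, b, weight=weight)

# Entities most central to the conversation (by weighted degree).
for node, degree in sorted(graph.degree(weight="weight"), key=lambda x: -x[1]):
    print(node, degree)
```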
Patrick Star: A comprehensive benchmark for multi-modal image editing
Di Cheng, ZhengXin Yang, ChunJie Luo, Chen Zheng, YingJie Shi
Abstract
Generative image editing enhances and automates traditional image design methods. However, there is a significant imbalance in existing research: the development of sketch-guided and example-guided image editing has not been explored as thoroughly as text-guided image editing, despite the former being equally important in real-world applications. The leading cause of this imbalance is the severe lack of corresponding benchmark datasets. To address this issue, this paper proposes a comprehensive and unified benchmark dataset, Patrick Star, consisting of approximately 500 test images, to promote balanced development in this field across multi-task and multi-modal settings. First, a theoretical analysis grounded in Evaluatology highlights the importance of establishing a balanced benchmark dataset to advance research in image editing. Building on this theoretical foundation, the dataset’s construction methodology is explained in detail, ensuring it addresses critical gaps in existing studies. Next, statistical analyses are conducted to verify the dataset’s usability and diversity. Finally, comparative experiments underscore the dataset’s potential as a comprehensive benchmark, demonstrating its capacity to support balanced development in image editing.
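As a minimal sketch of one way the balance of such a benchmark could be checked statistically, the example below counts test items per guidance modality in a hypothetical manifest and computes a normalized-entropy balance score. The manifest entries, category names, and metric choice are assumptions for illustration, not Patrick Star's actual analysis.

```python
# Hypothetical sketch: a simple balance/diversity check over a benchmark
# manifest, counting items per guidance modality and computing normalized
# entropy (1.0 = perfectly balanced). Entries and categories are assumed.
import math
from collections import Counter

manifest = [
    {"id": 1, "guidance": "text"}, {"id": 2, "guidance": "sketch"},
    {"id": 3, "guidance": "example"}, {"id": 4, "guidance": "text"},
    {"id": 5, "guidance": "sketch"}, {"id": 6, "guidance": "example"},
]

counts = Counter(item["guidance"] for item in manifest)
total = sum(counts.values())
probs = [c / total for c in counts.values()]
entropy = -sum(p * math.log(p) for p in probs)
balance = entropy / math.log(len(counts))

print(dict(counts), f"balance = {balance:.2f}")
```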