BenchCouncil Transactions on Benchmarks, Standards and Evaluations

Volume 4, Issue 2, in progress (June 2024)

Editorial


A short summary of evaluatology: The science and engineering of evaluation

Jianfeng Zhan


Abstract

Evaluation is a crucial aspect of human existence and plays a vital role in each field. However, it is often approached in an empirical and ad-hoc manner, lacking consensus on universal concepts, terminologies, theories, and methodologies. This lack of agreement has significant consequences. This article aims to formally introduce the discipline of evaluatology, which encompasses the science and engineering of evaluation. The science of evaluation addresses the fundamental question: “Does any evaluation outcome possess a true value?” The engineering of evaluation tackles the challenge of minimizing costs while satisfying the evaluation requirements of stakeholders. To address the above challenges, we propose a universal framework for evaluation, encompassing concepts, terminologies, theories, and methodologies that can be applied across various disciplines, if not all disciplines.

This is a short summary of Evaluatology (Zhan et al., 2024). The objective of this revised version is to alleviate the readers’ burden caused by the length of the original text. Compared to the original version (Zhan et al., 2024), this revised edition clarifies various concepts like evaluation systems and conditions and streamlines the concept system by eliminating the evaluation model concept. It rectifies errors, rephrases fundamental evaluation issues, and incorporates a case study on CPU evaluation (Wang et al., 2024). For a more comprehensive understanding, please refer to the original article (Zhan et al., 2024). If you wish to cite this work, kindly cite the original article.

Jianfeng Zhan, Lei Wang, Wanling Gao, Hongxiao Li, Chenxi Wang, Yunyou Huang, Yatao Li, Zhengxin Yang, Guoxin Kang, Chunjie Luo, Hainan Ye, Shaopeng Dai, Zhifei Zhang (2024). Evaluatology: The science and engineering of evaluation. BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 4(1), 100162.


Original Articles


BinCodex: A comprehensive and multi-level dataset for evaluating binary code similarity detection techniques

Peihua Zhang, Chenggang Wu, Zhe Wang


Abstract

The binary code similarity detection (BCSD) technique quantitatively measures the differences between two given binaries and gives matching results at a predefined granularity (e.g., function). It has been widely used in scenarios including software vulnerability search, security patch analysis, malware detection, and code clone detection. With the help of deep learning, BCSD techniques have achieved high accuracy in their own evaluations. However, on the one hand, these high accuracies are hard to distinguish from one another because no standard dataset exists, so they fail to reveal the techniques' actual abilities. On the other hand, since binary code can be easily changed, it is essential to gain a holistic understanding of the underlying transformations, including default optimization options, non-default optimization options, and commonly used code obfuscations, and thus to assess their impact on the accuracy and adaptability of BCSD techniques. This paper presents our observations regarding the diversity of BCSD datasets and proposes a comprehensive dataset for the BCSD technique. We evaluate various BCSD works and present detailed results, applying different classifications for different types of BCSD tasks, including pure function pairing and vulnerable code detection. Our results show that most BCSD works cope well with default compiler options but are unsatisfactory when facing non-default compiler options and code obfuscation. We take a layered perspective on the BCSD task and point to opportunities for future optimizations in the technologies we consider.


Analyzing the impact of opportunistic maintenance optimization on manufacturing industries in Bangladesh: An empirical study

Md. Ariful Alam, Md. Rafiquzzaman, Md. Hasan Ali, Gazi Faysal Jubayer


Abstract

The study investigates the impact of opportunistic maintenance (OM) optimization on manufacturing industries, especially in Bangladesh, to reduce maintenance costs. To that end, OM strategies have been proposed and optimized for multi-unit manufacturing systems, whereas most of the existing research addresses single- or two-unit systems. Each OM strategy in this research follows one of three policies: preventive replacement, preventive repair, or a two-level maintenance approach. The proposed two-level maintenance approach combines lower-level maintenance, known as preventive repair, with higher-level maintenance, known as preventive replacement. Simulation optimization (SO) techniques using Python were utilized to evaluate the strategies. Historical data from two of Bangladesh's most promising and significant sectors, the footwear and railway industries, were used as the case studies. Compared to the currently utilized corrective maintenance approach, the two-level maintenance approach is the most effective for both case studies, demonstrating cost savings of 16.9 % and 22.4 % for the footwear and railway industries, respectively. This study reveals that manufacturing industries can achieve significant cost savings by implementing the proposed OM strategies, a concept that has yet to be explored in developing countries like Bangladesh. However, the study applied the proposed approaches only to major components of the system, and more significant benefits can be achieved if they can be applied to all critical components of the system.


Enhanced deep learning based decision support system for kidney tumour detection

Taha ETEM, Mustafa TEKE


Abstract

This study presents a high-accuracy deep learning-based decision support system for kidney cancer detection. The research utilizes a relatively large dataset of 10,000 CT images, including both healthy and tumour-detected kidney scans. After data preprocessing and optimization, various deep learning models were evaluated, with DenseNet-201 emerging as the top performer, achieving an accuracy of 99.75 %. The study compares multiple deep learning architectures, including AlexNet, EfficientNet, Darknet-53, Xception, and DenseNet-201, across different learning rates. Performance metrics such as accuracy, precision, sensitivity, F1-score, and specificity are analysed using confusion matrices. The proposed system outperforms the other deep learning networks evaluated, demonstrating superior accuracy in kidney cancer detection. The improvement is attributed to effective data engineering and hyperparameter optimization of the deep learning networks. This research contributes to the field of medical image analysis by providing a robust decision support tool for early and rapid diagnosis of kidney cancer. The high accuracy and efficiency of the proposed system make it a promising aid for healthcare professionals in clinical settings.
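
For readers unfamiliar with the metrics named in this abstract, the following is a minimal sketch of how accuracy, precision, sensitivity, specificity, and F1-score are conventionally derived from a binary confusion matrix. It is not the authors' code: the function name `binary_metrics` and the example counts are hypothetical, chosen only for illustration, and the abstract does not state which class was treated as positive.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard metrics from binary confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. The name and signature are illustrative only.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    accuracy = (tp + tn) / total
    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study.
    for name, value in binary_metrics(tp=40, fp=10, fn=10, tn=40).items():
        print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters in screening settings, because accuracy alone can look high when one class dominates the dataset.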