Evaluatology’s perspective on AI evaluation in critical scenarios: From tail quality to landscape
Zhengxin Yang
Abstract
Large Language Models (LLMs) like GPT-3 and GPT-4 have emerged as groundbreaking innovations with capabilities that extend far beyond traditional AI applications. These sophisticated models, trained on massive datasets, can generate human-like text, respond to complex queries, and even write and interpret code. Their potential to revolutionize software development has captivated the software engineering (SE) community, sparking debates about their transformative impact. Through a critical analysis of technical strengths, limitations, real-world case studies, and future research directions, this paper argues that LLMs are not just reshaping how software is developed but are redefining the role of developers. While challenges persist, LLMs offer unprecedented opportunities for innovation and collaboration. Early adoption of LLMs in software engineering is crucial to stay competitive in this rapidly evolving landscape. This paper serves as a guide, helping developers, organizations, and researchers understand how to harness the power of LLMs to streamline workflows and acquire the necessary skills.
Tensor databases empower AI for science: A case study on retrosynthetic analysis
Retrosynthetic analysis is central to chemistry, biology, and materials science, providing essential support for the rational design, synthesis, and optimization of compounds across diverse Artificial Intelligence for Science (AI4S) applications. It explores pathways from products back to reactants, typically using deep learning-based generative models. However, existing retrosynthetic analysis often overlooks how strongly reaction conditions affect chemical reactions. As a result, existing work lacks unified models that can provide full-cycle services for retrosynthetic analysis, and the overall prediction accuracy of retrosynthetic analysis is greatly limited. These two issues force users to depend on many independent models and tools, incurring high labor, time, and cost overhead.
To address these issues, we define the boundary conditions of chemical reactions based on Evaluatology theory and propose BigTensorDB, the first tensor database to integrate storage, prediction generation, search, and analysis. BigTensorDB defines a tensor schema for efficiently storing all key information about chemical reactions, including reaction conditions, and supports a full-cycle retrosynthetic analysis pipeline: it begins by generating predicted reaction paths, then searches for approximate real reactions based on the tensor schema, and concludes with a feasibility analysis that enhances the interpretability of the predictions. BigTensorDB effectively reduces usage cost and improves efficiency for users across the full retrosynthetic analysis cycle. It also offers a potential path toward resolving the low-accuracy issue, encouraging researchers to focus on improving full-cycle accuracy.
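The abstract describes a tensor schema that packs reaction information, including reaction conditions, into searchable records. As a minimal illustrative sketch only (the paper's actual BigTensorDB schema, field names, and dimensions are assumptions here), one can picture each reaction as a flat tensor of fingerprints plus condition values, with approximate search as nearest-neighbor retrieval:

```python
import numpy as np

def encode_reaction(product_fp, reactant_fp, conditions):
    """Pack product/reactant fingerprints and reaction conditions
    (here: temperature in K, pressure in atm) into one flat tensor.
    Field layout is purely illustrative."""
    return np.concatenate([product_fp, reactant_fp, conditions])

def search_similar(db, query, k=1):
    """Approximate search: indices of the k stored reactions
    closest to the query tensor (Euclidean distance)."""
    dists = np.linalg.norm(db - query, axis=1)
    return np.argsort(dists)[:k]

# Toy database: 3 reactions, 4-bit fingerprints, 2 condition values.
db = np.array([
    encode_reaction([1, 0, 1, 0], [0, 1, 1, 0], [298.0, 1.0]),
    encode_reaction([1, 1, 0, 0], [1, 0, 0, 1], [350.0, 2.0]),
    encode_reaction([0, 0, 1, 1], [0, 1, 0, 1], [298.0, 1.0]),
])
query = encode_reaction([1, 0, 1, 0], [0, 1, 1, 0], [300.0, 1.0])
print(search_similar(db, query))  # index of the nearest stored reaction
```

Storing conditions alongside fingerprints is what lets a query at 300 K match a recorded reaction at 298 K rather than a structurally different one; a production system would use chemically meaningful fingerprints and indexed search rather than a linear scan.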
Predicting the number of call center incoming calls using deep learning
Armaghan Nikfar, Javad Mohammadzadeh
Abstract
One of the main problems in call centers is the call queue, which can lead to long customer waiting times, increased frustration, and call abandonment. The role of predictive analytics in optimizing call center operations is increasingly recognized: models can be trained on historical call-volume data, weighting hours by the estimated effect of religious and public holidays on the number of calls. This study analyzes four years of call center data from Shatel, an Internet service provider. A deep learning model, specifically a Bidirectional Long Short-Term Memory (BLSTM) network, was used to forecast the number of incoming calls and help prevent call queues, achieving an accuracy of 90.56%.
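The study trains a bidirectional LSTM on historical hourly call counts enriched with holiday information. This dependency-free sketch shows only the supervised framing such sequence models consume: sliding windows of past counts plus a per-hour holiday flag. All numbers and the window length are made-up illustrations, not the paper's actual features:

```python
def make_windows(calls, holidays, lookback=3):
    """Turn an hourly call series into (features, target) pairs:
    features = the last `lookback` counts + the current hour's holiday flag,
    target = the current hour's call count."""
    samples = []
    for t in range(lookback, len(calls)):
        x = calls[t - lookback:t] + [holidays[t]]
        samples.append((x, calls[t]))
    return samples

calls    = [120, 95, 80, 200, 210, 60, 55]   # hourly incoming calls (toy data)
holidays = [0,   0,  0,  0,   0,   1,  1]    # 1 = religious/public holiday
pairs = make_windows(calls, holidays)
print(pairs[0])  # ([120, 95, 80, 0], 200)
```

Each pair would then be fed to the recurrent model; the holiday flag is what lets the network learn the depressed call volume on holiday hours (60, 55 above) instead of treating it as noise.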
Review Articles
Regulatory landscape of blockchain assets: Analyzing the drivers of NFT and cryptocurrency regulation
The study analyzes the global regulatory landscape for blockchain assets, particularly cryptocurrencies and non-fungible tokens (NFTs), focusing on the motivations behind policymakers' actions, the diversity of regulatory approaches, the challenges posed by decentralized technologies, and possible future regulatory pathways. The study uses a conceptual and mixed-method approach, combining qualitative and quantitative content analysis of 59 peer-reviewed articles selected through the PRISMA framework. Findings reveal that regulation is primarily driven by concerns over consumer protection, financial stability, anti-money laundering, taxation, and environmental sustainability. Regulatory responses vary widely, ranging from the harmonized MiCA framework in the EU to the fragmented enforcement model in the U.S., along with diverse strategies across Asia. Stablecoins, DeFi, and CBDCs emerge as major regulatory frontiers. The study recommends adopting regulatory sandboxes, promoting international coordination, enforcing environmental standards, and building regulatory capacity in emerging economies to balance innovation with risk mitigation. It also highlights the importance of industry self-regulation and technology-assisted compliance in decentralized finance. The study's limitation is its reliance solely on secondary data sources, which may limit the accuracy of real-time policy impact assessments. Future research should focus on empirical validation and dynamic policy modeling to enhance global governance of digital assets.
Ethical and regulatory challenges in machine learning-based healthcare systems: A review of implementation barriers and future directions
Shehu Mohammed, Neha Malhotra
Abstract
Machine learning significantly enhances clinical decision-making quality, directly impacting patient care through early diagnosis, personalized treatment, and predictive analytics. Nonetheless, the increasing proliferation of ML applications in practice raises ethical and regulatory obstacles that may prevent their widespread adoption in healthcare. Key issues include patient data privacy, algorithmic bias, lack of transparency, and ambiguous legal liability. Regulations such as the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), and the FDA's AI/ML guidance have established important mechanisms for addressing fairness, explainability, and legal compliance; however, the landscape is far from risk-free. AI liability remains a pronounced gray area: it is often unclear whether the developers, the physicians, or the institutions are liable for an AI-driven medical error. The study reviews ethical risks and opportunities, regulatory frameworks, and emerging challenges in AI-driven healthcare. It proposes solutions to reduce bias, improve transparency, and strengthen legal accountability, supporting the safe, fair, and effective deployment of ML-based systems in clinical practice so that patients can trust them, regulators can approve them, and healthcare providers can use them.
AICB: A benchmark for evaluating the communication subsystem of LLM training clusters
Xinyue Li, Heyang Zhou, Qingxu Li, Sen Zhang, Gang Lu
Abstract
AICB (Artificial Intelligence Communication Benchmark) is a benchmark for evaluating the communication subsystem of GPU clusters, with representative workloads from Large Language Model (LLM) training. Guided by the theories and methodologies of Evaluatology, we simplify real-workload LLM training systems into AICB workloads that retain good representativeness and usability. AICB bridges the gap between application benchmarks and microbenchmarks for LLM training. In addition, we construct a GPU-free evaluation system that helps researchers evaluate the communication subsystem of LLM training systems. To meet the urgent demand for such evaluation, we open-source AICB and make it available at https://github.com/aliyun/aicb .
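AICB's own metrics live in the linked repository; as background only, a standard way communication benchmarks score collectives (the convention used by NCCL's performance tests, not necessarily AICB's exact code) is to convert a measured all-reduce time into algorithm bandwidth and bus bandwidth via the ring correction factor 2(n-1)/n. The numbers below are illustrative:

```python
def allreduce_bandwidth(size_bytes, time_s, n_ranks):
    """Score one all-reduce measurement: return (algbw, busbw) in GB/s.
    algbw = raw data size over time; busbw applies the 2*(n-1)/n factor,
    reflecting the bytes a ring all-reduce actually moves per rank."""
    algbw = size_bytes / time_s / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw

# e.g. a 1 GiB all-reduce over 8 ranks measured at 25 ms:
algbw, busbw = allreduce_bandwidth(1 << 30, 0.025, 8)
print(round(algbw, 1), round(busbw, 1))  # 42.9 75.2
```

Bus bandwidth is the useful number for comparing clusters because it is roughly independent of rank count, which is what makes a GPU-free, workload-level comparison of communication subsystems meaningful.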