What Are AI Data Scientists?

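What Are AI Data Scientists

AI Data Scientists are agents that use AI/LLMs as their decision-making core to automatically complete end-to-end data science tasks.

Typically, such an agent calls code interpreters, databases, visualization tools, and machine learning tools, chaining "understand the problem, perceive the data, clean the data and build features, model and evaluate, generate a report" into an executable closed loop.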
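The chain of stages above can be sketched as a list of functions that pass a shared state along. This is only an illustration under stated assumptions: the `State` class, the stage names, and the toy data are hypothetical, and the "model" is a plain mean baseline standing in for real training.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class State:
    """Shared state handed from stage to stage (hypothetical structure)."""
    question: str
    rows: list
    notes: list = field(default_factory=list)
    result: Optional[float] = None


def understand(state):
    # Record the task; a real agent would parse the goal with an LLM.
    state.notes.append(f"task: {state.question}")
    return state


def perceive(state):
    # Inspect the data before touching it.
    missing = sum(1 for r in state.rows if r["sales"] is None)
    state.notes.append(f"rows={len(state.rows)}, missing sales={missing}")
    return state


def clean(state):
    # Drop rows whose target value is missing.
    state.rows = [r for r in state.rows if r["sales"] is not None]
    return state


def model(state):
    # Stand-in for modeling and evaluation: a mean baseline.
    state.result = mean(r["sales"] for r in state.rows)
    return state


def report(state):
    state.notes.append(f"mean sales = {state.result:.1f}")
    return state


PIPELINE = [understand, perceive, clean, model, report]


def run(state):
    for stage in PIPELINE:
        state = stage(state)
    return state


final = run(State("What is the average sales figure?",
                  [{"sales": 10.0}, {"sales": None}, {"sales": 20.0}]))
print("\n".join(final.notes))
```

Keeping every stage behind the same `State -> State` signature makes it easy to reorder, retry, or log individual steps.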

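What Are They Good For?

  1. Scientific discovery: automatically read data, propose hypotheses, run experiments, and reproduce empirical results.
  2. Industrial modeling: go from business data to predictive models, automating cleaning, feature engineering, training, evaluation, and pre-deployment checks.
  3. Business analytics: generate SQL, charts, attribution analyses, and interpretable reports from natural-language questions.
  4. Personal assistant: help individuals with small-scale data tasks such as spreadsheets, finances, health, and study records.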

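How They Differ from Traditional AutoML

Traditional AutoML is essentially a tool invoked by a human or an agent. It mainly speeds up local steps such as model selection, hyperparameter tuning, and feature processing. It usually depends on well-specified data, tasks, and evaluation metrics, struggles with open-ended modeling problems, and cannot cover the full end-to-end process from problem definition to report generation.

AI Data Scientists, by contrast, act as the owner of the task: they can understand analysis goals in the open world, actively explore data, call tools, plan workflows, correct errors, and complete end-to-end data science tasks either in collaboration with humans or independently under supervision.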
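The "correct errors" part of this behavior can be illustrated with a small retry loop. This is a sketch under a strong simplifying assumption: a fixed list of candidate steps stands in for the revisions an LLM would propose after seeing a failure. The names `run_with_repair`, `naive_mean`, and `robust_mean` are hypothetical.

```python
def run_with_repair(candidates, data, max_attempts=3):
    """Try candidate analysis steps in order and keep a log of each attempt."""
    log = []
    for attempt, step in enumerate(candidates[:max_attempts], start=1):
        try:
            result = step(data)
        except Exception as exc:
            # A real agent would feed this error back to the model.
            log.append(f"attempt {attempt} failed: {type(exc).__name__}")
            continue
        log.append(f"attempt {attempt} succeeded")
        return result, log
    raise RuntimeError("all attempts failed: " + "; ".join(log))


def naive_mean(values):
    # First draft: breaks when the column contains None.
    return sum(values) / len(values)


def robust_mean(values):
    # Revised draft: skips missing values before averaging.
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


result, log = run_with_repair([naive_mean, robust_mean], [1.0, None, 3.0])
print(result, log)
```

An AutoML tool would typically surface the first failure to its caller, whereas the loop above treats the failure as feedback for the next attempt.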

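Core Framework

AI Data Scientists can be viewed as three cooperating layers:

  1. Data-World Perception: perceive the real data world, including tables, text, images, time series, and multi-source databases, as well as dirty data with missing values, noise, and inconsistencies.
  2. Agentic Data Reasoning: make decisions about data science methods, including task decomposition, model selection, workflow planning, code generation, and learning from experimental feedback.
  3. System-Level Operation: execute reliably in real systems, including sandboxes, tool calls, database connections, resource management, logging, monitoring, and security governance.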

Figure: Framework of autonomous data science agents

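Key Challenges

Data perception is hard. Real data often lacks a clean schema and comes with multiple modalities, tables, and sources, along with missing values, outliers, and semantic inconsistencies. An agent must not only "read the data" but also know which data can be trusted, which variables are meaningful, and which transformations would change the conclusions.

Complex modeling is hard. Data science is not single-step code generation but a long chain of decisions: choosing the problem formulation, constructing features, comparing models, designing validation, explaining errors, and avoiding leakage. If any step goes wrong, the final high score or polished chart may be spurious.

Reliable execution is hard. Agents write code, modify data, call external tools, and make mistakes. They therefore need reproducible environments, access control, automated auditing, failure recovery, and human confirmation mechanisms.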
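A first step toward deciding which data can be trusted is a per-column profile. The sketch below is a minimal illustration, not a complete data-quality check: the function name `profile_column`, the median-absolute-deviation rule, and the threshold of 5 are all assumptions chosen for the example.

```python
from statistics import median


def profile_column(values, threshold=5.0):
    """Report missing rate, mixed types, and robust outliers for one column."""
    present = [v for v in values if v is not None]
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    report = {
        "n": len(values),
        "missing_rate": (1 - len(present) / len(values)) if values else 0.0,
        "mixed_types": len({type(v).__name__ for v in present}) > 1,
        "outliers": [],
    }
    if len(numeric) >= 3:
        med = median(numeric)
        # Median absolute deviation: a spread estimate that outliers barely move.
        mad = median(abs(v - med) for v in numeric)
        if mad > 0:
            report["outliers"] = [v for v in numeric
                                  if abs(v - med) / mad > threshold]
    return report


# Hypothetical "age" column with a missing value, a typo, and a text entry.
ages = [34, 29, None, 41, 38, 420, "unknown"]
age_report = profile_column(ages)
print(age_report)
```

Here the profile flags `420` as an outlier and reports that the column mixes integers with strings, which are the kinds of signals an agent would need before choosing a cleaning step.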
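Leakage is easiest to see in preprocessing. The sketch below, on made-up numbers, standardizes a held-out period twice: once with statistics computed on the full series (leaky) and once with statistics from the training period only. The helper names `fit_scaler` and `transform` are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the (mean, standard deviation) used for standardization."""
    return mean(values), pstdev(values)


def transform(values, scaler):
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


# Toy time-ordered series with a level shift in the last three points.
series = [10.0, 12.0, 11.0, 13.0, 12.0, 30.0, 32.0, 31.0]
split = 5
train, test = series[:split], series[split:]

# Leaky: the scaler has already seen the test period.
leaky = transform(test, fit_scaler(series))

# Safe: the scaler is fit on the training period only.
safe = transform(test, fit_scaler(train))

print("leaky:", [round(v, 2) for v in leaky])
print("safe: ", [round(v, 2) for v in safe])
```

With the leaky scaler the shifted test points look only mildly unusual, while the train-only scaler shows them as far outside the training range. A validation score computed on the leaky version would understate how much the data has drifted.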
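One small piece of such safeguards is running generated code in a separate process with a time limit and an audit record. The sketch below uses only the Python standard library. It is an assumption-laden illustration: a child process with a timeout is not a security sandbox, and the function name `run_untrusted` and the record fields are hypothetical.

```python
import subprocess
import sys
import time


def run_untrusted(code, timeout_s=5.0, audit_log=None):
    """Run a code string in a child interpreter and record what happened."""
    started = time.time()
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        record = {
            "status": "ok" if proc.returncode == 0 else "error",
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }
    except subprocess.TimeoutExpired:
        record = {"status": "timeout", "stdout": "", "stderr": ""}
    record["code"] = code
    record["seconds"] = round(time.time() - started, 3)
    if audit_log is not None:
        audit_log.append(record)  # keep every attempt, including failures
    return record


audit = []
ok = run_untrusted("print(1 + 1)", audit_log=audit)
bad = run_untrusted("raise ValueError('boom')", audit_log=audit)
slow = run_untrusted("import time; time.sleep(5)", timeout_s=0.5, audit_log=audit)
print([r["status"] for r in audit])
```

A production setup would add real isolation (containers, restricted file and network access) and a human confirmation step before any action that modifies data.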

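Summary

AI Data Scientists are still at an early stage. How to systematically evaluate their real capabilities, how to build stronger, more general, and more reliable agents, and how to make them collaborate effectively with humans on open-world tasks all require further exploration.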

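Benchmarks and Evaluation

No single benchmark yet covers the full real-world data science process: DS-1000 leans toward code generation, MLE-bench toward Kaggle-style machine learning engineering, DABstep and KramaBench are closer to real business analysis, while DSGym and ResearchGym go further by bringing tools, environments, execution trajectories, and auditing into the evaluation.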
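Several of the benchmarks listed in the table score final answers automatically by exact or approximate matching. The sketch below shows one way such a scorer could look. It is not the official scorer of any benchmark: the function names and the 1% relative tolerance are assumptions made for illustration.

```python
import math


def score_answer(predicted, expected, rel_tol=1e-2):
    """Numeric answers match within a tolerance; others match as normalized text."""
    try:
        p, e = float(predicted), float(expected)
    except (TypeError, ValueError):
        # Fall back to a case-insensitive, whitespace-trimmed string comparison.
        return str(predicted).strip().lower() == str(expected).strip().lower()
    return math.isclose(p, e, rel_tol=rel_tol)


def accuracy(pairs):
    """Fraction of (predicted, expected) pairs that match."""
    return sum(score_answer(p, e) for p, e in pairs) / len(pairs)


pairs = [("3.14", 3.1416), (" Paris", "paris"), ("12", "21")]
print(accuracy(pairs))
```

A scorer like this only checks the final answer. It says nothing about whether the intermediate steps were sound, which is part of why trajectory-level and audit-based evaluation is being explored.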

| Benchmark | Year | Primary Focus | Task/Data Scale | Core Evaluation Style |
| --- | --- | --- | --- | --- |
| DS-1000 [3] | 2023 | Data-science code generation | 1,000 problems spanning seven Python libraries | Executable tests and surface-form constraints; reliability against false positives |
| InfiAgent-DABench [4] | 2024 | Agentic data analysis on CSV files | Approximately 311 public validation questions over 55 CSV files; broader benchmark framework around DAEval | Closed-form correctness with sandboxed execution; accuracy metric |
| DSEval [5] | 2024 | Whole data-science lifecycle | Four benchmark families: Exercise, Stack Overflow, LeetCode, and Kaggle; 825 total problems across sets | Pass rate, error propagation, intactness, and presentation-error variants |
| TaskWeaver [6] | 2024 | Stateful data-analytics agent framework | 23 internal evaluation cases plus transformed DS-1000, DABench, and DSEval tasks | Normalized aggregate score across benchmark suites |
| MLAgentBench [7] | 2024 | End-to-end ML experimentation | 13 tasks across image, text, graph, tabular, time series, and recent research datasets | Success rate based on final artifact quality in controlled environments |
| DSBench [8] | 2024 | Realistic data analysis and modeling | 466 analysis tasks and 74 Kaggle-style modeling tasks | Accuracy for analysis; RPG and task success for modeling; human baselines included |
| DA-Code [9] | 2024 | Agent data-science code generation | 500 wrangling, ML, and EDA tasks | Total score, completion rate, executable-code rate |
| MLE-bench [10] | 2025 | ML engineering via Kaggle competitions | 75 competitions; Lite split of 22 competitions | Kaggle-style scoring mapped to leaderboard percentiles and Any Medal percentage |
| DataSciBench [11] | 2025 | Broad data-science prompt benchmark with fine-grained metrics | 23 evaluated models; 519 TFC test cases derived from curated prompts | Success rate, completion rate, VLM judging, 25 aggregate-function metrics, and total score |
| DABstep [12] | 2025 | Realistic multi-step data analysis with heterogeneous documents | 450+ enterprise data-analysis tasks from Adyen | Factoid-style automatic correctness; hard/easy split; cost reporting; public leaderboard |
| KramaBench [13] | 2025 | Data-to-insight pipelines over data lakes | 104 pipelines, 1,700 files, 24 sources, and 6 domains | End-to-end and subtask evaluation; exact/approximate string and numeric scoring; runtime per task |
| ScienceAgentBench [14] | 2025 | Data-driven scientific discovery workflows | 102 tasks from 44 papers in four disciplines | Generated Python programs scored by execution, results, and cost; multiple frameworks per model |
| PaperBench [15] | 2025 | Replicating frontier ML research papers | 20 ICML 2024 spotlight/oral papers | Hierarchical research-task rubrics for replication |
| AgentDS [16] | 2026 | Domain-specific data science and human-AI collaboration | 17 challenges across 6 industries; 29 teams and 80 participants | Competition ranking; AI-only versus human-AI comparison |
| DSGym [17] | 2026 | Standardized evaluation and training framework | Standardized/refined benchmark suite plus DSBio and DSPredict; 2,000-example training case study | Unified execution environments; quality and shortcut-solvability filtering |
| ResearchGym [18] | 2026 | Real-world AI research agents | 5 research environments and 39 subtasks | Improvement over strong baselines plus subtask completion in fixed containers |

References

  1. Tang et al. LLM/Agent-as-Data-Analyst: A Survey. arXiv, 2025.
  2. Cao et al. Spider2-V: How Far Are Multimodal Agents from Automating Data Science and Engineering Workflows? NeurIPS, 2024.
  3. Lai et al. DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. ICML, 2023.
  4. Hu et al. InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks. arXiv, 2024.
  5. Zhang et al. Benchmarking Data Science Agents. ACL, 2024.
  6. Qiao et al. TaskWeaver: A Code-First Agent Framework. arXiv, 2023.
  7. Huang et al. MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. arXiv, 2023.
  8. Jing et al. DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? arXiv, 2024.
  9. Huang et al. DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models. EMNLP, 2024.
  10. Chan et al. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. ICLR, 2025.
  11. Zhang et al. DataSciBench: An LLM Agent Benchmark for Data Science. arXiv, 2025.
  12. Egg et al. DABstep: Data Agent Benchmark for Multi-Step Reasoning. arXiv, 2025.
  13. Lai et al. KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes. arXiv, 2025.
  14. Chen et al. ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery. ICLR, 2025.
  15. Starace et al. PaperBench: Evaluating AI’s Ability to Replicate AI Research. arXiv, 2025.
  16. Luo et al. AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science. arXiv, 2026.
  17. Nie et al. DSGym: A Holistic Framework for Evaluating and Training Data Science Agents. arXiv, 2026.
  18. Garikaparthi et al. ResearchGym: Evaluating Language Model Agents on Real-World AI Research. arXiv, 2026.

BibTeX

@misc{liu2026autonomousdatascience,
  title  = {Towards Autonomous Data Science with LLM Agents: A Survey and Outlook},
  author = {Liu, Fan and Han, Jindong and Lyu, Tengfei and Yang, Zherui and Liu, Hao},
  year   = {2026},
  url    = {https://luckyfan-cs.github.io/posts/2026/05/ai-data-scientists/}
}