What Are AI Data Scientists?

Published: May 27, 2026


What Are AI Data Scientists?

AI Data Scientists are agents that use AI/LLMs as the decision-making core to complete end-to-end data science tasks automatically.

In practice, they often call code interpreters, databases, visualization tools, and machine-learning toolkits, connecting problem understanding, data perception, cleaning, feature engineering, modeling, evaluation, and reporting into an executable loop.

What Are They Useful For?

Scientific discovery: reading data, forming hypotheses, running experiments, and reproducing empirical findings.
Industrial modeling: moving from business data to predictive models through automated cleaning, features, training, evaluation, and pre-deployment checks.
Business analytics: generating SQL, charts, attribution analysis, and interpretable reports from natural-language questions.
Personal assistants: helping individuals work with spreadsheets, finance logs, health records, learning traces, and other small datasets.

How Is This Different from Traditional AutoML?

Traditional AutoML is essentially a tool used by a human or an agent. It mainly accelerates local steps such as model selection, hyperparameter tuning, and feature processing. It usually assumes a well-defined dataset, task, and metric, so it struggles with open-ended modeling and only partially speeds up the broader data-science workflow.

AI Data Scientists are closer to task-level actors. They can interpret open-world analytical goals, explore data, call tools, plan workflows, fix errors, and complete end-to-end data science tasks either in collaboration with humans or autonomously under supervision.
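The contrast can be made concrete with a toy sketch. It is only an illustration under stated assumptions: `automl_tool` and `agent_run` are hypothetical names, the "model" is a single threshold, and a real agent would use an LLM to choose each step rather than a fixed script.

```python
def automl_tool(xs, ys, candidates):
    """Tool-style AutoML: the data, task, and metric are fixed by the caller.

    It only searches one narrow choice: the threshold with the best accuracy.
    """
    def accuracy(threshold):
        hits = sum((x >= threshold) == y for x, y in zip(xs, ys))
        return hits / len(ys)
    return max(candidates, key=accuracy)


def agent_run(goal, raw_rows):
    """Actor-style loop: prepare the data, call the tool, and keep a record."""
    log = [f"goal: {goal}"]

    # Step 1: explore the data and repair what the tool cannot handle.
    clean = [(x, y) for x, y in raw_rows if x is not None]
    log.append(f"dropped {len(raw_rows) - len(clean)} rows with missing x")

    # Step 2: delegate the narrow search to the AutoML tool.
    xs = [x for x, _ in clean]
    ys = [y for _, y in clean]
    best = automl_tool(xs, ys, candidates=sorted(set(xs)))
    log.append(f"best threshold: {best}")
    return best, log


raw = [(1.0, False), (2.0, False), (None, True), (3.0, True), (4.0, True)]
best, log = agent_run("predict the label from x", raw)
print(best)
print(log)
```

The tool never sees the missing value or the goal statement; handling both is the job of the surrounding loop.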

Core Framework

AI Data Scientists can be viewed as a three-layer system:

Data-World Perception: perceiving real-world data, including tables, text, images, time series, and multi-source databases, along with dirty-data problems such as missing values, noise, and inconsistent semantics.
Agentic Data Reasoning: making data-science decisions, including task decomposition, modeling choices, workflow planning, code generation, and learning from experimental feedback.
System-Level Operation: executing reliably in real systems, including sandboxes, tool calls, database connections, resource management, logs, monitoring, and safety governance.
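One way to picture how the three layers hand work to each other is the minimal sketch below. The class and function names (`DataProfile`, `Plan`, `RunLog`, `perceive`, `reason`, `operate`) are assumptions made for illustration, and the rule-based `reason` step stands in for what would be an LLM-driven planner.

```python
from dataclasses import dataclass, field


@dataclass
class DataProfile:
    columns: list
    missing: dict  # column name -> number of missing cells


@dataclass
class Plan:
    steps: list


@dataclass
class RunLog:
    events: list = field(default_factory=list)


def perceive(rows):
    """Data-World Perception: summarize what the data looks like."""
    columns = list(rows[0].keys())
    missing = {c: sum(r[c] is None for r in rows) for c in columns}
    return DataProfile(columns, missing)


def reason(profile, target):
    """Agentic Data Reasoning: turn the profile into an ordered workflow."""
    steps = [f"impute:{c}" for c, n in profile.missing.items()
             if n > 0 and c != target]
    steps += [f"train:{target}", f"evaluate:{target}", "report"]
    return Plan(steps)


def operate(plan, handlers):
    """System-Level Operation: run each step and keep an audit trail."""
    log = RunLog()
    for step in plan.steps:
        name = step.split(":")[0]
        try:
            handlers[name](step)
            log.events.append((step, "ok"))
        except Exception as exc:  # record the failure and stop the run
            log.events.append((step, f"failed: {exc}"))
            break
    return log


rows = [{"age": 30, "income": None, "churn": 0},
        {"age": None, "income": 5, "churn": 1}]
plan = reason(perceive(rows), target="churn")
noop = {name: (lambda step: None)
        for name in ("impute", "train", "evaluate", "report")}
print(plan.steps)
print(operate(plan, noop).events)
```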

Figure: Framework of autonomous data science agents

Key Challenges

Data perception is hard. Real data rarely comes with a clean schema. It is often multimodal, multi-table, multi-source, incomplete, noisy, and semantically inconsistent. An agent must know not only what data exists, but also which data is trustworthy, which variables matter, and which transformations may change the conclusion.
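A first perception pass might look like the sketch below, which profiles a small CSV using only the Python standard library. The set of missing-value tokens and the helper names `profile_csv` and `suspicious_columns` are assumptions for illustration; real data would need a richer notion of type and trust.

```python
import csv
import io

# Assumed spellings of "missing"; real datasets often use others.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "-"}


def profile_csv(text):
    """Count missing, numeric, and text cells for each column."""
    reader = csv.DictReader(io.StringIO(text))
    stats = {name: {"missing": 0, "numeric": 0, "text": 0}
             for name in reader.fieldnames}
    for row in reader:
        for name, raw in row.items():
            if name is None:  # extra cells beyond the header are skipped
                continue
            value = (raw or "").strip()
            if value.lower() in MISSING_TOKENS:
                stats[name]["missing"] += 1
                continue
            try:
                float(value)
                stats[name]["numeric"] += 1
            except ValueError:
                stats[name]["text"] += 1
    return stats


def suspicious_columns(stats):
    """Columns mixing numbers and text usually need an explicit decision."""
    return [n for n, s in stats.items() if s["numeric"] and s["text"]]


sample = "id,age,city\n1,34,Paris\n2,N/A,paris\n3,forty,\n"
stats = profile_csv(sample)
print(stats)
print(suspicious_columns(stats))
```

Here `age` is flagged because it mixes a number with the word "forty", while the inconsistent casing of "Paris" and "paris" goes undetected. That is a reminder that syntactic profiling catches only part of what semantic inconsistency covers.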

Complex modeling is hard. Data science is not one-step code generation. It is a long-horizon decision process: formulating the problem, constructing features, comparing models, designing validation, explaining errors, and avoiding leakage. A single wrong step can make the final score or chart misleading.
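Leakage is easy to introduce in a single line. The sketch below, with hypothetical helper names and made-up numbers, shows one common case: scaling statistics computed on the full dataset carry information from the held-out period into training, whereas fitting them on the training split alone does not.

```python
import statistics


def split_by_time(rows, train_fraction=0.75):
    """Keep the earliest rows for training and the latest for testing."""
    ordered = sorted(rows, key=lambda r: r["t"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def fit_scaler(values):
    """Return (mean, std); a zero std falls back to 1.0 to avoid division by zero."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0
    return mean, std


def transform(values, scaler):
    mean, std = scaler
    return [(v - mean) / std for v in values]


# Toy series whose last two points behave very differently from the rest.
rows = [{"t": t, "x": x} for t, x in enumerate([1, 2, 3, 4, 5, 6, 100, 120])]
train, test = split_by_time(rows)

safe = fit_scaler([r["x"] for r in train])   # uses training rows only
leaky = fit_scaler([r["x"] for r in rows])   # has already seen the test rows

print("train-only mean:", safe[0])
print("full-data mean:", leaky[0])
print(transform([r["x"] for r in test], safe))
```

The two means differ sharply, so any model trained on features scaled with the full-data statistics has been shaped by values it should not have seen.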

Reliable execution is hard. Agents write code, modify data, and call external tools, so they can also make operational mistakes. Deployment requires reproducible environments, access control, automatic auditing, failure recovery, and human confirmation when needed.
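Several of these safeguards can be sketched in a few lines. The example below runs generated code in a separate Python process with a timeout, records a hash for auditing, and blocks code that matches a small list of risky markers unless a confirmation callback approves it. The marker list and the names `needs_confirmation` and `run_generated_code` are assumptions, and a child process is not a real sandbox: production systems would add containers, resource limits, and restricted credentials.

```python
import hashlib
import subprocess
import sys
import tempfile

# Assumed, deliberately tiny list of patterns that require human sign-off.
RISKY_MARKERS = ("os.remove", "shutil.rmtree", "DROP TABLE", "DELETE FROM")


def needs_confirmation(code):
    return any(marker in code for marker in RISKY_MARKERS)


def run_generated_code(code, timeout_s=5, confirm=lambda code: False):
    """Run code in a child interpreter and return an audit record."""
    record = {
        "sha256": hashlib.sha256(code.encode()).hexdigest(),
        "status": None,
        "stdout": "",
        "stderr": "",
    }
    if needs_confirmation(code) and not confirm(code):
        record["status"] = "blocked"
        return record

    # A throwaway working directory keeps stray files out of the project tree.
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            record["status"] = "timeout"
            return record

    record["status"] = "ok" if proc.returncode == 0 else "error"
    record["stdout"], record["stderr"] = proc.stdout, proc.stderr
    return record


print(run_generated_code("print(1 + 1)")["status"])
print(run_generated_code("import os; os.remove('data.csv')")["status"])
```

Substring matching is trivially bypassed, so the confirmation gate here only illustrates where a human check would sit, not how to detect dangerous code.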

Summary

AI Data Scientists are still at an early stage. How to evaluate their real capabilities, how to build stronger, more general, and more reliable agents, and how to make them collaborate effectively with humans in open-world tasks remain open questions.

Benchmarks and Evaluation

No single benchmark fully covers the real data-science workflow today. DS-1000 focuses on code generation, and MLE-bench on Kaggle-style ML engineering. DABstep and KramaBench are closer to realistic business analytics, while DSGym and ResearchGym go further by bringing tools, environments, execution traces, and auditing into the evaluation.

Benchmark	Year	Primary Focus	Task/Data Scale	Core Evaluation Style
DS-1000 [3]	2023	Data-science code generation	1,000 problems spanning seven Python libraries	Executable tests and surface-form constraints; reliability against false positives
InfiAgent-DABench [4]	2024	Agentic data analysis on CSV files	Approximately 311 public validation questions over 55 CSV files; broader benchmark framework around DAEval	Closed-form correctness with sandboxed execution; accuracy metric
DSEval [5]	2024	Whole data-science lifecycle	Four benchmark families: Exercise, Stack Overflow, LeetCode, and Kaggle; 825 total problems across sets	Pass rate, error propagation, intactness, and presentation-error variants
TaskWeaver [6]	2024	Stateful data-analytics agent framework	23 internal evaluation cases plus transformed DS-1000, DABench, and DSEval tasks	Normalized aggregate score across benchmark suites
MLAgentBench [7]	2024	End-to-end ML experimentation	13 tasks across image, text, graph, tabular, time series, and recent research datasets	Success rate based on final artifact quality in controlled environments
DSBench [8]	2024	Realistic data analysis and modeling	466 analysis tasks and 74 Kaggle-style modeling tasks	Accuracy for analysis; Relative Performance Gap (RPG) and task success for modeling; human baselines included
DA-Code [9]	2024	Agent data-science code generation	500 wrangling, ML, and EDA tasks	Total score, completion rate, executable-code rate
MLE-bench [10]	2025	ML engineering via Kaggle competitions	75 competitions; Lite split of 22 competitions	Kaggle-style scoring mapped to leaderboard percentiles and Any Medal percentage
DataSciBench [11]	2025	Broad data-science prompt benchmark with fine-grained metrics	23 evaluated models; 519 TFC test cases derived from curated prompts	Success rate, completion rate, VLM judging, 25 aggregate-function metrics, and total score
DABstep [12]	2025	Realistic multi-step data analysis with heterogeneous documents	450+ enterprise data-analysis tasks from Adyen	Factoid-style automatic correctness; hard/easy split; cost reporting; public leaderboard
KramaBench [13]	2025	Data-to-insight pipelines over data lakes	104 pipelines, 1,700 files, 24 sources, and 6 domains	End-to-end and subtask evaluation; exact/approximate string and numeric scoring; runtime per task
ScienceAgentBench [14]	2025	Data-driven scientific discovery workflows	102 tasks from 44 papers in four disciplines	Generated Python programs scored by execution, results, and cost; multiple frameworks per model
PaperBench [15]	2025	Replicating frontier ML research papers	20 ICML 2024 spotlight/oral papers	Hierarchical research-task rubrics for replication
AgentDS [16]	2026	Domain-specific data science and human-AI collaboration	17 challenges across 6 industries; 29 teams and 80 participants	Competition ranking; AI-only versus human-AI comparison
DSGym [17]	2026	Standardized evaluation and training framework	Standardized/refined benchmark suite plus DSBio and DSPredict; 2,000-example training case study	Unified execution environments; quality and shortcut-solvability filtering
ResearchGym [18]	2026	Real-world AI research agents	5 research environments and 39 subtasks	Improvement over strong baselines plus subtask completion in fixed containers
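Several rows in the table rely on automatic answer checking with exact or approximate matching. The sketch below shows the general idea only; it is not the scorer of any listed benchmark, and the function names and the 1% relative tolerance are assumptions.

```python
import math


def normalize(text):
    """Lowercase, trim, and collapse whitespace before comparing strings."""
    return " ".join(str(text).strip().lower().split())


def score_answer(predicted, expected, rel_tol=1e-2):
    """Return 1.0 for a match and 0.0 otherwise.

    Numeric answers match within a relative tolerance; anything that cannot
    be parsed as a number falls back to normalized string equality.
    """
    try:
        p, e = float(predicted), float(expected)
    except (TypeError, ValueError):
        return 1.0 if normalize(predicted) == normalize(expected) else 0.0
    return 1.0 if math.isclose(p, e, rel_tol=rel_tol) else 0.0


def accuracy(pairs):
    """Mean score over (predicted, expected) pairs."""
    return sum(score_answer(p, e) for p, e in pairs) / len(pairs)


pairs = [("3.14", 3.1416), (" Paris ", "paris"), ("12", "21"), ("yes", "no")]
print(accuracy(pairs))
```

The tolerance is a design choice with real consequences: too loose and wrong analyses pass, too strict and correct answers with different rounding fail.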

References

[1] Tang et al. LLM/Agent-as-Data-Analyst: A Survey. arXiv, 2025.
[2] Cao et al. Spider2-V: How Far Are Multimodal Agents from Automating Data Science and Engineering Workflows? NeurIPS, 2024.
[3] Lai et al. DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. ICML, 2023.
[4] Hu et al. InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks. arXiv, 2024.
[5] Zhang et al. Benchmarking Data Science Agents. ACL, 2024.
[6] Qiao et al. TaskWeaver: A Code-First Agent Framework. arXiv, 2023.
[7] Huang et al. MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. arXiv, 2023.
[8] Jing et al. DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? arXiv, 2024.
[9] Huang et al. DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models. EMNLP, 2024.
[10] Chan et al. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. ICLR, 2025.
[11] Zhang et al. DataSciBench: An LLM Agent Benchmark for Data Science. arXiv, 2025.
[12] Egg et al. DABstep: Data Agent Benchmark for Multi-Step Reasoning. arXiv, 2025.
[13] Lai et al. KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes. arXiv, 2025.
[14] Chen et al. ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery. ICLR, 2025.
[15] Starace et al. PaperBench: Evaluating AI’s Ability to Replicate AI Research. arXiv, 2025.
[16] Luo et al. AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science. arXiv, 2026.
[17] Nie et al. DSGym: A Holistic Framework for Evaluating and Training Data Science Agents. arXiv, 2026.
[18] Garikaparthi et al. ResearchGym: Evaluating Language Model Agents on Real-World AI Research. arXiv, 2026.

BibTeX

@misc{liu2026autonomousdatascience,
  title  = {Towards Autonomous Data Science with LLM Agents: A Survey and Outlook},
  author = {Liu, Fan and Han, Jindong and Lyu, Tengfei and Yang, Zherui and Liu, Hao},
  year   = {2026},
  url    = {https://luckyfan-cs.github.io/posts/2026/05/ai-data-scientists/}
}

Fan LIU
