How do you turn a vague, business-style question over messy folders of CSV, JSON and text files into reliable Python code without a human analyst in the loop? Google researchers introduce DS-STAR (Data Science Agent via Iterative Planning and Verification), a multi-agent framework that turns open-ended data science questions into executable Python scripts over heterogeneous files. Instead of assuming a clean SQL database and a single query, DS-STAR treats the problem as Text-to-Python and operates directly on mixed formats such as CSV, JSON, Markdown and unstructured text.

From Text To Python Over Heterogeneous Data
Existing data science agents often rely on Text-to-SQL over relational databases. This constraint limits them to structured tables and simple schemas, which does not match many enterprise environments where data sits scattered across documents, spreadsheets and logs.
DS-STAR changes the abstraction. It generates Python code that loads and combines whatever files the benchmark provides. The system first summarizes every file, then uses that context to plan, implement and verify a multi-step solution. This design allows DS-STAR to work on benchmarks such as DABStep, KramaBench and DA-Code, which expect multi-step analysis over mixed file types and require answers in strict formats.


Stage 1: Data File Analysis With Aanalyzer
The first stage builds a structured view of the data lake. For each file (Dᵢ), the Aanalyzer agent generates a Python script (sᵢ_desc) that parses the file and prints essential information such as column names, data types, metadata and text summaries. DS-STAR executes this script and captures the output as a concise description (dᵢ).
This process works for both structured and unstructured data. CSV files yield column-level statistics and samples, while JSON or text files produce structural summaries and key snippets. The collection {dᵢ} becomes shared context for all later agents.
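The analyzer's per-file behavior can be sketched as follows. This is an illustrative stand-in, not the paper's actual agent: the real Aanalyzer writes a fresh script per file with an LLM, whereas `describe_file` here hard-codes a few format branches to show what kind of description (dᵢ) gets captured.

```python
import csv
import io
import json
from pathlib import Path

def describe_file(path: str, max_rows: int = 3) -> str:
    """Produce a concise text description d_i of one file, loosely
    mirroring what an Aanalyzer-generated script might print.
    (Sketch only; names and structure are ours, not the paper's.)"""
    p = Path(path)
    out = io.StringIO()
    out.write(f"file: {p.name}\n")
    if p.suffix == ".csv":
        with p.open(newline="") as f:
            rows = list(csv.reader(f))
        header, body = rows[0], rows[1:]
        out.write(f"columns: {header}\n")
        out.write(f"row_count: {len(body)}\n")
        for r in body[:max_rows]:          # a few sample rows as context
            out.write(f"sample: {r}\n")
    elif p.suffix == ".json":
        data = json.loads(p.read_text())
        if isinstance(data, dict):
            out.write(f"top_level_keys: {sorted(data)}\n")
        else:
            out.write(f"list_length: {len(data)}\n")
    else:  # unstructured files fall back to a raw text snippet
        out.write(f"snippet: {p.read_text()[:200]!r}\n")
    return out.getvalue()
```

The point of returning printed text rather than a parsed object is that the description can be dropped verbatim into the context window of every downstream agent.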


Stage 2: Iterative Planning, Coding And Verification
After file analysis, DS-STAR runs an iterative loop that mirrors how a human works in a notebook. In each round, Aplanner proposes the next step conditioned on the latest execution result (rₖ), so every new step explicitly responds to what went wrong in the previous attempt. The loop of routing, planning, coding, executing and verifying continues until Averifier marks the plan sufficient or the system hits a maximum of 20 refinement rounds.


To satisfy strict benchmark formats, a separate Afinalyzer agent converts the final, verified plan into solution code that enforces answer rules such as rounding and CSV output.
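The kind of formatting code such an agent might emit looks like this. The rounding rule and one-cell CSV layout are hypothetical examples; each benchmark task specifies its own required format.

```python
import csv

def finalize_answer(value: float, out_path: str, decimals: int = 2) -> str:
    """Sketch of Afinalyzer-style output code: round the computed
    answer to the required precision and write it as a CSV file.
    (The rules here are illustrative, not a benchmark's actual spec.)"""
    rounded = round(value, decimals)
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerow([rounded])
    return f"{rounded:.{decimals}f}"
```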
Robustness Modules: Adebugger And Retriever
Realistic pipelines fail on schema drift and missing columns. DS-STAR adds Adebugger to repair broken scripts. When code fails, Adebugger receives the script, the traceback and the analyzer descriptions {dᵢ}. It generates a corrected script by conditioning on all three signals, which is important because many data-centric bugs require knowledge of column headers, sheet names or schema, not only the stack trace.
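Assembling those three signals can be sketched as below. This is an illustrative harness of our own design: in the real system an LLM would then be prompted with this bundle to emit a corrected script.

```python
import traceback

def run_with_debug_context(script: str, descriptions: list[str]) -> dict:
    """Execute a generated script; on failure, bundle the three signals
    an Adebugger-style agent conditions on: the failing script, the
    traceback, and the analyzer descriptions {d_i}. (Sketch only.)"""
    try:
        exec(script, {})
        return {"ok": True}
    except Exception:
        return {
            "ok": False,
            "script": script,
            "traceback": traceback.format_exc(),
            # schema context the stack trace alone cannot provide
            "descriptions": descriptions,
        }
```

The descriptions matter because a `KeyError` on a column name, for example, is only fixable if the repair agent can see which columns actually exist.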
KramaBench introduces another challenge: thousands of candidate files per domain. DS-STAR handles this with a Retriever. The system embeds the user query and each description (dᵢ) using a pre-trained embedding model and selects the top 100 most similar files for the agent context, or all files if there are fewer than 100. In the implementation, the research team used Gemini Embedding 001 for similarity search.
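The ranking step reduces to cosine similarity over embedding vectors. In the sketch below the vectors are assumed to be precomputed (by Gemini Embedding 001 in the paper's implementation; here they are just arrays), and the function only handles the top-k selection:

```python
import numpy as np

def top_k_files(query_vec: np.ndarray,
                desc_vecs: np.ndarray,
                k: int = 100) -> list[int]:
    """Rank file descriptions against the query by cosine similarity
    and return the indices of the k best matches. (Illustrative
    sketch; the embedding model itself is external to this function.)"""
    q = query_vec / np.linalg.norm(query_vec)
    d = desc_vecs / np.linalg.norm(desc_vecs, axis=1, keepdims=True)
    sims = d @ q                       # cosine similarity per file
    k = min(k, len(sims))              # take all files if fewer than k
    return np.argsort(-sims)[:k].tolist()
```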


Benchmark Results On DABStep, KramaBench And DA-Code
All main experiments run DS-STAR with Gemini 2.5 Pro as the base LLM and allow up to 20 refinement rounds per task.
On DABStep, the base Gemini 2.5 Pro model alone achieves 12.70 percent accuracy on the hard split. DS-STAR with the same model reaches 45.24 percent on hard tasks and 87.50 percent on easy tasks, an absolute gain of more than 32 percentage points on the hard split, and it outperforms other agents such as ReAct, AutoGen, Data Interpreter, DA-Agent and several commercial systems recorded on the public leaderboard.


The Google research team reports that, compared to the best alternative system on each benchmark, DS-STAR improves overall accuracy from 41.0 percent to 45.2 percent on DABStep, from 39.8 percent to 44.7 percent on KramaBench, and from 37.0 percent to 38.5 percent on DA-Code.


For KramaBench, which requires retrieving relevant files from large domain-specific data lakes, DS-STAR with retrieval and Gemini 2.5 Pro achieves a total normalized score of 44.69. The strongest baseline, DA-Agent with the same model, reaches 39.79.


On DA-Code, DS-STAR again beats DA-Agent. On hard tasks, DS-STAR reaches 37.1 percent accuracy versus 32.0 percent for DA-Agent when both use Gemini 2.5 Pro.
Key Takeaways
DS-STAR shows that practical data science automation needs explicit structure around large language models, not only better prompts. The combination of Aanalyzer, Averifier, Arouter and Adebugger turns free-form data lakes into a controlled Text-to-Python loop that is measurable on DABStep, KramaBench and DA-Code, and portable across Gemini 2.5 Pro and GPT-5. This work moves data agents from table demos toward benchmarked, end-to-end analytics systems.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good.



