Machine Learning Research Scientist, Reasoning — Scale AI

Scale AI · San Francisco · Hybrid

No longer accepting applications

Type: FULL-TIME
Salary: USD 252,000 – 315,000/yr
Posted: 2 months ago

Job Description

Help Shape the Future of AI Reasoning and Agentic Intelligence

Scale AI is hiring a Machine Learning Research Scientist, Reasoning, to advance the next generation of large language model (LLM) reasoning systems, AI agents, and scalable evaluation frameworks.

This role sits at the cutting edge of AI research and practical implementation, focusing on improving reasoning capabilities within LLMs, browser agents, coding agents, software engineering agents, and autonomous AI systems.

You will collaborate with world-class researchers and engineers to develop advanced data strategies, reasoning methodologies, and agentic workflows that accelerate progress toward Artificial General Intelligence (AGI).

This is an ideal opportunity for machine learning researchers, NLP scientists, reasoning specialists, and LLM engineers passionate about frontier AI research, agentic reasoning, and large-scale model development.

About Scale AI

Scale AI is one of the world’s leading AI infrastructure and data platforms, powering advancements in:

Generative AI
Autonomous vehicles
Defense technologies
Enterprise AI systems
Government AI applications
Large language model evaluation

For more than eight years, Scale AI has helped build reliable AI systems through high-quality training data, model evaluation, and scalable AI infrastructure.

Following its recent Series F funding round, Scale AI continues expanding its capabilities to advance AGI development and establish new standards in AI model evaluation and reasoning.

Scale partners with organizations including:

Meta
Cisco
DLA Piper
Mayo Clinic
U.S. government agencies including the Army and Air Force

Role Overview

As a Machine Learning Research Scientist focused on reasoning, you will study and develop the data, architectures, and methodologies necessary to improve advanced LLM reasoning and agent behavior.

You will help define:

High-quality reasoning datasets
Agent evaluation methodologies
Novel planning and reasoning strategies
Scalable data generation systems
Real-world AI deployment workflows

The role combines:

Frontier AI research
Experimental prototyping
Model evaluation
Cross-functional collaboration
Production-oriented machine learning development

You will work closely with engineering and research teams to transform cutting-edge research into scalable AI systems used in real-world applications.

Key Responsibilities

AI Reasoning Research

Conduct advanced research into reasoning capabilities within large language models (LLMs).
Explore novel approaches to planning, tool usage, and agentic reasoning.
Develop methodologies for improving model reasoning quality and reliability.

LLM Agent Development

Build and evaluate:
- Browser agents
- Coding agents
- GUI agents
- Software engineering agents
- Tool-using AI systems
Design evaluation frameworks for autonomous agent behavior.

Data Strategy & Model Evaluation

Identify optimal data sources for reasoning-focused training and evaluation.
Develop scalable data generation and evaluation pipelines.
Improve AI model benchmarking and reasoning assessment systems.

Research Prototyping

Rapidly implement and test new machine learning ideas.
Translate research papers into production-ready prototypes.
Experiment with novel architectures and reasoning workflows.

Cross-Functional Collaboration

Partner with research scientists, ML engineers, and product teams.
Collaborate with external researchers and academic contributors.
Communicate technical findings clearly across teams.

Required Qualifications: Essential Requirements

Practical experience working with large language models (LLMs).
Strong proficiency in:
- PyTorch
- JAX
- TensorFlow
At least 3 years of experience solving complex ML challenges in:
- Research environments
- Applied AI systems
- Product development
Published research in leading AI/ML conferences such as:
- ACL
- EMNLP
- NAACL
- NeurIPS
- ICML
- ICLR
- COLM
Strong understanding of:
- LLM reasoning
- Planning algorithms
- Agentic AI systems
- NLP research methodologies
Excellent written and verbal communication skills.

Preferred Qualifications: Nice-to-Have Experience

Fine-tuning open-source LLMs at scale.
Experience building AI agents using frameworks such as:
- LangGraph
- OpenHands
- Swarm
Familiarity with advanced reasoning methods including:
- STaR
- PLANSEARCH
Experience with:
- Text-to-SQL systems
- Browser automation agents
- Coding assistants
- Tool-use agents
Cloud ML development experience using:
- Amazon Web Services
- Google Cloud

Research Interview Process

Scale AI’s research interviews evaluate:

Machine learning prototyping skills
Model debugging ability
Research depth and reasoning expertise
Cross-functional collaboration
Technical communication

This role does not include LeetCode-style coding assessments.

Compensation & Benefits

Compensation

Base salary range:
- USD 252,000 – 315,000 annually
Compensation may include:
- Equity grants
- Performance incentives
- Comprehensive benefits

Salary depends on:

Experience
Skills
Interview performance
Education
Work location

Benefits Include

Comprehensive health coverage
Dental and vision insurance
Retirement benefits
Learning & development stipend
Generous paid time off
Potential commuter stipend
Equity-based compensation eligibility

Why Join Scale AI?

Work on Frontier AI Problems

Contribute directly to advanced reasoning systems and next-generation LLM agents.

Influence AGI Development

Help define the data and methodologies shaping future artificial intelligence systems.

Collaborate with Elite Researchers

Work alongside world-class researchers, engineers, and AI innovators.

High Research Impact

Publish impactful work while contributing to production-grade AI systems used globally.

Inclusive Workplace

Scale AI is committed to building an inclusive, diverse, and equal opportunity workplace where every employee can thrive.

🧠 AI Insights for this role

Likely Interview Questions

LIKELY QUESTIONS
- How have you improved reasoning performance in LLMs in past work, and what specific techniques did you test, such as fine-tuning, synthetic data generation, STaR-style bootstrapping, search-based methods, or tool-use scaffolding?
- If you were asked to build an evaluation framework for a browser or coding agent at Scale AI, what metrics, task suites, failure taxonomies, and offline versus online evaluation methods would you use?
- Walk me through a research project where you translated a recent paper into a working prototype. What did you implement, what failed initially, and how did you validate the final approach?
- How do you think about the relationship between data quality, reward signals, and model architecture when trying to improve multi-step reasoning reliability?
- Suppose an agent performs well on benchmark tasks but fails unpredictably in real-world deployment. How would you diagnose whether the issue comes from the model, the prompt/policy, the tools, the environment, or the evaluation setup?
- What is your experience with fine-tuning or post-training open-source LLMs at scale, and what training stack, infrastructure, and debugging practices did you use?
- How would you design a scalable pipeline for generating high-quality reasoning data, including task sourcing, annotation, verification, and filtering for hallucinations or shortcut learning?
- Scale values both frontier research and production impact. Can you describe a time when you had to balance scientific rigor with speed of delivery and cross-functional constraints?

BEHAVIOURAL QUESTIONS
- Tell me about a time you disagreed with engineers or product stakeholders on how to evaluate an ML system.
Model approach: Situation - evaluation goals were misaligned across teams; Task - create a rigorous but practical evaluation plan; Action - clarified business risk, defined task-specific metrics and failure modes, proposed a phased eval suite with quick proxy metrics plus deeper audits, aligned stakeholders through written docs and review meetings; Result - team adopted shared evaluation criteria, improved decision speed, and avoided launching with blind spots.

- Describe a project where your initial research direction failed.
Model approach: Situation - early hypothesis on a reasoning method or architecture did not improve target metrics; Task - determine whether to iterate or pivot quickly; Action - ran ablations, checked data quality and implementation bugs, compared against strong baselines, documented negative results, then redirected effort to a better-performing approach; Result - recovered timeline, produced a stronger system or paper, and demonstrated disciplined research judgment.

- Give an example of leading through influence rather than authority.
Model approach: Situation - cross-functional project with researchers, engineers, and annotators or data teams, no direct reporting line; Task - drive consensus on roadmap and technical choices; Action - built credibility with clear experiments, concise memos, and tradeoff analysis, incorporated feedback, and assigned owners around milestones; Result - alignment improved, execution accelerated, and the project shipped or published successfully.

- Tell me about a time you had to explain a complex research finding to a non-specialist audience.
Model approach: Situation - technical result on model reasoning, evaluation, or agent reliability needed stakeholder buy-in; Task - communicate implications clearly without oversimplifying; Action - translated methods into problem-impact language, used visual examples and concrete failure cases, separated confidence from speculation, and recommended next steps; Result - stakeholders understood the decision, approved resources or direction, and trusted the research process.

SMART QUESTIONS TO ASK
- How does Scale currently distinguish progress in benchmark reasoning from progress in real-world agent reliability, and where do you see the biggest evaluation gaps today?
- What are the most important research problems for this team over the next 6 to 12 months: better data, stronger post-training methods, planning/tool use, or more realistic agent benchmarks?
- How are research scientists expected to balance publication-quality work with production delivery, and what does success look like in the first six months?
- What tooling and infrastructure exist today for large-scale data generation, model evaluation, and agent experimentation, and where does the team still face bottlenecks?
- How does collaboration work between research scientists, ML engineers, product teams, and external partners, especially when priorities between frontier research and customer impact diverge?

RED FLAGS TO WATCH FOR
- They cannot clearly explain how success is measured beyond vague statements like "improve reasoning" or "build better agents," suggesting weak research prioritization and unclear evaluation discipline.
- They emphasize speed and shipping but give little detail on dataset quality, annotation rigor, reproducibility, or safety/reliability checks for agent systems.
- Interviewers describe the role as highly research-driven but cannot point to publication support, compute access, ownership boundaries, or how ideas actually move from prototype to deployed system.

Adjacent Career Paths

Roles you'd also qualify for based on this posting's requirements:

LLM Research Scientist — Draws on the same experience with large language models, reasoning methods, and published NLP research.
Applied Scientist, AI Agents — A background in agentic reasoning, tool use, and evaluation fits building practical autonomous AI systems.
Machine Learning Engineer, Generative AI — Hands-on work with PyTorch, JAX, or TensorFlow and production-oriented prototyping aligns with deploying advanced generative models.
NLP Research Scientist — Experience in language modeling, reasoning datasets, and top-tier conference research maps directly to advanced NLP research roles.