Senior Software Engineer, At Scale Compute Analysis — NVIDIA

NVIDIA · Santa Clara · Remote

No longer accepting applications
  • Type: FULL-TIME
  • Salary: USD 152,000 – 241,500/yr
  • Posted: 2 weeks ago
Job Description

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years — a legacy of innovation fuelled by great technology and exceptional people. Today, NVIDIA is tapping into the unlimited potential of AI to define the next era of computing, where GPUs act as the brains of computers, robots, and self-driving cars.

This role sits within a team that analyses large-scale datacenter workloads on GPU-accelerated clusters. You will turn telemetry and workload data into clear findings and visuals, partner with OS, container, GPU, and systems engineers, and apply machine learning and deep learning techniques for categorisation and forecasting, building the results into tools the team actively uses.

Responsibilities

  • Analyse large-scale workloads and infrastructure signals to identify application and platform improvement opportunities
  • Work with high-dimensional data: spot trends, tie changes to known events, summarise conclusions, and communicate results to engineers and leadership
  • Partner with the team to clarify questions, scope analyses, and document methods so others can extend your work
  • Build and maintain practical visualisations and lightweight ML/DL implementations (classification/prediction) inside existing software workflows

Requirements

What we need to see:

  • 5+ years analysing complex datasets, debugging data issues, and communicating trends clearly
  • BS or MS in Engineering, Mathematics, Physics, Computer Science, or equivalent experience
  • Strong proficiency in Python and JavaScript
  • Comfortable owning an analysis end-to-end
  • Hands-on experience with telemetry/observability stacks (e.g. Grafana, Elasticsearch, Splunk)
  • Demonstrated grasp of core ML concepts; quick learner with strong analytical and problem-solving skills
  • Strong collaboration and communication skills

Ways to stand out:

  • Experience with TensorFlow or PyTorch
  • Familiarity with Linux and HPC / large-scale or performance-sensitive environments
  • Experience visualising high-dimensional problems
  • Diligent, action-biased analysis style

What Is Offered

  • Salary: $152,000 – $241,500 USD base (determined by location, experience, and role benchmarks)
  • Eligibility for equity and additional benefits
  • Comprehensive health care coverage including dental and vision
  • 401(k) with company matching and after-tax contributions
  • Employee Stock Purchase Program (ESPP)
  • Employee Assistance Program (EAP)
  • Company-paid holidays, paid sick leave, vacation leave, and professional time off
  • Life and disability protection

How to Apply

Apply online before May 16, 2026. This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. NVIDIA does not discriminate on the basis of race, religion, colour, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

🧠 AI Insights for this role

LIKELY QUESTIONS

- Walk me through a recent end-to-end analysis you owned on a large-scale system, from framing the question to influencing an engineering decision.
- How have you worked with telemetry or observability data from tools like Grafana, Elasticsearch, or Splunk to identify performance or reliability issues?
- This role involves high-dimensional workload data. How do you approach finding meaningful trends, separating signal from noise, and tying changes back to known events?
- Describe a time you uncovered a data quality or instrumentation problem during analysis. How did you debug it, validate the fix, and prevent recurrence?
- How would you design a lightweight classification or forecasting solution for datacenter workload behaviour that engineers would actually trust and use?
- Tell me about your experience using Python and JavaScript together to build analysis workflows, visualisations, or internal tools.
- How have you collaborated with OS, container, GPU, or infrastructure engineers when your analysis challenged assumptions or required changes from multiple teams?
- What experience do you have with Linux, HPC, or performance-sensitive environments, and how has that shaped the way you analyse systems data?

BEHAVIOURAL QUESTIONS

- Tell me about a time you had to make sense of ambiguous or incomplete telemetry data and still produce a useful recommendation.
Model approach:
- Situation: Analysis request had unclear scope and inconsistent instrumentation across services or clusters.
- Task: Produce a reliable view of workload behaviour and recommend next actions despite imperfect data.
- Action: Clarified the decision to support, mapped available signals, documented assumptions, triangulated across multiple sources, flagged confidence levels, and proposed instrumentation fixes.
- Result: Delivered actionable findings, avoided overclaiming, improved telemetry coverage, and enabled a follow-on performance or reliability improvement.

- Describe a time you influenced engineers or leadership using data visualisation and storytelling.
Model approach:
- Situation: Teams had competing theories about a performance regression or capacity issue.
- Task: Present complex, high-dimensional data in a way that drove alignment and action.
- Action: Reduced the problem to a few decision-relevant metrics, built clear visuals showing before/after and cohort differences, linked changes to events or releases, and tailored communication to technical and leadership audiences.
- Result: Stakeholders aligned on root cause or next experiment, work was prioritised, and measurable improvement followed.

- Tell me about a time you found that the original question being asked was not the right one.
Model approach:
- Situation: Team asked for a narrow analysis, but early exploration suggested a different underlying issue.
- Task: Reframe the problem without losing stakeholder trust or momentum.
- Action: Shared evidence early, proposed a sharper hypothesis, scoped a quick validation path, and documented why the revised question better matched the business or engineering decision.
- Result: Team avoided wasted effort, focused on the real bottleneck, and analysis led to a more valuable outcome.

- Give me an example of when you built something practical, not perfect, that became widely used by a team.
Model approach:
- Situation: Engineers needed faster access to workload insights, but existing analysis was manual or fragmented.
- Task: Create a lightweight tool, model, or dashboard embedded in the team workflow.
- Action: Prioritised the highest-value use case, kept the implementation simple, integrated with existing data sources and tools, added documentation, gathered user feedback, and iterated based on adoption.
- Result: Team saved time, analysis became repeatable, usage grew, and the tool informed regular debugging, planning, or forecasting decisions.

SMART QUESTIONS TO ASK

- What are the most important workload or infrastructure questions this team hopes this hire will answer in the first 6 to 12 months?
- How does this team currently partner with OS, container, GPU, and systems engineers, and where do analyses most often get stuck between teams?
- What telemetry stack and data sources are most central today, and where are the biggest gaps in instrumentation or data quality?
- When you say lightweight ML or DL inside existing workflows, what does successful production use look like here: offline decision support, near-real-time classification, forecasting for capacity, or something else?
- How do you measure success for this role: adoption of tools, quality of insights, performance improvements, forecasting accuracy, or influence on engineering roadmap decisions?

RED FLAGS TO WATCH FOR

- They cannot clearly explain what decisions this role supports, who the main stakeholders are, or what success looks like beyond "analyse data."
- They describe poor data quality, fragmented ownership, or missing telemetry but show no plan, investment, or appetite to improve the foundations.
- They emphasise advanced ML or flashy dashboards, but cannot point to real engineering adoption, operational impact, or cross-functional support for acting on findings.

Adjacent Career Paths

Roles you'd also qualify for based on this posting's requirements:

  • Senior Data Scientist, Infrastructure Analytics — This role matches the mix of large-scale telemetry analysis, ML-based forecasting, and clear communication to technical stakeholders.
  • Performance Engineer, GPU or HPC Systems — The job's focus on datacenter workloads, Linux/HPC environments, and identifying platform improvement opportunities aligns well with performance engineering.
  • Observability Engineer — Hands-on experience with Grafana, Elasticsearch, Splunk, and turning infrastructure signals into actionable insights maps directly to observability work.
  • Machine Learning Engineer, Applied Analytics — Building lightweight ML/DL models inside production workflows with Python and collaboration across engineering teams is a strong fit for applied ML engineering.
