About zymtrace
Organizations spend billions on GPU infrastructure to power AI, yet roughly 60–65% of that investment is lost to underutilized hardware, idle cycles, and inefficient workloads. The problem isn’t the teams running these workloads; it’s that existing tools treat GPUs as black boxes, showing surface-level metrics without revealing where the waste actually lives.
zymtrace is a distributed AI infrastructure optimization platform that gives customers deep, always-on visibility into GPU-accelerated workloads across entire clusters. We trace from PyTorch and JAX code through CUDA kernels down to individual GPU instructions and stall reasons, then correlate everything back to the CPU code that triggered it. Zero code changes. Zero guesswork.
We work with leading AI labs, Fortune 500 companies, and research firms to debug and optimize their most demanding workloads.
Our founders pioneered and open-sourced the eBPF continuous CPU profiler and contributed it to OpenTelemetry; it now runs in production at Grafana, Datadog, IBM, Cisco, and others.
About the Role
You will be part of the team building an AI agent that autonomously optimizes GPU workloads across the full stack, from user-space Python down to GPU kernel fine-tuning. That spans Python (PyTorch/JAX), Rust, C++, GPU kernels, and hardware primitives.
That means working across the entire execution path: PyTorch code, CUDA dispatch, kernel behavior, memory access patterns, and stall reasons. You will take real profiling data, identify what is slow and why, generate a fix, and verify the result.
Profiling data and metrics are the context that guides every decision the agent makes. That is what separates this from generic AI optimization: the agent is not guessing, it is making optimization decisions from ground truth.
We have a dedicated team of low-level engineers who efficiently generate that data using a combination of eBPF and other low-level instrumentation techniques. Your role is to take that rich raw telemetry as context and build the best-in-class autonomous optimization layer on top of it. When needed, you will jump on calls with customers to validate findings.
You might be a good fit if you
- Have built agentic or automated research pipelines: systems that gather context, reason over it, and produce actionable output with minimal human input
- Understand ML performance well enough to know what a bottleneck looks like and what class of fix is likely to help, including a solid understanding of CUDA: kernel execution, memory hierarchy, and how GPU workloads behave at the hardware level
- Have worked with LLMs as reasoning engines, not just generation tools: tool use, multi-step planning, verification loops
- Understand why profile-guided optimization is fundamentally different from heuristic or static analysis approaches, and can build systems that use real execution data as the source of truth
- Care about correctness and verification: an optimization that can’t be confirmed to work is not an optimization
Strong candidates may also have experience with
- Agentic frameworks: building multi-step agents with tool use, reflection, or self-correction loops
- GPU performance: CUDA, PyTorch internals, kernel optimization patterns, or distributed training bottlenecks
- Automated code generation and patching: applying LLM-generated changes safely in real codebases
- Profiling and tracing systems: OpenTelemetry, eBPF, flamegraphs, or GPU-specific profiling formats
- Evaluation and benchmarking: designing repeatable tests to measure whether an optimization actually worked
Why Join zymtrace?
- The problem is wide open. Autonomous GPU optimization is largely unsolved. You will not be implementing a spec. You will be figuring out what the right approach even is.
- You will shape the product. This capability is central to where zymtrace is going; what you build will be core to the product.
- Your work ships fast. No committees. You will see your impact in production quickly, get direct feedback, and iterate from there.
- World-class teammates. You will work alongside engineers who built the eBPF profiler now used across the OpenTelemetry ecosystem, created disassemblers used in Firefox and WebKit, and previously worked on compilers and kernels at companies like Google.
Benefits
- Competitive salary and meaningful equity
- 401(k) plan (or your country’s equivalent)
- Comprehensive health insurance
- Remote-first company