Field CTO — AI Infrastructure Performance

Fully Remote (Global)

About Zymtrace

Organizations spend billions on GPU infrastructure to power AI, yet roughly 60–65% of that investment is lost to underutilized hardware, idle cycles, and inefficient workloads. The problem is not the teams running these systems. It is that existing tools treat GPUs as black boxes, exposing surface-level metrics without revealing where performance is actually being lost.

Zymtrace is a distributed AI infrastructure optimization platform that provides deep, continuous visibility into general-purpose and GPU-accelerated workloads across entire clusters. We trace execution from PyTorch and JAX code through CUDA kernel launches all the way down to individual GPU instructions and stall reasons, then correlate everything back to the CPU code that triggered it. Zero code changes. No guesswork.

We work with leading AI labs, Fortune 500 companies, and research firms to debug and optimize their most demanding workloads. Read the anam.ai case study.

Our founders helped pioneer the eBPF continuous CPU profiler, open-source it, and contribute it to OpenTelemetry. That technology is now used in production across Grafana, Datadog, IBM, Cisco, and others. At zymtrace, we are applying that same systems-level engineering approach to GPU workloads.

About the Role

You will serve as the technical bridge between customers, engineering, and product.

This role combines deep systems expertise with leadership. You will work directly with teams running large-scale AI workloads, helping them understand where their infrastructure is losing performance and how to fix it.

Your work in the field will have a direct impact on the product roadmap. You will identify patterns across deployments, surface the hardest problems customers face, and help shape the capabilities we build next.

You will also act as a trusted advisor to customers and a technical voice for zymtrace within the broader AI infrastructure ecosystem.

Key Responsibilities

  • Lead complex GPU and AI infrastructure performance investigations with customers
  • Work alongside engineers to analyze profiling data and identify root causes of performance bottlenecks
  • Translate field insights into clear product direction and roadmap priorities
  • Help customers architect more efficient training and inference systems
  • Develop optimization playbooks and internal frameworks that scale performance expertise across the team
  • Partner closely with engineering to validate new capabilities against real-world workloads
  • Represent zymtrace in technical engagements with customers, partners, and the broader AI infrastructure community

What We’re Looking For

You should be familiar with most of the following:

  • GPU workloads such as CUDA kernels, memory access patterns, kernel fusion, or distributed training performance
  • GPU hardware fundamentals including SM utilization, memory hierarchies, warp scheduling, and stall reasons
  • Profiling tools and performance diagnostics such as trace analysis, hardware counters, and performance telemetry
  • AI/ML frameworks such as PyTorch, JAX, or similar
  • HPC and distributed systems concepts including MPI, NCCL, or multi-node GPU clusters
  • NVIDIA performance tooling such as Nsight Compute, Nsight Systems, DCGM, or NVML
  • Inference runtimes such as vLLM, SGLang, or TensorRT-LLM
  • Clear communication of complex performance concepts to engineers and technical leaders
  • Environments where customer interaction and engineering feedback loops are closely connected

Why Join Zymtrace?

  • Work at the frontier. GPU optimization is one of the hardest and most consequential problems in modern computing. The customers you work with are running some of the most demanding AI workloads in the world.
  • Shape the product from the field. Your insights from real workloads will directly influence the product roadmap.
  • World-class teammates. You will work alongside engineers who built the eBPF profiler now used across the OpenTelemetry ecosystem, created disassemblers used in Firefox and WebKit, and previously worked on compilers and kernels at companies like Google.
  • Real customer impact. Our customers include leading AI labs and Fortune 500 companies. The optimizations you surface will translate into faster models, lower costs, and millions of GPU-hours reclaimed.

Benefits

  • Competitive salary and meaningful equity
  • 401(k) plan
  • Comprehensive health, dental, and vision insurance
  • Remote-first company (may require travel to customer sites and conferences)
  • Annual learning and development budget

If you enjoy helping teams understand and optimize complex AI infrastructure, we would love to hear from you.