Field CTO — AI Infrastructure Performance

Fully Remote (Global)

About Zymtrace

Organizations spend billions on GPU infrastructure to power AI, yet roughly 60–65% of that investment is lost to underutilized hardware, idle cycles, and inefficient workloads. The problem is not the teams running these systems. It is that existing tools treat GPUs as black boxes, exposing surface-level metrics without revealing where performance is actually being lost.

Zymtrace is a distributed AI infrastructure optimization platform that provides deep, continuous visibility into general-purpose and GPU-accelerated workloads across entire clusters. We trace execution from PyTorch and JAX code through CUDA kernel launches all the way down to individual GPU instructions and stall reasons, then correlate everything back to the CPU code that triggered it. Zero code changes. No guesswork.

We work with leading AI labs, Fortune 500 companies, and research firms to debug and optimize their most demanding workloads. Read the anam.ai case study.

Our founders helped pioneer and open-source the eBPF continuous CPU profiler and contribute it to OpenTelemetry. That technology is now used in production across Grafana, Datadog, IBM, Cisco, and others. At zymtrace, we are applying that same systems-level engineering approach to GPU workloads.

About the Role

You will serve as the technical bridge between customers, engineering, and product.

This role combines deep systems expertise with leadership. You will work directly with teams running large-scale AI workloads, helping them understand where their infrastructure is losing performance and how to fix it.

Your work in the field will have a direct impact on the product roadmap. You will identify patterns across deployments, surface the hardest problems customers face, and help shape the capabilities we build next.

You will also act as a trusted advisor to customers and a technical voice for zymtrace within the broader AI infrastructure ecosystem.

Key Responsibilities

Lead complex GPU and AI infrastructure performance investigations with customers
Work alongside engineers to analyze profiling data and identify root causes of performance bottlenecks
Translate field insights into clear product direction and roadmap priorities
Help customers architect more efficient training and inference systems
Develop optimization playbooks and internal frameworks that scale performance expertise across the team
Partner closely with engineering to validate new capabilities against real-world workloads
Represent zymtrace in technical engagements with customers, partners, and the broader AI infrastructure community

What We’re Looking For

You should have experience with most of the following:

GPU workloads such as CUDA kernels, memory access patterns, kernel fusion, or distributed training performance
GPU hardware fundamentals including SM utilization, memory hierarchies, warp scheduling, and stall reasons
Profiling tools and performance diagnostics such as trace analysis, hardware counters, and performance telemetry
AI/ML frameworks such as PyTorch, JAX, or similar
HPC and distributed systems concepts including MPI, NCCL, or multi-node GPU clusters
NVIDIA performance tooling such as Nsight Compute, Nsight Systems, DCGM, or NVML
Inference runtimes such as vLLM, SGLang, or TensorRT-LLM
Communicating complex performance concepts clearly to engineers and technical leaders
Working in environments where customer interaction and engineering feedback loops are closely connected

Why Join Zymtrace?

Work at the frontier. GPU optimization is one of the hardest and most consequential problems in modern computing. The customers you work with are running some of the most demanding AI workloads in the world.
Shape the product from the field. Your insights from real workloads will directly influence the product roadmap.
World-class teammates. You will work alongside engineers who built the eBPF profiler now used across the OpenTelemetry ecosystem, created disassemblers used in Firefox and WebKit, and previously worked on compilers and kernels at companies like Google.
Real customer impact. Our customers include leading AI labs and Fortune 500 companies. The optimizations you surface will translate into faster models, lower costs, and millions of GPU-hours reclaimed.

Benefits

Competitive salary and meaningful equity
401(k) plan
Comprehensive health, dental, and vision insurance
Remote-first company (may require travel to customer sites and conferences)
Annual learning and development budget

If you enjoy helping teams understand and optimize complex AI infrastructure, we would love to hear from you.
