
Inference & Fine-tuning Engineer (Models, Performance)

Fully Remote (Global)

About Zymtrace

Organizations spend billions on GPU infrastructure to power AI, yet roughly 60–65% of that investment is lost to underutilized hardware, idle cycles, and inefficient workloads. The problem is not the teams running these systems. It is that most tooling treats GPUs as black boxes, exposing surface-level metrics without revealing where performance is actually being lost.

Zymtrace is a distributed AI infrastructure optimization platform that provides deep, continuous visibility into general-purpose and GPU-accelerated workloads across entire clusters. We trace execution from PyTorch and JAX code through CUDA kernel launches all the way down to individual GPU instructions and stall reasons, then correlate everything back to the CPU code that triggered it. Zero code changes. No guesswork.

We work with leading AI labs, Fortune 500 companies, and research firms to debug and optimize their most demanding workloads. Read the anam.ai case study.

Our founders helped pioneer the eBPF continuous CPU profiler, open-sourced it, and contributed it to OpenTelemetry. That technology is now used in production across Grafana, Datadog, IBM, Cisco, and others. At Zymtrace, we are applying that same systems-level engineering approach to GPU workloads.

We are a team of kernel hackers and systems programmers working at the deepest layers of the stack: GPUs, CUDA runtimes, eBPF, compilers, and instruction-level performance analysis. By joining Zymtrace, you will work at the frontier of modern computing, helping organizations optimize AI training and inference workloads at massive scale.

About the Role

Do you enjoy working close to the hardware and optimizing model serving stacks?

This role sits at the intersection of model performance engineering and systems programming.

We are looking for an Inference and Fine-tuning Engineer to own the inference performance layer at Zymtrace. You will benchmark and optimize inference engines, run fine-tuning experiments, and work directly with customers to reproduce and understand real-world workloads where significant performance headroom remains.

You will work closely with our profiling technology to close the loop between what Zymtrace surfaces and the concrete optimizations required in the model, runtime, or serving stack. You will also have direct input into the roadmap for inference observability and optimization features.

Key Responsibilities

  • Benchmark, profile, and optimize inference pipelines across engines including SGLang, vLLM, and NVIDIA Dynamo-Triton
  • Run fine-tuning experiments and evaluate model performance under realistic production constraints
  • Identify bottlenecks across the inference stack, including batching strategies, KV cache pressure, memory bandwidth, and CUDA kernel efficiency
  • Work directly with customers to reproduce and diagnose performance issues using Zymtrace profiling data
  • Collaborate with engineering to improve Zymtrace’s inference observability features based on real workload patterns
  • Document optimization techniques, performance findings, and best practices for the broader team and community

Example Problems You Might Work On

  • A customer is running vLLM on an H100 cluster but achieving only 40% GPU utilization. Using Zymtrace flamegraphs and stall analysis, determine whether the bottleneck is KV cache pressure, scheduling inefficiencies, or CUDA kernel launch overhead.

  • Diagnose why a distributed inference service shows large latency spikes under burst traffic. Determine whether the root cause lies in batching behavior, GPU memory fragmentation, or CPU-side scheduling.

  • Benchmark the performance differences between vLLM, SGLang, and TensorRT-LLM on the same model and hardware configuration, identifying where each runtime leaves performance on the table.

  • Investigate GPU stall reasons exposed by Zymtrace (memory dependency, warp scheduling, memory bandwidth) and translate those signals into concrete model or runtime optimizations.

  • Reproduce a customer workload locally and experiment with optimizations such as Flash Attention, KV cache strategies, or quantization methods (FP8, INT4) to improve tokens/sec throughput.

  • Use Zymtrace traces to correlate CPU scheduling behavior with GPU kernel launch gaps and propose improvements in the serving stack.

What We’re Looking For

You should be familiar with most of the following:

  • Running or optimizing LLM inference workloads (model serving, batching, quantization, throughput tuning)
  • Inference runtimes such as vLLM, SGLang, TensorRT-LLM, or NVIDIA Triton
  • GPU execution and memory hierarchies, and common inference bottlenecks
  • PyTorch, plus familiarity with JAX or other ML frameworks
  • Reading profiling data and flamegraphs to diagnose issues across the CPU–GPU boundary
  • Performance tooling such as Nsight Systems, DCGM, or NVML
  • Distributed GPU systems (NCCL, NVLink, tensor or pipeline parallelism)
  • Model optimization techniques such as LoRA, QLoRA, or quantization (GPTQ, AWQ, FP8, INT4)
  • Systems engineering, HPC, or performance-focused software development
  • Low-overhead production profiling approaches such as eBPF

Why Join Zymtrace?

  • Work at the frontier. GPU optimization is one of the hardest and most consequential problems in modern computing. You will help customers extract maximum performance from the hardware powering modern AI.
  • Shape the company. This is an early-stage team. Your work will directly influence the product, inference observability roadmap, and engineering culture.
  • World-class teammates. You will work alongside engineers who built the eBPF profiler now used across the OpenTelemetry ecosystem, created disassemblers used in Firefox and WebKit, and previously worked on compilers and kernels at companies like Google.
  • Real customer impact. Our customers include leading AI labs and Fortune 500 companies. The work you do will directly translate into faster models, lower costs, and more efficient infrastructure.

Benefits

  • Competitive salary and meaningful equity
  • 401(k) plan
  • Comprehensive health, dental, and vision insurance
  • Remote-first company (with occasional travel)
  • Annual learning and development budget

If you enjoy understanding why GPUs stall and turning that insight into real performance wins, we would love to hear from you.
