Cluster-wide GPU Observability

GPU Profiling

Maximize throughput per GPU per watt

Identify performance bottlenecks in CUDA kernels, optimize inference batch size, and eliminate idle GPU cycles with zero friction. Get deep cluster-wide visibility into CPU⇄GPU interactions that traditional tools miss.

Do more with fewer GPUs

zymtrace AI flamegraph showing DeepSeek-R1-Distill-Qwen inference via vLLM

End-to-end GPU profiling: DeepSeek-R1 inference via vLLM with CPU-to-GPU correlation

Platform Support

Supported AI/ML Platforms

zymtrace works seamlessly with your existing GPU and ML infrastructure across all major frameworks

JAX
Key Use Cases

Maximize GPU Efficiency

Debug performance issues faster and do more with fewer GPUs

Underutilized GPUs lead to longer training cycles, costly inference, and wasted energy. zymtrace pinpoints inefficiencies by profiling CUDA kernels, disassembling them down to SASS instructions, exposing GPU stall reasons, and correlating them back to the CPU traces that launched them.

Optimize CUDA Kernels

Identify kernel fusion opportunities and eliminate redundant operations. In one case, this surfaced a 300% speed-up.

Kernel Launch Analysis

Find Optimal Batch Size

Discover the "Sweet Spot" between memory-bound and compute-bound performance.

Inference Batch Optimization

Fix GPU Utilization

Detect GPU stalls and CPU bottlenecks with end-to-end visibility across your pipeline.

Stall Reason Analysis

Detect Performance Patterns

Start with monitoring, then drill down into detailed profiles.

Real-Time GPU Monitoring
Inference Profiling

Cut Inference Costs

Inference runs 24/7 in production—small inefficiencies become massive costs. Find the optimal batch size, eliminate GPU stalls, and maximize throughput per dollar by optimizing your inference pipelines on both CPU and GPU.

Up to 10x
Throughput improvement
20-40%
Cost reduction
Better
GPU utilization
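The memory-bound vs. compute-bound trade-off behind the "sweet spot" can be illustrated with a toy roofline-style model. All numbers below are invented for illustration; they are not zymtrace measurements or figures from any real GPU:

```python
# Toy roofline-style model: at small batch sizes throughput grows with the
# batch (memory-bandwidth-bound), until it hits the compute ceiling.
# All constants are illustrative, not measurements from any real hardware.

def tokens_per_second(batch_size: int) -> float:
    memory_bound = 50.0 * batch_size  # bandwidth-limited regime
    compute_bound = 3200.0            # compute-limited ceiling
    return min(memory_bound, compute_bound)

# The smallest batch size that saturates compute is the sweet spot:
# growing the batch further adds latency without adding throughput.
sweet_spot = min(b for b in range(1, 257) if tokens_per_second(b) == 3200.0)
print(sweet_spot, tokens_per_second(sweet_spot))  # 64 3200.0
```

A real sweep would measure tokens per second on the live inference engine; the toy model just shows why the throughput curve flattens and why pushing the batch past the knee only adds latency.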

Supported Inference Engines

vLLM

Ollama

llama.cpp

SGLang

+ Any CUDA-enabled application, including Python frameworks that use CUDA

The Architecture

How the zymtrace GPU profiler works

Built from the ground up as a continuous profiler for heterogeneous workloads, zymtrace correlates CPU and GPU execution across your entire cluster, providing unified visibility into distributed workloads.

zymtrace GPU profiler architecture diagram showing CUDA injection and eBPF correlation

zymtrace GPU profiler architecture

1

Launch your CUDA application

Launch your CUDA-enabled AI workloads with the zymtrace GPU profiler attached. No code changes required.

2

eBPF-Based Profiling

The zymtrace profiler detects CUDA launches and places uprobes on exported functions. These uprobes invoke CPU stack unwinders using the same eBPF technology that powers our CPU profiling, capturing the complete call stack from user code to kernel launch.
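As a rough analogy for what the uprobe does, here is a Python wrapper that snapshots the caller's stack whenever a launch function is invoked. The real mechanism is an eBPF program firing on a native CUDA entry point; `kernel_launch` below is a hypothetical stand-in, not an actual CUDA API:

```python
import traceback

def record_launch_stacks(launch_fn):
    """Wrap a launch function so each call snapshots the caller's stack,
    loosely mimicking the eBPF unwinder that fires when the uprobe hits."""
    captured = []
    def wrapper(*args, **kwargs):
        # Capture the call stack at launch time, dropping the wrapper's own frame.
        captured.append([frame.name for frame in traceback.extract_stack()[:-1]])
        return launch_fn(*args, **kwargs)
    wrapper.captured = captured
    return wrapper

def kernel_launch(name):  # hypothetical stand-in for a native launch call
    return f"launched {name}"

kernel_launch = record_launch_stacks(kernel_launch)

def attention_layer():
    kernel_launch("flash_attention")

def forward_pass():
    attention_layer()

forward_pass()
# The innermost frames show exactly which code path launched the kernel:
print(kernel_launch.captured[0][-2:])  # ['forward_pass', 'attention_layer']
```

The production profiler does this with native stack unwinding in eBPF, so it works on unmodified binaries with no wrapper code at all.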

3

GPU Instruction Sampling

Our CUDA profiler samples high-granularity information about GPU instructions (SASS) running on compute cores and identifies stall reasons that prevent kernels from making progress. This data reveals exactly why kernels are slow at the microarchitectural level.
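The kind of question this data answers can be sketched with a small aggregation over hypothetical samples. The stall-reason names follow NVIDIA's conventions (e.g. a long-scoreboard stall means a warp is waiting on a memory dependency), but the sample records themselves are invented:

```python
from collections import Counter

# Hypothetical per-sample records: (kernel_name, stall_reason).
# None means the sampled warp issued an instruction instead of stalling.
samples = [
    ("gemm_kernel", "long_scoreboard"),  # waiting on a memory dependency
    ("gemm_kernel", "long_scoreboard"),
    ("gemm_kernel", None),
    ("softmax_kernel", "barrier"),       # waiting at a block-wide barrier
    ("softmax_kernel", None),
]

def stall_breakdown(kernel: str) -> dict:
    """Fraction of samples for `kernel` spent in each stall reason."""
    stalls = Counter(r for k, r in samples if k == kernel and r is not None)
    total = sum(1 for k, _ in samples if k == kernel)
    return {reason: count / total for reason, count in stalls.items()}

# Two thirds of gemm_kernel samples stalled on memory dependencies.
print(stall_breakdown("gemm_kernel"))
```

A breakdown like this points directly at the fix: a kernel dominated by memory-dependency stalls needs better data locality or coalescing, not more compute.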

4

Unified Stack Trace Generation

Host-side stack traces and GPU execution data are merged in the zymtrace profiler, creating unified traces that span from user-mode PyTorch/JAX code through the CUDA runtime down to individual GPU instructions and stall reasons. These are visualized as interactive flamegraphs in our UI.
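Conceptually, the merge concatenates the host stack captured at launch with the GPU-side frames sampled for that kernel, so a single flamegraph path spans both sides. The frame strings below are illustrative, not zymtrace's internal format:

```python
def merge_trace(host_stack: list[str], gpu_frames: list[str]) -> list[str]:
    """Append GPU frames to the host stack that launched the kernel,
    producing one unified trace for the flamegraph."""
    return host_stack + gpu_frames

# Illustrative frames only: host side captured at launch time...
host = [
    "python: train.py:step",
    "torch: aten::matmul",
    "libcudart: cudaLaunchKernel",
]
# ...and GPU side sampled while the kernel ran.
gpu = [
    "kernel: ampere_sgemm_128x64",
    "SASS: LDG.E.128 (stall: long_scoreboard)",
]

unified = merge_trace(host, gpu)
print(" -> ".join(unified))
```

Reading one merged path bottom-to-top answers both halves of the question at once: which user code launched the kernel, and what that kernel was doing on the GPU when it was sampled.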


OpenTelemetry Compliant

zymtrace is OpenTelemetry compliant, including support for OTEL resource attributes.
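For example, resource attributes can be supplied through the standard OpenTelemetry environment variables; the service name and attribute values below are placeholders:

```shell
# Standard OpenTelemetry resource-attribute environment variables.
# Values are placeholders; set them to match your own deployment.
export OTEL_SERVICE_NAME="llm-inference"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod,service.version=1.4.2"
```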

Fun Fact

The zymtrace team was part of the group that pioneered, open-sourced, and donated the eBPF profiler to OpenTelemetry. With zymtrace, we're extending that same low-level engineering to GPU-bound workloads, building a highly scalable profiling platform purpose-built for today's distributed, heterogeneous environments, spanning both general-purpose and AI-accelerated workloads.

FAQ

Frequently Asked Questions

How is GPU profiling different from metrics, logs, and traces?

Metrics, logs, and traces are analogous to monitoring the vital signs of the human body: they provide general information about health and performance, such as body temperature, weight, and heart rate, along with records of the events leading up to symptoms. The zymtrace GPU profiler is more like an X-ray, or better, an MRI scan: it shows the inner workings of the body and how different systems interact, surfacing issues that are not visible from macro-level indicators alone.

GPU profiling adds a breadth and depth of visibility that uncovers the unknown-unknowns of your CUDA workloads. This system-wide visibility replaces guesswork and gets you straight to the "why" questions: why are we spending x% of our GPU cycles on kernel y? Why is memory bandwidth underutilized? Which CUDA kernels consume the most power across our entire GPU fleet?

How does zymtrace compare to NVIDIA Nsight Compute?

Unlike heavyweight tools like Nsight Compute, which profile single workstations, zymtrace provides continuous, always-on GPU profiling across your entire cluster with zero friction. Combined with our GPU monitoring capabilities, you get both high-level infrastructure visibility and deep kernel-level analysis at scale. We build complete stack traces from PyTorch/JAX code down to SASS instructions, correlating CPU and GPU execution seamlessly in production environments.

How does zymtrace capture GPU profiles?

zymtrace uses eBPF-based profiling combined with CUDA profiling to capture kernel launches and completions. We place uprobes on CUDA functions and merge stack traces with GPU instruction samples, creating unified traces that span from user-mode code to individual GPU instructions and stall reasons.

Can zymtrace help me optimize my CUDA kernels?

Yes. zymtrace identifies kernel fusion opportunities, inefficient memory-transfer patterns, and optimal batch sizes. We've helped users achieve up to 30% performance improvements, with one case showing a 300% speed-up by identifying torch.compile optimization opportunities.

Does zymtrace work in air-gapped, on-premises environments?

Absolutely. zymtrace runs fully on-premises and works in air-gapped environments. All processing and analysis happens locally, with no data leaving your infrastructure. Setup takes about 5 minutes with standard deployment options.

Which GPUs and frameworks are supported?

zymtrace currently focuses on NVIDIA GPUs with CUDA support, including data center GPUs (H100, A100) and consumer cards. We support PyTorch, JAX, and custom CUDA applications. For inference workloads, we support vLLM, Ollama, llama.cpp, and other CUDA-based inference engines. If you need AMD or TPU support, please contact us.

Ready to Optimize Your GPU Workloads?

Get started with GPU profiling in minutes and unlock the full potential of your hardware.

Start Free Trial