Cluster-wide GPU Observability

GPU Profiling

Maximize throughput per GPU per watt

Identify performance bottlenecks in CUDA kernels, optimize inference batch size, and eliminate idle GPU cycles with zero friction. Get deep cluster-wide visibility into CPU⇄GPU interactions that traditional tools miss.

Do more with fewer GPUs

zymtrace AI flamegraph showing DeepSeek-R1-Distill-Qwen inference via vLLM

End-to-end GPU profiling: DeepSeek-R1 inference via vLLM with CPU-to-GPU correlation

Platform Support

Supported AI/ML Platforms

zymtrace works seamlessly with your existing GPU and ML infrastructure across all major frameworks

JAX
Key Use Cases

Maximize GPU Efficiency

Debug performance issues faster and do more with fewer GPUs

Underutilized GPUs lead to longer training cycles, costly inference, and wasted energy. zymtrace pinpoints inefficiencies by profiling CUDA kernels, disassembling them down to SASS instructions, exposing GPU stall reasons, and correlating them back to the CPU traces that launched them.

Optimize CUDA Kernels

Identify kernel fusion opportunities and eliminate redundant operations. In one case, this surfaced a 300% speed-up.

Kernel Launch Analysis

Find Optimal Batch Size

Discover the "Sweet Spot" between memory-bound and compute-bound performance.

Inference Batch Optimization

Fix GPU Utilization

Detect GPU stalls and CPU bottlenecks with end-to-end visibility across your pipeline.

Stall Reason Analysis

Detect Performance Patterns

Start with monitoring, then drill down into detailed profiles.

Real-Time GPU Monitoring
Inference Profiling

Cut Inference Costs

Inference runs 24/7 in production—small inefficiencies become massive costs. Find the optimal batch size, eliminate GPU stalls, and maximize throughput per dollar by optimizing your inference pipelines on both CPU and GPU.

Up to 10x
Throughput improvement
20-40%
Cost reduction
Better
GPU utilization
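The memory-bound vs. compute-bound trade-off behind the "sweet spot" can be illustrated with a toy roofline-style model. All numbers below are invented for illustration; they are not zymtrace measurements or figures from any real GPU:

```python
# Toy roofline-style model: at small batch sizes throughput grows with the
# batch (memory-bandwidth-bound), until it hits the compute ceiling.
# All constants are illustrative, not measurements from any real hardware.

def tokens_per_second(batch_size: int) -> float:
    memory_bound = 50.0 * batch_size  # bandwidth-limited regime
    compute_bound = 3200.0            # compute-limited ceiling
    return min(memory_bound, compute_bound)

# The smallest batch size that saturates compute is the sweet spot:
# growing the batch further adds latency without adding throughput.
sweet_spot = min(b for b in range(1, 257) if tokens_per_second(b) == 3200.0)
print(sweet_spot, tokens_per_second(sweet_spot))  # 64 3200.0
```

A real sweep would measure tokens per second on the live inference engine; the toy model just shows why the throughput curve flattens and why pushing the batch past the knee only adds latency.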

Supported Inference Engines

vLLM

Ollama

llama.cpp

SGLang

+ Any CUDA-enabled application, including Python frameworks that use CUDA

The Architecture

How the zymtrace GPU profiler works

Built from the ground up as a continuous profiler for heterogeneous workloads, zymtrace correlates CPU and GPU execution across your entire cluster, providing unified visibility into distributed workloads.

zymtrace GPU profiler architecture diagram showing CUDA injection and eBPF correlation

zymtrace GPU profiler architecture

1

Launch your CUDA application

Launch your CUDA-enabled AI workloads with the zymtrace GPU profiler attached. No code changes required.

2

eBPF-Based Profiling

The zymtrace profiler detects CUDA launches and places uprobes on exported functions. These uprobes invoke CPU stack unwinders using the same eBPF technology that powers our CPU profiling, capturing the complete call stack from user code to kernel launch.
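As a rough analogy for what the uprobe does, here is a Python wrapper that snapshots the caller's stack whenever a launch function is invoked. The real mechanism is an eBPF program firing on a native CUDA entry point; `kernel_launch` below is a hypothetical stand-in, not an actual CUDA API:

```python
import traceback

def record_launch_stacks(launch_fn):
    """Wrap a launch function so each call snapshots the caller's stack,
    loosely mimicking the eBPF unwinder that fires when the uprobe hits."""
    captured = []
    def wrapper(*args, **kwargs):
        # Capture the call stack at launch time, dropping the wrapper's own frame.
        captured.append([frame.name for frame in traceback.extract_stack()[:-1]])
        return launch_fn(*args, **kwargs)
    wrapper.captured = captured
    return wrapper

def kernel_launch(name):  # hypothetical stand-in for a native launch call
    return f"launched {name}"

kernel_launch = record_launch_stacks(kernel_launch)

def attention_layer():
    kernel_launch("flash_attention")

def forward_pass():
    attention_layer()

forward_pass()
# The innermost frames show exactly which code path launched the kernel:
print(kernel_launch.captured[0][-2:])  # ['forward_pass', 'attention_layer']
```

The production profiler does this with native stack unwinding in eBPF, so it works on unmodified binaries with no wrapper code at all.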

3

GPU Instruction Sampling

Our CUDA profiler samples high-granularity information about GPU instructions (SASS) running on compute cores and identifies stall reasons that prevent kernels from making progress. This data reveals exactly why kernels are slow at the microarchitectural level.
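The kind of question this data answers can be sketched with a small aggregation over hypothetical samples. The stall-reason names follow NVIDIA's conventions (e.g. a long-scoreboard stall means a warp is waiting on a memory dependency), but the sample records themselves are invented:

```python
from collections import Counter

# Hypothetical per-sample records: (kernel_name, stall_reason).
# None means the sampled warp issued an instruction instead of stalling.
samples = [
    ("gemm_kernel", "long_scoreboard"),  # waiting on a memory dependency
    ("gemm_kernel", "long_scoreboard"),
    ("gemm_kernel", None),
    ("softmax_kernel", "barrier"),       # waiting at a block-wide barrier
    ("softmax_kernel", None),
]

def stall_breakdown(kernel: str) -> dict:
    """Fraction of samples for `kernel` spent in each stall reason."""
    stalls = Counter(r for k, r in samples if k == kernel and r is not None)
    total = sum(1 for k, _ in samples if k == kernel)
    return {reason: count / total for reason, count in stalls.items()}

# Two thirds of gemm_kernel samples stalled on memory dependencies.
print(stall_breakdown("gemm_kernel"))
```

A breakdown like this points directly at the fix: a kernel dominated by memory-dependency stalls needs better data locality or coalescing, not more compute.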

4

Unified Stack Trace Generation

Host-side stack traces and GPU execution data are merged in the zymtrace profiler, creating unified traces that span from user-mode PyTorch/JAX code through the CUDA runtime down to individual GPU instructions and stall reasons. These are visualized as interactive flamegraphs in our UI.
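Conceptually, the merge concatenates the host stack captured at launch with the GPU-side frames sampled for that kernel, so a single flamegraph path spans both sides. The frame strings below are illustrative, not zymtrace's internal format:

```python
def merge_trace(host_stack: list[str], gpu_frames: list[str]) -> list[str]:
    """Append GPU frames to the host stack that launched the kernel,
    producing one unified trace for the flamegraph."""
    return host_stack + gpu_frames

# Illustrative frames only: host side captured at launch time...
host = [
    "python: train.py:step",
    "torch: aten::matmul",
    "libcudart: cudaLaunchKernel",
]
# ...and GPU side sampled while the kernel ran.
gpu = [
    "kernel: ampere_sgemm_128x64",
    "SASS: LDG.E.128 (stall: long_scoreboard)",
]

unified = merge_trace(host, gpu)
print(" -> ".join(unified))
```

Reading one merged path bottom-to-top answers both halves of the question at once: which user code launched the kernel, and what that kernel was doing on the GPU when it was sampled.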


OpenTelemetry Compliant

zymtrace is OpenTelemetry compliant, including support for OTEL resource attributes.
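For example, resource attributes can be supplied through the standard OpenTelemetry environment variables; the service name and attribute values below are placeholders:

```shell
# Standard OpenTelemetry resource-attribute environment variables.
# Values are placeholders; set them to match your own deployment.
export OTEL_SERVICE_NAME="llm-inference"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod,service.version=1.4.2"
```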

Fun Fact

The zymtrace team was part of the group that pioneered, open-sourced, and donated the eBPF profiler to OpenTelemetry. With zymtrace, we're extending that same low-level engineering to GPU-bound workloads, building a highly scalable profiling platform purpose-built for today's distributed, heterogeneous environments, spanning both general-purpose and AI-accelerated workloads.

FAQ

Frequently Asked Questions

How is GPU profiling different from metrics, logs, and traces?

Metrics, logs, and traces are analogous to monitoring the vital signs of the human body: they provide general information about health and performance, such as body temperature, weight, and heart rate, along with records of the events leading up to symptoms. The zymtrace GPU profiler is more like an X-ray, or better, an MRI scan: it shows the inner workings of the body and how different systems interact, surfacing issues that are not visible from macro-level indicators alone.

GPU profiling adds a breadth and depth of visibility that uncovers the unknown-unknowns of your CUDA workloads. This system-wide visibility replaces guesswork and gets you straight to the "why" questions: why are we spending x% of our GPU cycles on kernel y? Why is memory bandwidth underutilized? Which CUDA kernels consume the most power across our entire GPU fleet?

How does zymtrace compare to NVIDIA Nsight Compute?

Unlike heavyweight tools like Nsight Compute, which profile single workstations, zymtrace provides continuous, always-on GPU profiling across your entire cluster with zero friction. Combined with our GPU monitoring capabilities, you get both high-level infrastructure visibility and deep kernel-level analysis at scale. We build complete stack traces from PyTorch/JAX code down to SASS instructions, correlating CPU and GPU execution seamlessly in production environments.

How does zymtrace capture GPU profiles?

zymtrace uses eBPF-based profiling combined with CUDA profiling to capture kernel launches and completions. We place uprobes on CUDA functions and merge stack traces with GPU instruction samples, creating unified traces that span from user-mode code to individual GPU instructions and stall reasons.

Can zymtrace help me optimize my CUDA kernels?

Yes. zymtrace identifies kernel fusion opportunities, inefficient memory-transfer patterns, and optimal batch sizes. We've helped users achieve up to 30% performance improvements, with one case showing a 300% speed-up by identifying torch.compile optimization opportunities.

Does zymtrace work in air-gapped, on-premises environments?

Absolutely. zymtrace runs fully on-premises and works in air-gapped environments. All processing and analysis happens locally, with no data leaving your infrastructure. Setup takes about 5 minutes with standard deployment options.

Which GPUs and frameworks are supported?

zymtrace currently focuses on NVIDIA GPUs with CUDA support, including data center GPUs (H100, A100) and consumer cards. We support PyTorch, JAX, and custom CUDA applications. For inference workloads, we support vLLM, Ollama, llama.cpp, and other CUDA-based inference engines. If you need AMD or TPU support, please contact us.

Ready to Optimize Your GPU Workloads?

Get started with GPU profiling in minutes and unlock the full potential of your hardware.

Start Free Trial