GPU Monitoring

Unified GPU Infra Observability

Get a complete performance view of your GPU infrastructure at a glance. Monitor GPU memory usage, power draw, SM occupancy, SM efficiency, NVLink throughput, and PCIe health across your entire cluster.

Start with metrics to form hypotheses, then click data points to dive into detailed GPU profiles.

Start Monitoring →

Real-time GPU monitoring dashboard with health metrics and performance insights

From GPU Monitoring to CUDA Insights

GPU metrics alone stop short—they tell you WHAT is happening, not WHY. zymtrace combines monitoring with deep CUDA profiling to close the loop from detection to optimization.

Monitor Metrics

Track GPU utilization, memory usage, power consumption, and hardware health across your infrastructure. Identify performance trends and potential bottlenecks.

Form Hypotheses

Use process-level breakdowns to identify which applications consume the most resources. Spot anomalies in utilization patterns or memory pressure issues.

Deep Dive Profiles

Click on interesting data points to navigate to detailed GPU profiles. Analyze CUDA kernels, instruction-level execution, and optimization opportunities.

Learn About GPU Profiling →

Compliant

zymtrace is OpenTelemetry compliant, including support for OTEL resource attributes. Collect, process, and export profiling data in the standard OpenTelemetry format for seamless integration with your existing observability stack.

Fun Fact

Members of the zymtrace team helped pioneer, open-source, and donate the eBPF profiler to OpenTelemetry. With zymtrace, we're extending that same low-level engineering excellence to GPU-bound workloads and building a highly scalable profiling platform purpose-built for today's distributed, heterogeneous environments, spanning both general-purpose and AI-accelerated workloads.

View OpenTelemetry eBPF Profiler

FAQ

Frequently Asked Questions

How does GPU monitoring complement GPU profiling?

GPU monitoring provides high-level visibility into performance trends, resource utilization, and hardware health. Use monitoring to identify interesting patterns or anomalies, then click on specific data points to drill down into detailed GPU profiles for kernel-level analysis and optimization.

What GPU metrics are available for monitoring?

zymtrace collects a comprehensive set of GPU metrics, including utilization, memory usage, power consumption, temperature, SM efficiency, SM occupancy, Tensor Core utilization, and data transfer rates. Process-level breakdowns help identify which applications consume the most resources. All the GPU metrics we collect are listed here. If there are other metrics you’d like to see supported, let us know at [email protected].

Do I need special hardware for GPU monitoring?

Basic GPU monitoring works on all NVIDIA GPUs with CUDA support. Advanced metrics such as SM efficiency, SM occupancy, and Tensor Core utilization are available only on data center GPUs (H100, H200, etc.). Consumer GPUs still get full utilization, memory, power, and temperature monitoring.

How is GPU monitoring data collected?

zymtrace uses NVIDIA's NVML (NVIDIA Management Library) to collect real-time GPU metrics directly from the hardware. This provides performance, utilization, and health data with minimal overhead on your workloads.
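
To give a feel for what NVML exposes, here is a minimal polling sketch in Python. It is an illustration only, not zymtrace's collector: it assumes the `pynvml` module (shipped in NVIDIA's `nvidia-ml-py` package) is installed, and the helper names `_read` and `collect_gpu_metrics` are hypothetical. Metrics that a given GPU does not support come back as `None` rather than failing the whole sample.

```python
"""Minimal NVML polling sketch (illustrative; not zymtrace's collector)."""

try:
    import pynvml  # assumption: provided by the `nvidia-ml-py` package
except ImportError:
    pynvml = None


def _read(getter, *args):
    """Call an NVML getter; return None if this GPU or driver rejects the query."""
    try:
        return getter(*args)
    except pynvml.NVMLError:
        return None


def collect_gpu_metrics():
    """Return one dict of basic metrics per visible GPU, or [] if NVML is unavailable."""
    if pynvml is None:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # no NVIDIA driver or NVML library on this host
    try:
        samples = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            util = _read(pynvml.nvmlDeviceGetUtilizationRates, handle)
            mem = _read(pynvml.nvmlDeviceGetMemoryInfo, handle)
            power_mw = _read(pynvml.nvmlDeviceGetPowerUsage, handle)  # milliwatts
            temp_c = _read(
                pynvml.nvmlDeviceGetTemperature, handle, pynvml.NVML_TEMPERATURE_GPU
            )
            samples.append({
                "gpu_index": index,
                "utilization_pct": util.gpu if util is not None else None,
                "memory_used_bytes": mem.used if mem is not None else None,
                "memory_total_bytes": mem.total if mem is not None else None,
                "power_watts": power_mw / 1000.0 if power_mw is not None else None,
                "temperature_c": temp_c,
            })
        return samples
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    for sample in collect_gpu_metrics():
        print(sample)
```

On a host without an NVIDIA driver the function returns an empty list, so the same script can run unchanged on CPU-only nodes.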

Can I monitor multiple GPUs across a cluster?

Yes! zymtrace provides distributed cluster-wide GPU monitoring. Monitor utilization, memory usage, and performance across all GPUs in your infrastructure from a single dashboard, making it easy to identify underutilized resources or performance hotspots.
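
As a rough illustration of how a cluster-wide view can surface underutilized GPUs, the sketch below averages utilization samples per GPU and flags those under a threshold. The sample layout, the 20% threshold, and the function name `find_underutilized` are assumptions made for this example, not zymtrace's data model or API.

```python
from collections import defaultdict
from statistics import mean


def find_underutilized(samples, threshold_pct=20.0):
    """Average utilization per (node, gpu_index) and return GPUs below the threshold."""
    by_gpu = defaultdict(list)
    for sample in samples:
        key = (sample["node"], sample["gpu_index"])
        by_gpu[key].append(sample["utilization_pct"])
    averages = {gpu: mean(values) for gpu, values in by_gpu.items()}
    return sorted((gpu, avg) for gpu, avg in averages.items() if avg < threshold_pct)


# Hypothetical samples, as they might arrive from agents on two nodes.
samples = [
    {"node": "node-a", "gpu_index": 0, "utilization_pct": 92.0},
    {"node": "node-a", "gpu_index": 0, "utilization_pct": 88.0},
    {"node": "node-a", "gpu_index": 1, "utilization_pct": 5.0},
    {"node": "node-a", "gpu_index": 1, "utilization_pct": 11.0},
    {"node": "node-b", "gpu_index": 0, "utilization_pct": 47.0},
]

print(find_underutilized(samples))  # [(('node-a', 1), 8.0)]
```

A simple mean hides short bursts, so a real dashboard would likely also look at percentiles or time windows before calling a GPU idle.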

Ready to Monitor Your GPU Infrastructure?

Get unified visibility into GPU performance and hardware health across your cluster.

Start Monitoring