GPU Infrastructure Health

GPU Monitoring

Unified GPU Infra Observability

Get a complete performance view of your GPU infrastructure at a glance. Monitor GPU memory usage, power draw, SM occupancy, SM efficiency, NVLink throughput, and PCIe health across your entire cluster.

Start with metrics to form hypotheses, then click data points to dive into detailed GPU profiles.

Real-time GPU monitoring dashboard with health metrics and performance insights

From GPU Monitoring to CUDA Insights

GPU metrics alone stop short: they tell you what is happening, not why. zymtrace combines monitoring with deep CUDA profiling to close the loop from detection to optimization.

1. Monitor Metrics

Track GPU utilization, memory usage, power consumption, and hardware health across your infrastructure. Identify performance trends and potential bottlenecks.

2. Form Hypotheses

Use process-level breakdowns to identify which applications consume the most resources. Spot anomalies in utilization patterns or memory pressure issues.

3. Deep Dive Profiles

Click on interesting data points to navigate to detailed GPU profiles. Analyze CUDA kernels, instruction-level execution, and optimization opportunities.
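The hypothesis-forming step can be illustrated with a small sketch that ranks processes by the GPU resources they hold. This is not zymtrace's data model: the `ProcessSample` fields, the `top_consumers` helper, and the sample values are all hypothetical and exist only to show the idea of a process-level breakdown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessSample:
    """One hypothetical per-process GPU reading (not zymtrace's schema)."""
    process: str         # process or application name
    gpu_index: int       # GPU the reading came from
    memory_mib: float    # GPU memory held by the process on that GPU
    sm_util_pct: float   # SM utilization attributed to the process on that GPU


def top_consumers(samples, key="memory_mib", limit=3):
    """Sum `key` per process across all GPUs and return the largest totals."""
    totals = defaultdict(float)
    for sample in samples:
        totals[sample.process] += getattr(sample, key)
    # Largest total first; ties are broken by name so the order is stable.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Made-up readings for illustration only.
samples = [
    ProcessSample("trainer", 0, 40_000, 85.0),
    ProcessSample("trainer", 1, 38_000, 80.0),
    ProcessSample("inference", 0, 12_000, 10.0),
    ProcessSample("notebook", 1, 30_000, 2.0),
]

print(top_consumers(samples))                     # ranked by memory held
print(top_consumers(samples, key="sm_util_pct"))  # ranked by summed SM utilization
```

In this invented data, `notebook` ranks second by memory but last by SM utilization. A gap like that is the kind of pattern that would justify opening a detailed profile.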

OpenTelemetry

OpenTelemetry Compliant

zymtrace is OpenTelemetry compliant, including support for OTEL resource attributes.
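As a rough illustration of why resource attributes matter, the sketch below groups GPU readings by an OpenTelemetry-style resource attribute. The attribute keys (`service.name`, `host.name`, `k8s.namespace.name`) come from the OpenTelemetry semantic conventions. The reading layout, the `mean_by_attribute` helper, and all values are assumptions for illustration, not zymtrace's export format.

```python
from collections import defaultdict
from statistics import mean

# Each reading carries a resource-attribute map plus one metric value.
# The structure and the numbers are hypothetical.
readings = [
    {"resource": {"service.name": "trainer", "host.name": "node-a",
                  "k8s.namespace.name": "ml"}, "gpu_util_pct": 92.0},
    {"resource": {"service.name": "trainer", "host.name": "node-b",
                  "k8s.namespace.name": "ml"}, "gpu_util_pct": 88.0},
    {"resource": {"service.name": "inference", "host.name": "node-b",
                  "k8s.namespace.name": "serving"}, "gpu_util_pct": 20.0},
]


def mean_by_attribute(readings, attribute, metric="gpu_util_pct"):
    """Average `metric` over readings that share the same resource attribute value."""
    groups = defaultdict(list)
    for reading in readings:
        # Readings without the attribute are collected under "unknown".
        value = reading["resource"].get(attribute, "unknown")
        groups[value].append(reading[metric])
    return {value: mean(values) for value, values in groups.items()}


print(mean_by_attribute(readings, "service.name"))
print(mean_by_attribute(readings, "host.name"))
```

The same readings can be sliced by service, host, or namespace without changing how they were collected, which is the practical benefit of attaching resource attributes at the source.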

Fun Fact

The zymtrace team was part of the team that pioneered, open-sourced, and donated the eBPF profiler to OpenTelemetry.
With zymtrace, we're extending that same low-level engineering to GPU-bound workloads and building a highly scalable profiling platform for today's distributed, heterogeneous environments, spanning both general-purpose and AI-accelerated workloads.

FAQ

Frequently Asked Questions

GPU monitoring provides high-level visibility into performance trends, resource utilization, and hardware health. Use monitoring to identify interesting patterns or anomalies, then click on specific data points to drill down into detailed GPU profiles for kernel-level analysis and optimization.
zymtrace collects a broad set of GPU metrics, including utilization, memory usage, power consumption, temperature, SM efficiency, SM occupancy, Tensor Core utilization, and data transfer rates. Process-level breakdowns help identify which applications consume the most resources. All the GPU metrics we collect are listed here. If there are other metrics you'd like to see supported, let us know at [email protected].
Basic GPU monitoring works on all NVIDIA GPUs with CUDA support. Advanced metrics like SM Efficiency, SM Occupancy, and Tensor Core Utilization are available only on data center GPUs (H100s, H200s, etc.). Consumer GPUs still get full utilization, memory, power, and temperature monitoring.
zymtrace uses NVIDIA's NVML (NVIDIA Management Library) to collect real-time GPU metrics directly from the hardware. This provides performance, utilization, and health data with minimal overhead on your workloads.
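For readers unfamiliar with NVML, here is a minimal sketch of reading a few basic metrics through its Python bindings. It is not zymtrace's collector. The NVML function names are those exposed by the `pynvml` module (installed by the `nvidia-ml-py` package), while the `sample_gpus` helper and the output field names are hypothetical. The binding is passed in as a parameter so the function can also be exercised without a GPU.

```python
def sample_gpus(nvml):
    """Return one snapshot per GPU using an NVML binding passed in as `nvml`.

    `nvml` is expected to expose the pynvml function names used below.
    """
    nvml.nvmlInit()
    try:
        snapshot = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu and .memory, in percent
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)         # .total, .used, .free, in bytes
            snapshot.append({
                "index": index,
                "gpu_util_pct": util.gpu,
                "memory_used_mib": mem.used / 2**20,
                # NVML reports power draw in milliwatts.
                "power_w": nvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,
                "temperature_c": nvml.nvmlDeviceGetTemperature(
                    handle, nvml.NVML_TEMPERATURE_GPU
                ),
            })
        return snapshot
    finally:
        # Always release the NVML session, even if a query fails.
        nvml.nvmlShutdown()
```

On a machine with an NVIDIA driver and `nvidia-ml-py` installed, calling `sample_gpus(pynvml)` after `import pynvml` should return one dictionary per GPU.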
zymtrace provides distributed, cluster-wide GPU monitoring. Monitor utilization, memory usage, and performance across all GPUs in your infrastructure from a single dashboard, making it easy to identify underutilized resources or performance hotspots.
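One simple way to think about spotting underutilized GPUs and hotspots across a cluster is to label each GPU by its mean utilization over a window. The sketch below does exactly that and nothing more. The `classify_gpus` helper, the thresholds, the GPU identifiers, and the sample values are assumptions for illustration and do not describe how the zymtrace dashboard works.

```python
from statistics import mean


def classify_gpus(utilization_by_gpu, idle_below=10.0, hot_above=90.0):
    """Label each GPU by its mean utilization (percent) over the sampled window."""
    labels = {}
    for gpu_id, samples in utilization_by_gpu.items():
        if not samples:
            continue  # no data for this GPU in the window; leave it unlabeled
        average = mean(samples)
        if average < idle_below:
            labels[gpu_id] = "underutilized"
        elif average > hot_above:
            labels[gpu_id] = "hotspot"
        else:
            labels[gpu_id] = "normal"
    return labels


# Made-up utilization samples keyed by a hypothetical "node/gpu" identifier.
window = {
    "node-a/gpu0": [95.0, 97.0, 99.0],
    "node-a/gpu1": [3.0, 0.0, 6.0],
    "node-b/gpu0": [40.0, 60.0, 50.0],
}

print(classify_gpus(window))
```

Fixed thresholds are the simplest choice. A real deployment would likely tune them per workload or compare each GPU against its own history.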

Ready to Monitor Your GPU Infrastructure?

Get unified visibility into GPU performance and hardware health across your cluster.