Career
Data Software Engineer (Recommendation Systems, Rust)
Fully Remote (Global)
About zymtrace
Organizations spend billions on GPU infrastructure to power AI, yet roughly 60-65% of that investment is wasted on underutilized hardware, idle cycles, and inefficient workloads. The problem isn’t the teams running them — it’s that existing tools treat GPUs as black boxes, showing surface-level metrics without revealing where the waste actually lives.
zymtrace is a distributed AI infrastructure optimization platform that gives our customers deep, always-on visibility into GPU-accelerated workloads across their entire clusters. We profile from PyTorch and JAX code through CUDA kernels all the way down to individual GPU instructions and stall reasons, then correlate everything back to the CPU code that triggered it. Zero code changes. Zero guesswork.
Leading AI labs, Fortune 500 companies, and quantitative research firms trust zymtrace to debug and optimize their most demanding workloads, achieving results like 300% inference speedups and millions in annual infrastructure savings.
Our founders pioneered, open-sourced, and contributed the eBPF CPU profiler to OpenTelemetry, the same technology now adopted by Grafana, Datadog, Google, and others. We’re now applying that same low-level engineering depth to GPU-bound workloads as an NVIDIA Inception partner.
We’re a team of kernel hackers and systems programmers who operate at the deepest layers of the stack: GPUs, CUDA runtimes, eBPF, compilers, and instruction-level introspection. That depth isn’t a feature. It’s the product. By joining zymtrace, you’ll work at the bleeding edge of modern computing, helping organizations optimize AI training and inference workloads at massive scale. The problems we solve touch every layer of the stack, and the impact is measured in millions of GPU-hours reclaimed.
About the Role
This role sits at the intersection of systems engineering and data science.
We are seeking a Data Software Engineer to accelerate our work on building recommendation systems that transform massive volumes of observability data (starting with GPU & CPU profiling signals) into actionable insights. Beyond insights, you’ll also explore cutting-edge integration with small and large language model (SLM/LLM) agents to automatically generate, review, and suggest pull requests.
You’ll design and implement recommendation algorithms that automatically identify performance bottlenecks, suggest optimizations, and prioritize the most impactful improvements by cost, carbon emissions, or any other prioritization criteria.
Working primarily with ClickHouse as your data backbone, you’ll turn raw system telemetry into intelligent, context-aware insights that help AI/ML engineers answer performance questions quickly, without combing through multiple dashboards, flamegraphs, and visualizations.
All our services are built in Rust, so you must be comfortable reading and writing Rust 🦀.
Must-Have Qualifications
- Experience designing and implementing recommendation systems or similar ML/AI-driven features
- Experience with Rust for building high-performance data systems
- Proven ability to work with large datasets and optimize for real-time data processing
- Understanding of systems performance concepts and profiling data
- Ability to work independently and communicate clearly in a distributed team environment
Nice-to-Have Qualifications
- Experience with ClickHouse for analytics workloads and large-scale data operations
- Experience running LLMs locally; experience distilling LLMs into SLMs is a massive bonus
- Familiarity with observability tools and performance monitoring systems
- Knowledge of GPU profiling and accelerated computing workloads
- Background in systems performance optimization
Why Join zymtrace?
- Fully remote work environment with flexible hours
- Work on a meaningful project with a world-class team
- Use and contribute to cutting-edge technologies like Rust, eBPF, ClickHouse, Nix, and WASM
- Competitive compensation and benefits package