Head of Performance Profiling
Company: Etched
Location: San Jose
Posted on: April 2, 2026
Job Description:
About Etched
Etched is building the world's first AI inference system purpose-built for transformers, delivering over 10x higher performance and dramatically lower cost and latency than a B200. With Etched ASICs, you can build products that would be impossible with GPUs, like real-time video generation models and extremely deep and parallel chain-of-thought reasoning agents. Backed by hundreds of millions from top-tier investors and staffed by leading engineers, Etched is redefining the infrastructure layer for the fastest-growing industry in history.

Job Summary
We are hiring a Head of Performance Profiling to define how performance is understood across next-generation AI accelerator systems. Our ML accelerator platform spans custom silicon, supercomputing software, compiler stacks, runtime libraries, and distributed inference environments. Performance at this scale is no longer a device-level question; it is a high-performance distributed-systems problem. You will define the performance metrics that connect raw hardware signals to distributed workload context, ML cluster dynamics, pod communication patterns, and emergent bottlenecks.

This role requires more than telemetry. You will establish new abstractions: structured counter ontologies, cross-layer event-correlation frameworks, distributed time-alignment strategies, and scalable reasoning systems operating across nodes, racks, and clusters. Working at the intersection of hardware design, driver architecture, runtime systems, and ML infrastructure, you will shape how these layers expose and consume performance intelligence. This is a foundational role that defines not just tooling, but how our platform reasons about efficiency, scalability, and system behavior for years to come.

Key Responsibilities

System-Level Performance Design
- Define the architectural approach for collecting and structuring telemetry across CPUs, drivers, interconnects, and multiple accelerators
- Design scalable models for correlating performance events across device and host boundaries

Cross-Layer Event Correlation
- Develop mechanisms that align hardware counters, runtime activity, communication phases, and workload semantics across model-layer execution into coherent, actionable insight
- Implement time-synchronization and trace-alignment strategies across multi-device systems

Telemetry & Counter Modeling
- Define structured counter taxonomies that separate base signals from derived metrics
- Design derived performance models bridging low-level hardware signals and workload-level behavior
- Influence instrumentation strategy for future hardware generations

Distributed Performance Reasoning
- Build tools that identify bottlenecks in multi-accelerator workloads across chips within hosts
- Build cluster-scale performance analysis for distributed inference across data-center networks

Tooling & Insight Delivery
- Contribute to analysis engines and developer-facing tooling that transform raw telemetry into intuitive insight
- Shape how performance intelligence is surfaced to engineers debugging large-scale AI systems

You may be a good fit if you have
- Deep experience building complex systems at the intersection of hardware and software
- Personally envisioned and built significant portions of profiling, tracing, or observability systems, not solely defined requirements or product strategy
- Demonstrated ability to translate raw hardware signals into scalable, production-grade telemetry and analysis infrastructure
- Experience correlating time-series events across distributed systems
- Deep systems-programming expertise (C++ or Rust), with a track record of shipping low-level infrastructure that operates close to hardware or runtime systems
- Experience designing distributed correlation mechanisms, timestamp-alignment strategies, or performance-modeling frameworks across multiple devices or hosts
- A history of introducing new technical abstractions or counter models that materially improved how engineers debug and optimize systems
- Experience designing distributed tracing or observability platforms at scale
- Experience with high-performance computing systems and large AI training clusters
- Experience with timestamp-synchronization strategies and event alignment in distributed environments
- Experience with hardware counter design and instrumentation strategy
- Experience with runtime systems, compiler internals, or scheduling frameworks
- Experience with performance modeling for large-scale ML workloads
- Experience leading cross-functional architectural initiatives spanning hardware and software teams

Benefits
- Medical, dental, and vision packages with generous premium coverage
- $500 per month credit for waiving medical benefits
- Housing subsidy of $2k per month for those living within walking distance of the office
- Relocation support for those moving to San Jose (Santana Row)
- Various wellness benefits covering fitness, mental health, and more
- Daily lunch and dinner in our office

How we're different
Etched believes in the Bitter Lesson. We think most of the progress in the AI field has come from using more FLOPs to train and run models, and the best way to get more FLOPs is to build model-specific hardware. Larger and larger training runs encourage companies to consolidate around fewer model architectures, which creates a market for single-model ASICs.

We are a fully in-person team in San Jose and Taipei, and we greatly value engineering skills. We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.