Data Engineer, Scientific Data Ingestion
Company: Mithrl
Location: San Francisco
Posted on: April 2, 2026
Job Description:
ABOUT MITHRL
We envision a world where novel drugs and therapies
reach patients in months, not years, accelerating breakthroughs
that save lives. Mithrl is building the world’s first commercially
available AI Co-Scientist—a discovery engine that empowers life
science teams to go from messy biological data to novel insights in
minutes. Scientists ask questions in natural language, and Mithrl
answers with real analysis, novel targets, and patent-ready
reports. Our traction speaks for itself:
12X year-over-year revenue growth
Trusted by leading biotechs and big pharma across three continents
Driving real breakthroughs from target discovery to patient outcomes.

WHAT YOU WILL DO
Build and own an AI-powered
ingestion & normalization pipeline to import data from a wide
variety of sources — unprocessed Excel/CSV uploads, lab and
instrument exports, as well as processed data from internal
pipelines. Develop robust schema mapping, coercion, and conversion
logic (think: units normalization, metadata standardization,
variable-name harmonization, vendor-instrument quirks, plate-reader
formats, reference-genome or annotation updates, batch-effect
correction, etc.). Use LLM-driven and classical data-engineering
tools to structure “semi-structured” or messy tabular data —
extracting metadata, inferring column roles/types, cleaning
free-text headers, fixing inconsistencies, and preparing final
clean datasets. Ensure all transformations that should only happen
once (normalization, coercion, batch-correction) execute during
ingestion — so downstream analytics / the AI “Co-Scientist” always
works with clean, canonical data. Build validation, verification,
and quality-control layers to catch ambiguous, inconsistent, or
corrupt data before it enters the platform. Collaborate with
product teams, data science / bioinformatics colleagues, and
infrastructure engineers to define and enforce data standards, and
ensure pipeline outputs integrate cleanly into downstream analysis
and storage systems.

WHAT YOU BRING
Must-have:
5 years of experience
in data engineering / data wrangling with real-world tabular or
semi-structured data. Strong fluency in Python and data-processing
tools (Pandas, Polars, PyArrow, or similar). Extensive experience
dealing with messy Excel / CSV / spreadsheet-style data —
inconsistent headers, multiple sheets, mixed formats, free-text
fields — and normalizing it into clean structures. Comfort
designing and maintaining robust ETL/ELT pipelines, ideally for
scientific or lab-derived data. Ability to combine classical data
engineering with LLM-powered data normalization / metadata
extraction / cleaning. Strong desire and ability to own the
ingestion & normalization layer end-to-end, from raw upload to
final clean dataset, with an eye for maintainability,
reproducibility, and scalability. Good communication skills; able
to collaborate across teams (product, bioinformatics, infra) and
translate real-world messy data problems into robust engineering
solutions.
Nice-to-have:
Familiarity with scientific data types and
“modalities” (e.g. plate-readers, genomics metadata, time-series,
batch-info, instrumentation outputs). Experience with workflow
orchestration tools (e.g. Nextflow, Prefect, Airflow, Dagster), or
building pipeline abstractions. Experience with cloud
infrastructure and data storage (AWS S3, data lakes/warehouses,
database schemas) to support multi-tenant ingestion. Past exposure
to LLM-based data transformation or cleansing agents — building or
integrating tools that clean or structure messy data automatically.
Any background in computational biology / lab-data / bioinformatics
is a bonus, though not required.

WHAT YOU WILL LOVE AT MITHRL
Mission-driven impact: you’ll be the gatekeeper of data quality —
ensuring that all scientific data entering Mithrl becomes clean,
consistent, and analysis-ready. You’ll have outsized influence over
the reliability and trustworthiness of our entire data and AI stack.
High ownership & autonomy: this role is yours to shape. You decide
how ingestion works, define the standards, build the pipelines.
You’ll work closely with our product, data science, and
infrastructure teams — shaping how data is ingested, stored, and
exposed to end users or AI agents.
Team: Join a tight-knit, talent-dense team of engineers,
scientists, and builders.
Culture: We value consistency, clarity, and hard work. We solve
hard problems through focused daily execution.
Speed: We ship fast (2x/week) and improve continuously based on
real user feedback.
Location: Beautiful SF office with a high-energy, in-person culture.
Benefits: Comprehensive PPO health coverage through Anthem
(medical, dental, and vision); 401(k) with top-tier plans.

We
encourage you to apply even if you do not believe you meet every
single qualification. Not all strong candidates will meet every
single qualification as listed. Research shows that people who
identify as being from underrepresented groups are more prone to
experiencing imposter syndrome and doubting the strength of their
candidacy, so we urge you not to exclude yourself prematurely and
to submit an application if you're interested in this work. We
think AI systems like the ones we're building have enormous social
and ethical implications. We think this makes representation even
more important, and we strive to include a range of diverse
perspectives on our team.