
DeepScreen (Local AI Résumé Shortlisting)
Local-first résumé shortlisting in Rust: parses a folder of PDFs, embeds them with int8-quantized ONNX models, and ranks candidates against a job description with a three-stage pipeline of Qdrant vector search, cross-encoder reranking, and BM25 lexical scoring.
Problem
Résumé screening is the highest-volume, lowest-joy task in hiring, and the usual fix makes it worse: uploading a folder of candidate PDFs to a cloud ATS or an LLM API sends personal data off the machine, where it falls under the vendor's retention policy and possibly its training pipeline. DeepScreen ranks a folder of résumés against a job description entirely on your machine, with no cloud calls and nothing leaving the host, so it can run fully offline. It ships as a single Rust release binary, so a recruiter can shortlist a stack of PDFs with one command and get a reproducible result.
Architecture
Each résumé PDF is converted to markdown and parsed into four structured sections (skills, responsibilities, experience, and summary), using a local LLM when one is reachable and falling back to a fast heuristic parser otherwise. The sections are embedded with an int8-quantized all-mpnet-base-v2 ONNX model and indexed in Qdrant. Ranking then runs in three stages against the embedded job description: a weighted Qdrant vector search recalls the top-K candidate pool, an ms-marco-MiniLM-L12-v2 cross-encoder re-scores each candidate in that pool per field, and a BM25 lexical pass rewards exact keyword overlap. The three signals are each min-max normalised within the top-K pool and then fused into a final composite: 0.20 × Qdrant + 0.65 × cross-encoder + 0.15 × BM25. The whole run is resumable and produces a single, reproducible shortlist.
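The fusion step can be sketched as follows. This is a minimal illustration rather than DeepScreen's actual code: the `StageScores` struct, the `min_max` and `fuse` names, and the choice to map a constant-valued signal to 0.0 are assumptions; only the weights and the min-max normalisation within the pool come from the description above.

```rust
/// Raw scores for one candidate in the recalled pool (hypothetical struct).
#[derive(Debug, Clone)]
pub struct StageScores {
    pub id: String,
    pub qdrant: f32,
    pub cross_encoder: f32,
    pub bm25: f32,
}

// Weights quoted in the text.
const W_QDRANT: f32 = 0.20;
const W_CROSS: f32 = 0.65;
const W_BM25: f32 = 0.15;

/// Min-max normalise values into [0, 1].
/// Assumption: if every value is identical, the signal carries no
/// ranking information and is mapped to 0.0.
fn min_max(values: &[f32]) -> Vec<f32> {
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    values
        .iter()
        .map(|v| if range > 0.0 { (v - min) / range } else { 0.0 })
        .collect()
}

/// Fuse the three signals into a composite score and sort best-first.
pub fn fuse(pool: &[StageScores]) -> Vec<(String, f32)> {
    let q = min_max(&pool.iter().map(|s| s.qdrant).collect::<Vec<_>>());
    let c = min_max(&pool.iter().map(|s| s.cross_encoder).collect::<Vec<_>>());
    let l = min_max(&pool.iter().map(|s| s.bm25).collect::<Vec<_>>());

    let mut ranked: Vec<(String, f32)> = pool
        .iter()
        .enumerate()
        .map(|(i, s)| {
            let composite = W_QDRANT * q[i] + W_CROSS * c[i] + W_BM25 * l[i];
            (s.id.clone(), composite)
        })
        .collect();

    // total_cmp gives a total order on floats, so the sort cannot panic.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked
}
```

Normalising inside the pool matters because the three raw scores live on different scales (cosine similarity, cross-encoder logits, unbounded BM25 sums); without it the weights would not mean what they say.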
Tech decisions & trade-offs
Why the whole pipeline runs locally
Résumés are personal data, and the moment they touch a hosted API the privacy story is gone — terms of service, retention windows, and training-data clauses all become someone else's policy. DeepScreen keeps embeddings, vector search, and reranking on the host so candidate data never leaves the machine, and the tool works on a plane with the Wi-Fi off. The trade-off is that there's no managed-service convenience: models are bundled and run on the CPU, and the user owns the hardware envelope rather than renting it.
Why int8-quantized ONNX models
Screening doesn't need GPU-grade accuracy; it needs to run anywhere a recruiter's laptop runs. Quantizing all-mpnet-base-v2 and the MiniLM cross-encoder to int8 and serving them through ONNX Runtime keeps inference CPU-friendly and the binary self-contained, with no CUDA toolchain to install. The trade-off is a small, measured drop in raw embedding fidelity versus full-precision PyTorch, paid back many times over in portability and a cold start that doesn't depend on a GPU being present.
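To make the fidelity trade-off concrete, here is a small sketch of symmetric per-tensor int8 quantisation and the round-trip error it introduces. It is illustrative only: real ONNX quantisation is done by the model tooling and may use per-channel or asymmetric schemes, and the function names below are hypothetical rather than taken from DeepScreen.

```rust
/// Quantise f32 values to int8 with a single symmetric scale:
/// scale = max|x| / 127, q = round(x / scale).
pub fn quantize(values: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = values.iter().fold(0.0_f32, |m, v| m.max(v.abs()));
    // Avoid dividing by zero for an all-zero tensor.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q: Vec<i8> = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Map int8 values back to approximate f32 values.
pub fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&x| x as f32 * scale).collect()
}

/// Largest absolute difference between the original values and their
/// quantise-then-dequantise reconstruction.
pub fn max_round_trip_error(values: &[f32]) -> f32 {
    let (q, scale) = quantize(values);
    dequantize(&q, scale)
        .iter()
        .zip(values)
        .map(|(d, v)| (d - v).abs())
        .fold(0.0, f32::max)
}
```

With this scheme each value is off by at most about half a quantisation step (`scale / 2`), which is the kind of bounded error the fidelity drop refers to.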
Why three ranking stages instead of one model
No single scorer is good at everything, so DeepScreen layers three. A bi-encoder vector search (Qdrant) is cheap and scales to a whole folder, but it embeds each résumé independently of the job description and so misses fine-grained matches; it is the fast recall stage. A cross-encoder reads the job description and a résumé together and is far more precise, but it is too expensive to run against everyone, so it only re-scores the recalled pool. BM25 then adds what neither embedding stage captures well: exact lexical overlap, so a hard requirement spelled out verbatim isn't smoothed away by semantic similarity. The three fuse into a weighted composite (0.20 × Qdrant + 0.65 × cross-encoder + 0.15 × BM25) that leans on the cross-encoder for judgement while letting recall and keyword signals correct its blind spots. The result approaches cross-encoder precision at close to bi-encoder cost, with a lexical safety net.
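The lexical stage can be illustrated with a compact Okapi BM25 scorer. This is a sketch under assumptions, not DeepScreen's implementation: the whitespace-and-punctuation tokenizer, the common default parameters `K1 = 1.2` and `B = 0.75`, and the function names are all chosen here for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercase and split on anything that is not alphanumeric (assumed tokenizer).
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score every document against the query with Okapi BM25.
pub fn bm25_scores(query: &str, docs: &[&str]) -> Vec<f32> {
    const K1: f32 = 1.2; // term-frequency saturation (assumed default)
    const B: f32 = 0.75; // length normalisation strength (assumed default)

    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    if tokenized.is_empty() {
        return Vec::new();
    }
    let n = tokenized.len() as f32;
    let avg_len = tokenized.iter().map(|d| d.len()).sum::<usize>() as f32 / n;

    // Document frequency: in how many documents each term appears.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in &tokenized {
        let unique: HashSet<&str> = doc.iter().map(String::as_str).collect();
        for term in unique {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    // Sorted, de-duplicated query terms keep the summation order deterministic.
    let mut query_terms = tokenize(query);
    query_terms.sort();
    query_terms.dedup();

    tokenized
        .iter()
        .map(|doc| {
            let len = doc.len() as f32;
            query_terms
                .iter()
                .map(|term| {
                    let tf = doc.iter().filter(|t| *t == term).count() as f32;
                    if tf == 0.0 {
                        return 0.0; // term absent: contributes nothing
                    }
                    let n_t = df.get(term.as_str()).copied().unwrap_or(0) as f32;
                    let idf = ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln();
                    idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len))
                })
                .sum::<f32>()
        })
        .collect()
}
```

A résumé that contains a rare required keyword verbatim gets a large boost from the IDF term, which is exactly the signal an embedding similarity can blur.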
Why Rust and a single binary
Screening is a batch job a non-engineer should be able to run, repeatedly, and trust. Rust gives a single statically-linked release binary, with no Python environment to reconstruct and no dependency drift between runs, and the pipeline is built to be resumable, so a run over a large folder that gets interrupted picks up where it stopped. The cost is a slower build and a steeper authoring curve than a Python script; the payoff is a tool that produces the same shortlist on someone else's machine months later.
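One simple way to get resumability is an append-only checkpoint file that records each input once it has been processed. The sketch below assumes that approach; the source does not say how DeepScreen tracks progress, and the one-path-per-line checkpoint format and the `load_done` and `run_resumable` names are hypothetical.

```rust
use std::collections::HashSet;
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Read the set of already-processed inputs from the checkpoint file.
/// A missing file simply means nothing has been processed yet.
pub fn load_done(checkpoint: &Path) -> io::Result<HashSet<String>> {
    match fs::read_to_string(checkpoint) {
        Ok(text) => Ok(text.lines().map(str::to_owned).collect()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(HashSet::new()),
        Err(e) => Err(e),
    }
}

/// Process every input that is not yet in the checkpoint, recording each
/// one only after `process` succeeds. Returns how many inputs were
/// processed in this run.
pub fn run_resumable<F>(
    inputs: &[PathBuf],
    checkpoint: &Path,
    mut process: F,
) -> io::Result<usize>
where
    F: FnMut(&Path) -> io::Result<()>,
{
    let done = load_done(checkpoint)?;
    let mut log = OpenOptions::new()
        .create(true)
        .append(true)
        .open(checkpoint)?;

    let mut processed = 0;
    for path in inputs {
        let key = path.to_string_lossy().into_owned();
        if done.contains(&key) {
            continue; // finished in an earlier run
        }
        process(path)?; // on failure, the input stays unrecorded
        writeln!(log, "{key}")?;
        processed += 1;
    }
    Ok(processed)
}
```

Because an entry is written only after its input succeeds, an interrupted run leaves the failing input unrecorded and the next run retries it while skipping everything already done.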