AI-ML · Rust · ONNX Runtime · Qdrant · MPNet · Cross-Encoder · BM25

DeepScreen (Local AI Résumé Shortlisting)

Local-first résumé shortlisting in Rust — parses a folder of PDFs, embeds them with quantized ONNX models, and ranks candidates against a job spec with a 3-stage pipeline: Qdrant vector search, cross-encoder reranking, and BM25 lexical scoring.

Source · Download for Windows

Problem

Résumé screening is the highest-volume, lowest-joy task in hiring, and the usual fix makes it worse: uploading a folder of candidate PDFs to a cloud ATS or an LLM API sends personal data off the box, often into a vendor's training pipeline. DeepScreen ranks a folder of résumés against a job description entirely on your machine — no cloud calls, nothing leaving the host, runnable fully offline. It ships as a single Rust release binary so a recruiter can shortlist a stack of PDFs with one command and a reproducible result.

Architecture

Each résumé PDF is converted to markdown and parsed into four structured sections — skills, responsibilities, experience, and summary — using a local LLM when one is reachable and falling back to a fast heuristic parser otherwise. The sections are embedded with a quantized all-mpnet-base-v2 ONNX model (int8) and indexed in Qdrant. Ranking then runs in three stages against the embedded job description: a Qdrant weighted vector search recalls the candidate pool, an ms-marco-MiniLM-L12-v2 cross-encoder re-scores each candidate per field, and a BM25 lexical pass rewards exact keyword overlap. The three signals fuse into a final composite — 0.20 × Qdrant + 0.65 × cross-encoder + 0.15 × BM25, each min-max normalised within the top-K pool. The whole run is resumable and produces a single, reproducible shortlist.
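The fusion step described above can be sketched as follows. This is a minimal illustration, not DeepScreen's actual code: the `StageScores` struct, the function names, and the choice to map a constant-valued pool to 0.0 are all assumptions. Only the three weights and the min-max normalisation within the pool come from the description above.

```rust
// Weights from the composite formula in the text.
const W_QDRANT: f32 = 0.20;
const W_CROSS: f32 = 0.65;
const W_BM25: f32 = 0.15;

/// Raw scores for one candidate in the top-K pool (hypothetical struct).
#[derive(Debug, Clone)]
pub struct StageScores {
    pub candidate: String,
    pub qdrant: f32,
    pub cross_encoder: f32,
    pub bm25: f32,
}

/// Min-max normalise a slice into [0, 1].
/// Assumption: if every value is identical, all normalise to 0.0.
fn min_max(values: &[f32]) -> Vec<f32> {
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    values
        .iter()
        .map(|v| if range > 0.0 { (v - min) / range } else { 0.0 })
        .collect()
}

/// Normalise each signal within the pool, then combine with the fixed weights.
/// Returns (candidate, composite) pairs sorted from best to worst.
pub fn fuse(pool: &[StageScores]) -> Vec<(String, f32)> {
    let q = min_max(&pool.iter().map(|s| s.qdrant).collect::<Vec<_>>());
    let c = min_max(&pool.iter().map(|s| s.cross_encoder).collect::<Vec<_>>());
    let b = min_max(&pool.iter().map(|s| s.bm25).collect::<Vec<_>>());

    let mut ranked: Vec<(String, f32)> = pool
        .iter()
        .enumerate()
        .map(|(i, s)| {
            let composite = W_QDRANT * q[i] + W_CROSS * c[i] + W_BM25 * b[i];
            (s.candidate.clone(), composite)
        })
        .collect();

    // total_cmp gives a total order on f32, so the sort is deterministic.
    ranked.sort_by(|x, y| y.1.total_cmp(&x.1));
    ranked
}
```

Because each signal is rescaled within the pool before weighting, the raw ranges of the three scorers (cosine similarity, cross-encoder logits, unbounded BM25 sums) do not need to be comparable with one another.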

Tech decisions & trade-offs

Why the whole pipeline runs locally

Résumés are personal data, and the moment they touch a hosted API the privacy story is gone — terms of service, retention windows, and training-data clauses all become someone else's policy. DeepScreen keeps embeddings, vector search, and reranking on the host so candidate data never leaves the machine, and the tool works on a plane with the Wi-Fi off. The trade-off is that there's no managed-service convenience: models are bundled and run on the CPU, and the user owns the hardware envelope rather than renting it.

Why int8-quantized ONNX models

Screening accuracy doesn't hinge on a GPU; what matters is that the tool runs anywhere a recruiter's laptop runs. Quantizing all-mpnet-base-v2 and the MiniLM cross-encoder to int8 and serving them through ONNX Runtime keeps inference CPU-friendly and the binary self-contained, with no CUDA toolchain to install. The trade-off is a small, measured drop in raw embedding fidelity versus full-precision PyTorch, paid back many times over in portability and a cold start that doesn't depend on a GPU being present.
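To make the fidelity trade-off concrete, here is a minimal sketch of symmetric per-tensor int8 quantization. It illustrates the general idea only. It is not a description of how DeepScreen's models were quantized or of ONNX Runtime's quantizer, and the function names are hypothetical.

```rust
/// Symmetric per-tensor quantisation: scale = max|x| / 127, q = round(x / scale).
/// Returns the int8 values together with the scale needed to restore them.
pub fn quantize(values: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = values.iter().fold(0.0_f32, |m, v| m.max(v.abs()));
    // Assumption: an all-zero tensor uses scale 1.0 to avoid dividing by zero.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let quantized = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Map int8 values back to approximate f32 values.
pub fn dequantize(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| q as f32 * scale).collect()
}

/// Largest absolute difference between the original and restored values.
pub fn max_abs_error(original: &[f32], restored: &[f32]) -> f32 {
    original
        .iter()
        .zip(restored)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0, f32::max)
}
```

With this scheme each restored value differs from the original by at most about half a quantization step (`scale / 2`), which is the kind of bounded rounding error that int8 storage trades for a model roughly a quarter the size of its 32-bit float counterpart.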

Why three ranking stages instead of one model

No single scorer is good at everything, so DeepScreen layers three. A bi-encoder vector search (Qdrant) is cheap and scales to a whole folder, but it embeds each résumé independently of the job description and so misses fine-grained matches; it serves as the fast recall stage. A cross-encoder reads the job description and a résumé together and is far more precise, but it is too expensive to run against everyone, so it only re-scores the recalled pool. BM25 then adds what neither embedding stage captures well: exact lexical overlap, so a hard requirement spelled out verbatim isn't smoothed away by semantic similarity. The three fuse into a weighted composite (0.20 × Qdrant + 0.65 × cross-encoder + 0.15 × BM25), leaning on the cross-encoder for judgement while letting recall and keyword signals correct its blind spots. The result is cross-encoder precision on the recalled pool at close to bi-encoder cost across the whole folder, with a lexical safety net.
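As an illustration of the lexical stage, the sketch below scores documents against a query with Okapi BM25. It is an assumption-laden example rather than DeepScreen's implementation: the tokenizer is deliberately naive, the function names are hypothetical, and the IDF uses the common non-negative variant ln(1 + (N - n + 0.5) / (n + 0.5)). The parameters `k1` and `b` are left to the caller; 1.2 and 0.75 are widely used defaults.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercase and split on non-alphanumeric characters (naive tokenizer).
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score every document against the query with Okapi BM25.
pub fn bm25_scores(query: &str, docs: &[&str], k1: f32, b: f32) -> Vec<f32> {
    if docs.is_empty() {
        return Vec::new();
    }
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    let n = tokenized.len() as f32;
    let avg_len = tokenized.iter().map(|d| d.len() as f32).sum::<f32>() / n;

    // Document frequency: in how many documents each term appears.
    let mut df: HashMap<&str, f32> = HashMap::new();
    for doc in &tokenized {
        let unique: HashSet<&str> = doc.iter().map(String::as_str).collect();
        for term in unique {
            *df.entry(term).or_insert(0.0) += 1.0;
        }
    }

    // Sorted, de-duplicated query terms keep the summation order deterministic.
    let mut query_terms = tokenize(query);
    query_terms.sort();
    query_terms.dedup();

    tokenized
        .iter()
        .map(|doc| {
            let len = doc.len() as f32;
            query_terms
                .iter()
                .map(|term| {
                    let tf = doc.iter().filter(|t| *t == term).count() as f32;
                    if tf == 0.0 {
                        return 0.0;
                    }
                    let n_t = df.get(term.as_str()).copied().unwrap_or(0.0);
                    let idf = ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln();
                    // Term-frequency saturation with document-length normalisation.
                    idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * len / avg_len))
                })
                .sum::<f32>()
        })
        .collect()
}
```

A résumé that never mentions a query term scores exactly zero, while one that spells out a required keyword verbatim gets a positive score however the embedding stages judged it. That is the behaviour the lexical pass is meant to contribute.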

Why Rust and a single binary

Screening is a batch job a non-engineer should be able to run, repeatedly, and trust. Rust gives a single statically linked release binary, with no Python environment to reconstruct and no dependency drift between runs, and the pipeline is built to be resumable, so a run over a large folder that's interrupted picks up where it stopped. The cost is a slower build and a steeper authoring curve than a Python script; the payoff is a tool that produces the same shortlist on someone else's machine months later.
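One simple way to get resumable behaviour is an append-only checkpoint file that records each résumé once it has been fully processed. The sketch below shows that idea using only the standard library. The source does not say how DeepScreen tracks progress, so the file format, function names, and one-name-per-line convention are all assumptions.

```rust
use std::collections::HashSet;
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Load the names already processed; a missing checkpoint means a fresh run.
pub fn load_done(checkpoint: &Path) -> io::Result<HashSet<String>> {
    match fs::read_to_string(checkpoint) {
        Ok(text) => Ok(text.lines().map(str::to_owned).collect()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(HashSet::new()),
        Err(e) => Err(e),
    }
}

/// Append one finished name to the checkpoint (one name per line).
pub fn mark_done(checkpoint: &Path, name: &str) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(checkpoint)?;
    writeln!(file, "{name}")
}

/// Process only the names not yet recorded; returns how many ran this time.
/// A name is recorded only after `process` succeeds, so a failed or
/// interrupted item is retried on the next run.
pub fn run_resumable<F>(names: &[&str], checkpoint: &Path, mut process: F) -> io::Result<usize>
where
    F: FnMut(&str) -> io::Result<()>,
{
    let done = load_done(checkpoint)?;
    let mut processed = 0;
    for &name in names {
        if done.contains(name) {
            continue;
        }
        process(name)?;
        mark_done(checkpoint, name)?;
        processed += 1;
    }
    Ok(processed)
}
```

Calling `run_resumable` a second time with the same checkpoint path skips everything that finished earlier. This sketch assumes file names contain no newlines and that only one process writes the checkpoint at a time.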

Repositories

DeepScreen-Local · Source · pushed Jun 24, 2026 · 2