Tim Osterhus

Local model evaluation notes

Model Info

The interactive resume can route questions through different local model paths. The goal is not to chase generic leaderboard scores. The useful question is whether a model can answer from Tim's evidence corpus, cite sources cleanly, respect missing evidence, and do it quickly enough for a public site.

These numbers are the latest completed local runs available as of May 28, 2026 HST. Rows are labeled when they come from an older corpus snapshot, so this page should be read as an operational model-selection snapshot, not a final benchmark paper.

The current default Fast model is granite-4.1-3b-tim-resume:latest, a local Granite 4.1 3B fine-tune selected from the V4 backend-raw checkpoint sweep.

100-Question Corpus Coverage Eval

This eval checks broad resume coverage. It asks 100 role-aware questions across investor, recruiter, entrepreneur, builder, and friend perspectives. Most questions are source-grounded; a smaller set checks whether the model can say when evidence is missing instead of inventing facts.

The score has two useful parts. The pass count is strict: the answer has to satisfy the required claims and boundaries. The partial score gives fractional credit for getting some required pieces right, so it is often a better signal when comparing close local models.
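The two parts of the score can be sketched as follows. This is a minimal illustration, not the actual eval harness: `QuestionResult`, `score_run`, and the sample counts are hypothetical, and the sketch assumes partial credit is simply the fraction of required items an answer satisfied.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    """Outcome for one eval question (hypothetical shape, not the real harness)."""
    required_met: int    # required claims and boundaries the answer satisfied
    required_total: int  # required claims and boundaries for the question (> 0)


def score_run(results: list[QuestionResult]) -> tuple[int, float]:
    """Return (strict pass count, partial score) for one run."""
    passes = 0
    partial = 0.0
    for r in results:
        # Assumption: fractional credit is met / total for each question.
        partial += r.required_met / r.required_total
        # Strict pass: every required item must be satisfied.
        if r.required_met == r.required_total:
            passes += 1
    return passes, partial


# Made-up sample data: one full pass and two partially correct answers.
results = [QuestionResult(4, 4), QuestionResult(3, 4), QuestionResult(1, 2)]
passes, partial = score_run(results)
print(f"{passes}/{len(results)}, {partial:.3f}/{len(results)} partial")
# prints "1/3, 2.250/3 partial"
```

Under this assumption, a model that narrowly misses many questions keeps a high partial score while its strict pass count drops, which is the gap visible between the pass and partial columns.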

Model | Profile | Result | Avg latency | Notes
granite-4.1-3b-tim-resume:latest | Fast | 67/100, 94.324/100 partial | 6.282s | Current default model on corpus local-08ed82e14185; fresh May 28 rerun, max latency 14.390s, 0 citation failures.
hf.co/ibm-granite/granite-4.1-3b-GGUF:Q4_K_M | Fast | 60/100, 92.199/100 partial | 16.670s | Base Granite comparison on corpus local-08ed82e14185; max latency 39.344s, 0 citation failures.
hf.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF:Q4_K_M | Fast | 70/100, 93.759/100 partial | 12.653s | Latest saved 100-question run is from corpus local-77763852aa26; 0 citation failures.
qwen3.5:2b | Fast | 66/100, 93.743/100 partial | 9.401s | Latest saved 100-question run is from corpus local-77763852aa26; 0 citation failures.
granite4:tiny-h | Fast | 69/100, 94.209/100 partial | 9.773s | Legacy tiny comparison model on corpus local-77763852aa26; 0 citation failures.
jackrong-qwen35-fixed:latest | Thinking | No full current 100-question run; latest strict coverage-style run was 2/25, 20.600/25 partial | 51.001s | Older local-61da9f7ebdc7 25-question strict run; used for experimental Deep answers and Builder thinking comparison.
qwen3.5:4b | Thinking | 74/100 | 15.120s | Older corpus local-61da9f7ebdc7 comparison run; retained as historical Deep-model context.

25-Question Deep Synthesis Eval

This eval is harder. It asks 25 multi-source questions that need synthesis across several documents, not just one retrieved snippet. Each row has expected claims, required boundaries, forbidden claims, and a frozen evidence pack.
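One way to picture a row in this eval is as a small immutable record. The sketch below is an assumption about shape only: `SynthesisRow`, `validate_row`, the field names, and the placeholder values are hypothetical and do not come from the real eval files. The two-document minimum is also an assumed reading of "several documents".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisRow:
    """One deep-synthesis question (hypothetical schema, for illustration only)."""
    question: str
    expected_claims: tuple[str, ...]
    required_boundaries: tuple[str, ...]
    forbidden_claims: tuple[str, ...]
    evidence_pack: tuple[str, ...]  # ids of the frozen source documents


def validate_row(row: SynthesisRow) -> list[str]:
    """Return a list of structural problems; an empty list means the row looks usable."""
    problems = []
    # Assumption: a synthesis question should draw on at least two documents.
    if len(set(row.evidence_pack)) < 2:
        problems.append("evidence pack should span at least two documents")
    if not row.expected_claims:
        problems.append("row has no expected claims to grade against")
    # A claim cannot be both expected and forbidden.
    overlap = set(row.expected_claims) & set(row.forbidden_claims)
    if overlap:
        problems.append(f"claims both expected and forbidden: {sorted(overlap)}")
    return problems


# Placeholder content only; none of these strings are real eval data.
row = SynthesisRow(
    question="placeholder question",
    expected_claims=("placeholder claim A", "placeholder claim B"),
    required_boundaries=("placeholder boundary",),
    forbidden_claims=("placeholder forbidden claim",),
    evidence_pack=("doc-1", "doc-2"),
)
print(validate_row(row))
```

Freezing the record mirrors the idea of a frozen evidence pack: every model is graded against the same inputs, so score differences reflect the model rather than retrieval drift.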

Newer rows use subagent semantic judging on a 10-point scale. Older rows are normalized from earlier manual grades. The table reports semantic quality, answer bucket split, latency, and citation hygiene.

Model | Profile | Result | Avg latency | Notes
granite-4.1-3b-tim-resume:latest | Fast | 8.8/10; 23 strong, 2 passable, 0 problematic | 4.390s | Current default model on corpus local-08ed82e14185; 0 errors and 0 citation failures.
hf.co/ibm-granite/granite-4.1-3b-GGUF:Q4_K_M | Fast | 8.1/10; 20 strong, 5 passable, 0 problematic | 9.287s | Base Granite comparison on corpus local-08ed82e14185; 0 errors and 0 citation failures.
jackrong-qwen35-fixed:latest | Fast | 9.11/10; 22 pass, 3 borderline, 0 fail | 16.126s | Older six-model synthesis batch on corpus local-77763852aa26; high semantic grade but much slower than the default Fast model.
jackrong-qwen35-fixed:latest | Thinking | 9.00/10; 21 pass, 4 borderline, 0 fail | 74.608s | Experimental Deep route; useful for comparison, but too slow for normal public traffic.
hf.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF:Q4_K_M | Fast | 8.60/10; 19 pass, 5 borderline, 1 fail | 5.239s | Older six-model synthesis batch on corpus local-77763852aa26; 0 errors and 0 citation failures.
granite4:tiny-h | Fast | 7.39/10; 15 pass, 5 borderline, 5 fail | 10.503s | Legacy tiny comparison from corpus local-77763852aa26; malformed-output and privacy-wording risks were noted.
qwen3.5:2b | Thinking | Not in six-model manual grade; no-judge run clean | 104.147s | Completed all 25 with 0 errors and 0 citation failures, but the thinking profile was far too slow for the Builder dropdown.
qwen3.5:4b | Thinking | 1/25, 20.188/25 partial | 21.300s | Older strict-scored run, not part of the six-model manual grade. The Builder catalog exposes Qwen comparison models with public thinking disabled.

Current Takeaway

The normal resume Fast route is currently pinned to granite-4.1-3b-tim-resume:latest. The Builder dropdown exists for comparison: the Tim Resume Granite fine-tune is the default, base Granite is a reference point, Nemotron and Qwen are small-model comparisons, and jackrong-qwen35-fixed:latest is the only public thinking-capable option. Deep reasoning remains experimental because it increases latency sharply: the Thinking route averaged 74.608s on the deep synthesis eval, against 4.390s for the default Fast model.
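The quality-versus-latency trade-off behind this takeaway can be sketched as a simple filter. The scores and latencies below are copied from the graded rows of the 25-question deep synthesis table; `Row`, `pick_default`, and the budget values are hypothetical, and the rows come from different corpus snapshots, so this is a rough illustration rather than the actual selection procedure.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    model: str
    profile: str
    score: float          # semantic grade out of 10
    avg_latency_s: float  # average latency in seconds


# Graded rows from the 25-question deep synthesis table.
ROWS = [
    Row("granite-4.1-3b-tim-resume:latest", "Fast", 8.8, 4.390),
    Row("hf.co/ibm-granite/granite-4.1-3b-GGUF:Q4_K_M", "Fast", 8.1, 9.287),
    Row("jackrong-qwen35-fixed:latest", "Fast", 9.11, 16.126),
    Row("jackrong-qwen35-fixed:latest", "Thinking", 9.00, 74.608),
    Row("hf.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF:Q4_K_M", "Fast", 8.60, 5.239),
    Row("granite4:tiny-h", "Fast", 7.39, 10.503),
]


def pick_default(rows: list[Row], latency_budget_s: float) -> Optional[Row]:
    """Pick the highest-scoring row within the latency budget (ties go to the faster row)."""
    eligible = [r for r in rows if r.avg_latency_s <= latency_budget_s]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r.score, -r.avg_latency_s))


# Assumption: 6 seconds is an illustrative budget, not a documented threshold.
choice = pick_default(ROWS, latency_budget_s=6.0)
print(choice.model if choice else "no model fits the budget")
# prints "granite-4.1-3b-tim-resume:latest"
```

Raising the hypothetical budget to 20 seconds would instead select the jackrong-qwen35-fixed:latest Fast row on score alone, which shows why the latency limit, not just the semantic grade, drives the choice of default.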