51 points by adilhafeez 10 hours ago | 6 comments
pseudosavant 3 hours ago
Not that LLMs are terribly latency sensitive (you wait on a lot of tokens), but what kind of latency impact does this have on requests that go through the proxy?
adilhafeez 3 hours ago
Short answer: the latency impact is very minimal.

We use Envoy as the request handler, which forwards requests to a local service written in Rust. Envoy is proven to be high-performance, low-latency, and highly efficient at request handling. If I had to put a number on it, it would be single-digit milliseconds per request. I will have a more detailed benchmark in the coming days.
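In the meantime, a rough way to sanity-check the overhead yourself is to time requests going through the gateway and compare against hitting the provider directly. This is just a sketch, assuming the gateway exposes an OpenAI-compatible endpoint on localhost; the port and model name are placeholders, not our actual defaults:

    import time
    from openai import OpenAI

    # Point the standard OpenAI client at the local gateway (port is a placeholder).
    proxy = OpenAI(base_url="http://localhost:12000/v1", api_key="unused")

    def timed_request(client: OpenAI, model: str) -> float:
        start = time.perf_counter()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
        return (time.perf_counter() - start) * 1000.0  # milliseconds

    samples = sorted(timed_request(proxy, "gpt-4o-mini") for _ in range(20))
    print(f"median e2e latency through proxy: {samples[len(samples) // 2]:.1f} ms")

Subtracting the direct-to-provider median from this number gives you the proxy's added overhead.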

cotran2 3 hours ago
The model is a compact 1.5B, so most GPUs can serve it locally, and it has <100ms e2e latency. On an L40S, it's 50ms.
jgant13 5 hours ago
Solid. Can you show us when to use this vs. say OpenRouter? The performance seems strong for sure. TIA.
sparacha 4 hours ago
Arch is developer friendly, but designed with enterprise-grade customers in mind. The core contributors of Envoy redesigned the proxy substrate to handle prompts, offering something that is battle-tested in terms of resiliency, speed, and deployments. Second, OpenRouter offers a choice of models, but dynamically routing to LLMs based on user-defined usage policies is uniquely available in Arch. Hope that helps
sparacha 10 hours ago
Hi HN! I am one of the co-authors of the paper. If there are any questions about our approach, I would love to answer them.
_nh_ 4 hours ago
How do you compare with RouteLLM?
sparacha 3 hours ago
RouteLLM is essentially a benchmark-driven approach. Their framework chooses between a weak and a strong model and helps developers optimize for a metric called APGR (Average Performance Gap Recovered) — a measure of how much of the stronger model’s performance can be recovered when routing some queries to the weaker, cheaper model. However, their routing models are trained to maximize performance on public benchmarks like MMLU, BBH, or MT-Bench. These benchmarks may not capture subjective, domain-specific quality signals that surface in practice.

Arch-Router takes a different approach. Instead of focusing on benchmark scores, it lets developers define routing policies in plain language based on their preferences — like “contract analysis → GPT-4o” or “lightweight brainstorming → Gemini Flash.” Our 1.5B model learns to map prompts (along with conversational context) to these policies, enabling routing decisions that align with real-world expectations, not abstract leaderboards. Also, our approach doesn't require retraining the router model when new LLMs are swapped in or when preferences change.
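To make that concrete, here is a rough sketch of the flow, assuming you call the router model through the standard transformers chat pipeline. The policy wording, system prompt, and model mapping below are illustrative only; the exact prompt template for katanemo/Arch-Router-1.5B is documented on the model card:

    from transformers import pipeline

    # Plain-language routing policies and the LLM each one maps to (illustrative).
    routes = {
        "contract_analysis": "reviewing, summarizing, or extracting terms from legal contracts",
        "brainstorming": "lightweight, open-ended idea generation",
    }
    model_for_route = {"contract_analysis": "gpt-4o", "brainstorming": "gemini-2.0-flash"}

    router = pipeline("text-generation", model="katanemo/Arch-Router-1.5B")

    def pick_route(user_prompt: str) -> str:
        policy_text = "\n".join(f"- {name}: {desc}" for name, desc in routes.items())
        messages = [
            {"role": "system", "content": "Pick the single best route for the user message.\n"
                                          f"Routes:\n{policy_text}\nAnswer with the route name only."},
            {"role": "user", "content": user_prompt},
        ]
        # Chat-style input: the pipeline returns the conversation with the
        # assistant reply appended as the last message.
        out = router(messages, max_new_tokens=16)
        return out[0]["generated_text"][-1]["content"].strip()

    route = pick_route("Can you flag any risky clauses in this NDA?")
    print(route, "->", model_for_route.get(route, "default-model"))

In production the gateway handles this mapping and dispatch for you; the snippet just shows the shape of the policy-to-model lookup.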

Hope this helps.

cotran2 3 hours ago
There is a case study comparing with RouteLLM in the appendix.
tmaly 7 hours ago
Do you think it would be possible to quantize this model and still get good results?
sparacha 7 hours ago
Yes - we have already published a quantized version here: https://huggingface.co/katanemo/Arch-Router-1.5B.gguf. The performance difference with the quantized version is negligible. I'll run another analysis and update the thread shortly.
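If you want to try the quantized checkpoint locally, something like the following works with llama-cpp-python (the GGUF filename pattern is a placeholder; pick the actual quant file from the repo):

    from llama_cpp import Llama

    # Download and load the quantized router from the Hugging Face repo.
    router = Llama.from_pretrained(
        repo_id="katanemo/Arch-Router-1.5B.gguf",
        filename="*.gguf",  # placeholder glob; choose the specific quant you want
        n_ctx=4096,
    )

    out = router.create_chat_completion(
        messages=[{"role": "user", "content": "hello"}],
        max_tokens=16,
    )
    print(out["choices"][0]["message"]["content"])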
sparacha 6 hours ago
Overall performance degrades from 93.17 -> 92.99 with the quantized version.
jedisct1 6 hours ago
I tried using it to rate the difficulty level of coding tasks (for InferSwitch, an LLM router), but it performed far worse than Qwen2.5-Coder-7B (though, sure, 1.5B vs. 7B).
sparacha 6 hours ago
Can you share more about your evaluation setup? I would love to see the specific usage pattern, as we have tested our model against smaller LLMs and foundation models and our results show otherwise. Of course, routing policies should follow the best practices here: https://docs.archgw.com/guides/llm_router.html

Nonetheless, I'm super curious to learn more and see what we may be able to improve. This is technically not a classifier model; it's a usage prediction model (it feels like a classifier, but the intended usage is not quite the same).
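For what it's worth, one way to recast "difficulty level" as usage-style policies, which is closer to what the model was trained on, would be something like this (route names, descriptions, and model choices are purely illustrative, not from our docs):

    # Describe *usage* rather than asking for an easy/medium/hard label.
    routes = {
        "code_quick_fix": "small, self-contained edits: typos, one-line bug fixes, renames",
        "code_feature_work": "implementing a single function or small feature in one file",
        "code_complex_change": "multi-file refactors, architectural changes, tricky debugging",
    }
    model_for_route = {
        "code_quick_fix": "qwen2.5-coder-1.5b",
        "code_feature_work": "qwen2.5-coder-7b",
        "code_complex_change": "a-larger-frontier-model",
    }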

cotran2 6 hours ago
According to the post, the model is fine-tuned for routing to different tasks/domains. Classifying difficulty level is probably not the intended use case.