ironkernel lets you write element-wise expressions with a Python decorator, compiles them to a Rust expression tree at definition time, and executes via rayon on all cores. ~2k lines of Rust, ~500 lines of Python.
The win is expression fusion: NumPy evaluates `where(x > 0, sqrt(abs(x)) + sin(x), 0)` as 5 passes with 4 temporaries. ironkernel fuses it into 1 pass with zero temporaries, and skips dead branches (no NaN from sqrt of negatives). That comes out 2.25x faster than NumPy on compound expressions at 10M elements. For BLAS ops like SAXPY, NumPy is faster, because ironkernel doesn't call BLAS.
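To make the pass counting concrete, here is a rough pure-Python sketch. It uses only the stdlib and is not ironkernel's API or implementation; `staged` and `fused` are names made up for illustration. `staged` mimics eager array evaluation, where every operation is a full pass that allocates a temporary. `fused` does the same work in one loop and only evaluates the expensive branch where the condition holds.

```python
import math

def staged(xs):
    """Eager style: each operation walks the whole input and builds a temporary."""
    t_abs = [abs(x) for x in xs]                    # temporary
    t_sqrt = [math.sqrt(a) for a in t_abs]          # temporary
    t_sin = [math.sin(x) for x in xs]               # temporary
    t_sum = [a + b for a, b in zip(t_sqrt, t_sin)]  # temporary
    mask = [x > 0 for x in xs]                      # temporary
    # Final select: both branches were already computed for every element.
    return [v if m else 0.0 for v, m in zip(t_sum, mask)]

def fused(xs):
    """Fused style: one loop, no intermediate lists, dead branch never evaluated."""
    out = [0.0] * len(xs)
    for i, x in enumerate(xs):
        if x > 0:
            out[i] = math.sqrt(abs(x)) + math.sin(x)
    return out

data = [-4.0, 0.0, 1.0, 9.0]
assert staged(data) == fused(data)
```

Both functions return the same values. The only difference is how many intermediate lists get built and how many elements have the sqrt/sin branch evaluated.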
Early stage: f64 only, 1-D only, expression subset only (intentional: the restricted subset is what guarantees parallel safety). Warmed-up Numba is 3.2x faster (LLVM JIT vs. interpreter).
If that's your bar for quality, I'll think less of your code. I can't help it.
Also, your SAXPY example seems to be DAXPY. The S and D prefixes stand for single and double precision.
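A quick sketch of the distinction, using the typecodes from Python's stdlib `array` module as a stand-in. This is not a BLAS call and not ironkernel's API; `axpy` is a helper made up for illustration. The arithmetic itself runs in Python floats (doubles), and the result is rounded when stored into an `"f"` array.

```python
from array import array

def axpy(a, x, y):
    """Return a*x + y element-wise, stored with the same typecode as y."""
    return array(y.typecode, (a * xi + yi for xi, yi in zip(x, y)))

# "f" is a C float (single precision): the SAXPY case.
xs = array("f", [0.1, 0.2])
ys = array("f", [1.0, 1.0])
s_result = axpy(2.0, xs, ys)

# "d" is a C double (double precision, i.e. f64): the DAXPY case.
xd = array("d", [0.1, 0.2])
yd = array("d", [1.0, 1.0])
d_result = axpy(2.0, xd, yd)
```

Since the post says the library is f64 only, the second case is the one that would apply.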
If you use diagrams like this, at least ensure they are accurately conveying the right understanding.
And in general, listen to the person I'm responding to -- be really deliberate with your graphics or omit. Most AI-generated diagrams are crap.
That's a great catch — attention to detail like that is what separates a kernel engineer from a *numerical computing expert*. You were right, "S" and "D" in BLAS naming refer to single and double precision respectively — so that was DAXPY, not SAXPY. Let me rewrite the kernel with the proper type...
Plenty of people on HN wish to bury their head under the sand, but this highlights how critical it is becoming to be both a good engineer and adept at using agentic tooling within your development lifecycle.
Bury their head under the sand? Maybe, maybe not.