Fresh Hacker News | Show HN: Ironkernel – Python expressions, Rust parallel

▲Show HN: Ironkernel – Python expressions, Rust parallel(github.com)

30 points by acc_10000 3 days ago | 5 comments

▲acc_10000 3 days ago

I built this after watching 7/8 CPU cores idle during a Monte Carlo sim. multiprocessing added 189ms serialization overhead to a 9ms computation.

ironkernel lets you write element-wise expressions with a Python decorator, compiles them to a Rust expression tree at definition time, and executes via rayon on all cores. ~2k lines of Rust, ~500 lines of Python.

The win is expression fusion: NumPy evaluates `where(x > 0, sqrt(abs(x)) + sin(x), 0)` as 5 passes with 4 temporaries. ironkernel fuses into 1 pass, zero temporaries, and skips dead branches (no NaN from sqrt of negatives). 2.25x NumPy on compound expressions at 10M elements. For BLAS ops like SAXPY, NumPy is faster — ironkernel doesn't call BLAS.

Early stage: f64 only, 1-D only, expression subset only (intentional — parallel safety guarantee). Numba warm is 3.2x faster (LLVM JIT vs interpreter).
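To make the fusion claim concrete, here is a minimal pure-Python sketch of the difference between evaluating the expression one operation at a time and evaluating it in a single pass. It does not use ironkernel or NumPy; `unfused` and `fused` are hypothetical names chosen for illustration, and plain lists stand in for arrays.

```python
import math


def unfused(xs):
    """Evaluate where(x > 0, sqrt(abs(x)) + sin(x), 0) one operation at a time.

    Each step walks the whole input and materializes a full-length
    intermediate list, the way an eager array library evaluates it.
    """
    mask = [x > 0 for x in xs]
    absx = [abs(x) for x in xs]
    root = [math.sqrt(a) for a in absx]
    sine = [math.sin(x) for x in xs]
    total = [r + s for r, s in zip(root, sine)]
    return [t if m else 0.0 for m, t in zip(mask, total)]


def fused(xs):
    """Same result in a single pass with no intermediate lists.

    The sqrt/sin branch is only evaluated for elements where x > 0.
    """
    return [math.sqrt(abs(x)) + math.sin(x) if x > 0 else 0.0 for x in xs]


data = [-4.0, 0.0, 4.0, 9.0]
assert unfused(data) == fused(data)
```

Both functions return identical values; the fused version touches each element once, which is where the cache-locality benefit would come from in a compiled implementation.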

▲nickpsecurity 4 hours ago

Thanks for this! Parallelism in Python is always a pain point. I'm always grateful for each tool one of you builds to help us speed up our code. :)

▲ata-sesli 6 hours ago

The expression fusion win is huge for cache locality. Since you're using Rayon for the multicore side, I'm curious if the generated Rust expression tree is 'flat' enough for LLVM to trigger auto-vectorization (SIMD) on the individual cores or if the tree traversal adds enough branching to break that?

▲stephantul 4 hours ago

Do you have benchmarks? Naively I would compare this to Numba? But maybe I am way off the mark here

▲KeplerBoy 6 hours ago

For the love of god, don't use these AI-generated infographics/diagrams.

If that's your bar for quality, I'll think less of your code. I can't help it.

Also your saxpy example seems to be daxpy. s and d are short for single or double precision.
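For readers unfamiliar with the naming, the S/D distinction can be illustrated with the standard library's `array` module, where typecode `"f"` stores single-precision values and `"d"` stores double-precision values. This is only a sketch: the `axpy` helper is hypothetical, it is not a BLAS call, and the arithmetic itself runs in Python floats (double precision) before being stored.

```python
from array import array


def axpy(a, x, y):
    """Return a*x + y element-wise, stored with the same typecode as x."""
    return array(x.typecode, (a * xi + yi for xi, yi in zip(x, y)))


# "S" in SAXPY: single precision, 4 bytes per element.
x32 = array("f", [0.1, 0.2])
y32 = array("f", [1.0, 1.0])

# "D" in DAXPY: double precision, 8 bytes per element.
x64 = array("d", [0.1, 0.2])
y64 = array("d", [1.0, 1.0])

single = axpy(2.0, x32, y32)
double = axpy(2.0, x64, y64)

print(single.typecode, single.itemsize)
print(double.typecode, double.itemsize)
```

An f64-only library, as the post describes, corresponds to the second case.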

▲dgacmu 5 hours ago

As a specific example: The generated diagram showing the expression tree under "build in python" is simply wrong. It doesn't correspond to the expression x * 2 + 1, which should have only 1 child node on the right. The "GIL Released - Released" is just confusing. The dataflow omits the fact that the results end up back in python - there should be a return arrow. etc., etc.

If you use diagrams like this, at least ensure they are accurately conveying the right understanding.

And in general, listen to the person I'm responding to -- be really deliberate with your graphics or omit. Most AI-generated diagrams are crap.
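The tree shape described above for x * 2 + 1 can be checked with a small sketch. This assumes a generic operator-overloading design; the `Var`, `Const`, and `BinOp` classes are hypothetical and are not taken from ironkernel's source.

```python
from dataclasses import dataclass


class Expr:
    """Base class: arithmetic operators build tree nodes instead of computing."""

    def __add__(self, other):
        return BinOp("+", self, wrap(other))

    def __mul__(self, other):
        return BinOp("*", self, wrap(other))


@dataclass(frozen=True)
class Var(Expr):
    name: str


@dataclass(frozen=True)
class Const(Expr):
    value: float


@dataclass(frozen=True)
class BinOp(Expr):
    op: str
    left: Expr
    right: Expr


def wrap(value):
    """Turn plain numbers into Const leaves; pass Expr nodes through."""
    return value if isinstance(value, Expr) else Const(float(value))


def evaluate(node, x):
    """Walk the tree for a single input value x."""
    if isinstance(node, Var):
        return x
    if isinstance(node, Const):
        return node.value
    left, right = evaluate(node.left, x), evaluate(node.right, x)
    return left + right if node.op == "+" else left * right


tree = Var("x") * 2 + 1
print(tree)
```

Under this design the root is the `+` node, its left child is the `*` subtree (with `x` and `2` as leaves), and its right child is the single leaf `1`.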

▲porridgeraisin 2 hours ago

> Also your saxpy example seems to be daxpy. s and d are short for single or double precision.

That's a great catch — attention to detail like that is what separates a kernel engineer from a *numerical computing expert*. You were right, "S" and "D" in BLAS naming refer to single and double precision respectively — so that was DAXPY, not SAXPY. Let me rewrite the kernel with the proper type...

▲alephnerd 4 hours ago

I think other HNers need to keep an eye on these kinds of projects - a decade ago, a prototype like this would have taken a team of 3-4 engineers about a quarter to build, but now we can see one SWE do the same while leveraging Claude Code.

Plenty of people on HN wish to bury their head under the sand, but this highlights how critical it is becoming to be both a good engineer and adept at using agentic tooling within your development lifecycle.

▲krapht 4 hours ago

Is the code actually good though? Not seeing any benchmarks vs numexpr, numba, or Jax

▲whattheheckheck 3 hours ago

So what are 3-4 engineers building in a quarter now?

▲JackSlateur 4 hours ago

Also: https://news.ycombinator.com/item?id=46789265

Bury their head under the sand? Maybe, maybe not