I haven't used a non-laptop GPU in some time, but that is a crazy amount of "idle" power consumption. Is this normal for cards like this?
For server gear it’s more common to have less dynamic power and voltage switching because it produces more predictable performance and latency.
That aside, idle power consumption is a driver-to-driver affair from both AMD and novideo: sometimes I'm only pulling 15-30W when nothing is happening, and other times it decides it needs 110W for a static 500Hz screen.
[0] - https://stackoverflow.com/questions/11227809/why-is-conditio...
GPUs do branch prediction? I thought they didn't bother and instead tried to minimize wasted effort by using large numbers of concurrent threads?
And what of the Pi test - I’d expect that to flip many more bits than the 1-bit one.
Normal vs uniform is less clear, but also not as much of a difference. The argument about signs isn't just about a sign bit, though. The way you negate during accumulation is that you flip all the bits. Only the final float representation is sign+magnitude; the accumulation itself has two's complement steps. I don't actually know the analysis here, just pointing out that it's not that simple.
If you switch a NOT gate's input from zero to one to zero and so on, the gate capacitance has to charge and discharge. The entire idea behind CMOS is that if you have n- and p-channel transistors together, you can take advantage of the fact that electrons are more mobile than holes. Filling and draining electrons gives you a greater switching speed.
If the input stays the same, then the charge at the input inside the flip-flop is the same as the charge inside the NOT gate. No charge differential means no electrons move, so no current flows through the resistance of the internal metal and polysilicon interconnect to heat it up, and less power gets lost. And no switching obviously happens faster than some switching.
TL;DR If you randomize the data, you will constantly charge and discharge the capacitors.
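To put a rough number on the charge/discharge argument above: dynamic power in CMOS is usually modelled as proportional to an activity factor (the fraction of nodes that toggle per cycle) times capacitance, voltage squared and frequency. Here is a minimal Python sketch that only estimates that activity factor for a stream of 16-bit words. The function names and the 16-bit width are my own choices for illustration, and a real datapath has far more internal nodes than the input bus modelled here.

```python
import random

def toggles(words, width=16):
    """Count bit positions that change between consecutive words."""
    mask = (1 << width) - 1
    total = 0
    for prev, cur in zip(words, words[1:]):
        # XOR leaves a 1 in every bit position that flipped.
        total += bin((prev ^ cur) & mask).count("1")
    return total

def activity_factor(words, width=16):
    """Fraction of bit positions that flip per transition (0.0 to 1.0)."""
    transitions = len(words) - 1
    return toggles(words, width) / (transitions * width)

rng = random.Random(0)
constant = [0x3C00] * 1000                               # same pattern every cycle
randomized = [rng.getrandbits(16) for _ in range(1000)]  # uniformly random bits

print(activity_factor(constant))    # 0.0, nothing ever flips
print(activity_factor(randomized))  # close to 0.5 for uniformly random bits
```

Constant data never toggles the inputs, while random data flips about half the bits on every transition, which is the "constantly charge and discharge" case.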
I could certainly come up with alternative theories about memory compression and prefetching if we were talking about texture reads.
Hardware is different. Every operation that can be performed in hardware by a chip needs dedicated circuitry. Special casing 0 and 1 means adding at least OR reduction on each operand and a dedicated multiplexer for every bit of the output. Those transistors use power even when they're not in use (leakage power is a huge issue on modern semiconductor processes). They also degrade timing by adding more gates on critical paths through the multipliers. (The timing issue here is that all operations that happen between one flip-flop and another flip-flop need to finish within one clock cycle.) And unless there are whole blocks of 0's and 1's (this does happen in certain neural networks), you typically won't see a direct speedup anyway. In software terms, the matrix multiply is scheduled as many parallel operations that cannot be accelerated much overall by skipping a few operations in some "threads."
All of this makes zero skipping a nontrivial topic. People do still try to do it but it needs serious consideration as, depending on the application, the case is rarely one-sided.
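The "whole blocks of 0's" point can be sketched with a toy model. Assume (my simplification, not a description of any real GPU) that a group of lanes advances in lockstep, so a step can only be skipped if every operand in it is zero. The function name and the 32-lane width are hypothetical.

```python
import random

def skippable_fraction(weights, lanes):
    """Fraction of lockstep steps where every lane has a zero operand.

    Lanes advance together, so a step is only skippable when all of its
    operands are zero.
    """
    steps = [weights[i:i + lanes] for i in range(0, len(weights), lanes)]
    skipped = sum(1 for step in steps if all(w == 0 for w in step))
    return skipped / len(steps)

rng = random.Random(0)
lanes = 32
n = lanes * 1000

# Unstructured sparsity: each weight is zero with probability 0.5.
scattered = [0 if rng.random() < 0.5 else 1 for _ in range(n)]

# Block sparsity: roughly the same share of zeros, but in whole 32-wide blocks.
blocked = []
for _ in range(n // lanes):
    blocked.extend([0] * lanes if rng.random() < 0.5 else [1] * lanes)

print(skippable_fraction(scattered, lanes))  # essentially nothing can be skipped
print(skippable_fraction(blocked, lanes))    # roughly half the steps can be skipped
```

Both inputs are about half zeros, but only the blocked layout gives the scheduler anything to skip. With scattered zeros, the chance that all 32 lanes are zero at once is 2^-32 per step.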
How much die space ($) will that circuitry add, when there's probably a statistically near-zero chance that your main customers' workloads will ever use it (who has model weights of 0 or 1!?)? And, if you can stomach the cost, what else could you put there instead?
What percent of this hardware is running inference for ReLU models? ;)
The die area argument here makes no sense. Supporting structural sparsity can be done either by duplicating the multipliers with and without the support or you have a single general purpose multiplier that does both, in which case you can have twice as many of them.
Also, in ReLU^2 networks, 90%+ parameters are zero.
Any logic you add to the GPU is physical silicon and metal that take up physical space.
> duplicating the multipliers with and without the support or you have a single general purpose multiplier that does both
That would be extra physical logic, which would be extra physical space on the die. "Can be done" isn't my point; it's that "doing it requires surface area".
There was a workshop paper from SC24 that did more experiments around this I believe. I can't find it now though.
~257.5 teraflops for normal distribution, versus ~268 teraflops uniform, reported on the first graph.
I would have liked to see a straight graph of performance vs. clock speed, for each type of data. Pick your data statistics, then pick the peak performance clock speed accordingly.
And for actual runs, pick it from a curve sampled before the run.
Power limiting does not improve performance but it does improve efficiency. You might be able to get 90% of the performance for only 70% of the power usage, for example. It does not make the card go faster though.
This is precisely because of the efficiency. The lower efficiency at the higher speed triggers much lower performance sooner.
This is not true unless the throttling algorithm is so broken that it's oscillating between extremes.
The parts have a curve of clock speed versus voltage. More clock speed means higher performance. That goes further up the voltage curve, meaning more power.
Throttling just moves the card further down the voltage to clock speed curve. It reduces clock speed, reducing performance.
The cards don't "perform faster by running slower". If you run the card slower, it performs slower.
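The clock/voltage curve described above can be illustrated with a toy model. Everything numeric here is invented: the linear voltage curve, the constant `k`, and the assumption that throughput scales linearly with clock are placeholders, not measurements from any real card. It uses the usual approximation that dynamic power goes as frequency times voltage squared.

```python
def voltage(freq_ghz):
    """Hypothetical linear voltage/frequency curve (made-up coefficients)."""
    return 0.6 + 0.25 * freq_ghz

def dynamic_power(freq_ghz, k=100.0):
    """Toy dynamic power in watts: P = k * f * V^2 (k is arbitrary)."""
    v = voltage(freq_ghz)
    return k * freq_ghz * v * v

def perf(freq_ghz):
    """Assume throughput scales linearly with clock speed."""
    return freq_ghz

for f in (1.2, 1.5, 1.8, 2.1):
    p = dynamic_power(f)
    print(f"{f:.1f} GHz  {p:6.1f} W  perf/W = {perf(f) / p:.4f}")
```

In this model a higher clock always means more performance and more power, while performance per watt falls as you climb the curve. That is consistent with both "power limiting improves efficiency" and "running slower performs slower". It says nothing about how a real throttling algorithm behaves over time.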
That algorithm is doing exactly the task I described. If it could temporarily run faster but in a way that would cause oscillation, that literally means it can run faster but is choosing not to, in order to preserve overall performance.
https://clehaxze.tw/gemlog/2025/04-21-programming-tensotrren...
https://clehaxze.tw/gemlog/2026/01-22-the-real-tenstorrent-t...
When you make it so the computer does not have to compute all possible states of matter it finishes faster.