1) Strategic pipeline lengths -- long pipelines drive throughput, short pipelines drive interrupt responsiveness. 5-stage pipelines are still popular for realtime cores.
2) Heterogeneous cores -- a mix of short- and long-pipeline cores on a single chip, with some optimized for responsiveness and some optimized for throughput. (This could actually be added to the µp article as well, discussing big.LITTLE style heterogeneity with some cores optimized for total throughput and some optimized for power efficiency.) Unlike in the µp case, this is paired with a general assumption that cores are usually developer-managed (asymmetric multiprocessing) rather than magically managed by a scheduler (symmetric multiprocessing). (Dedicated cores for low power come up in µcs too.)
3) Fast memories; some very fast memories. Everything fits in SRAM on chip. Some SRAM is tightly coupled to a specific core (tightly coupled memory), which gives access as fast as a single cycle; some is hanging off an AXI bus to allow sharing between cores, but that adds a few cycles (and possible collisions) to each access, making caches still relevant (which has not always been true for µcs). The µp developer approach to performance of "memory accesses rule everything" is not nearly as true on µcs.
4) Peripherals and accelerators dominate silicon area, and dominate system performance. (This can also be said of µps these days.) Proper use of DMA engines can completely change the solution to problems. Smart peripherals unload huge amounts of work from the core, making the core less important -- in some cases, it's really just there to configure the DMA engine and the peripherals. (This sounds an awful lot like cores on a µp just feeding GPUs these days.)
5) Topology awareness. Multiple AXI busses and peripheral busses; software needs to be aware of what peripheral or SRAM chunk hangs off what bus to maximize performance, minimize collisions, or even just to allow the peripheral to be used at all from a given core in a given power state. This has some similarities to NUMA awareness in µp development, but as with AMP vs SMP it's generally more visible to developers.
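To put rough numbers on items 3 and 5, here is a toy model of the average cost of one SRAM access, comparing tightly coupled memory with SRAM behind a shared bus. It's only a sketch: the function names and every cycle count and probability in it are made up for illustration, not taken from any real part.

```python
def expected_access_cycles(base_cycles, bus_overhead_cycles=0.0,
                           collision_prob=0.0, collision_stall_cycles=0.0):
    """Expected cycles for one SRAM access in a toy model.

    base_cycles: cost of the SRAM access itself.
    bus_overhead_cycles: extra cycles for crossing a shared bus.
    collision_prob / collision_stall_cycles: chance that another bus
    master is in the way, and how long we stall when that happens.
    """
    return base_cycles + bus_overhead_cycles + collision_prob * collision_stall_cycles


def cached_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Average access cost when a cache sits in front of the slower memory."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
tcm = expected_access_cycles(base_cycles=1)
shared = expected_access_cycles(base_cycles=1, bus_overhead_cycles=3,
                                collision_prob=0.25, collision_stall_cycles=4)
shared_cached = cached_access_cycles(hit_rate=0.75, hit_cycles=1, miss_cycles=shared)

print(tcm, shared, shared_cached)
```

With these invented inputs the model gives 1 cycle for TCM, 5 for the shared SRAM, and 2 once a cache with a 75% hit rate sits in front of it, which is the sense in which caches are "still relevant" here.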
I could keep going... there's an article here.
Wish list of topics to add:
- branch predictors that can detect patterns (edit: I guess it's already covered in the paragraph about raising prediction accuracy)
- LRU-approximations in L1 caches
- Data prefetching (sequential, stride)
- Return address stack
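For the LRU-approximation item, one common scheme is tree pseudo-LRU. Below is a minimal sketch for a single 4-way set, assuming the usual convention that each bit points toward the half to evict from next. The class and method names are my own, and real caches differ in the details.

```python
class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way cache set (3 bits instead of a full ordering)."""

    def __init__(self):
        # bits[0]: root, chooses between ways {0,1} and ways {2,3}
        # bits[1]: chooses between way 0 and way 1
        # bits[2]: chooses between way 2 and way 3
        # Each bit points at the side to evict next: 0 = left, 1 = right.
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access: flip the bits on the path so they point away from `way`."""
        if way < 2:
            self.bits[0] = 1                      # next victim comes from the right half
            self.bits[1] = 1 if way == 0 else 0   # and, within the left pair, not this way
        else:
            self.bits[0] = 0                      # next victim comes from the left half
            self.bits[2] = 1 if way == 2 else 0

    def victim(self):
        """Follow the bits from the root to pick the way to replace."""
        if self.bits[0] == 0:
            return 0 if self.bits[1] == 0 else 1
        return 2 if self.bits[2] == 0 else 3


plru = TreePLRU4()
for way in (0, 1, 2, 3, 0):
    plru.touch(way)

# True LRU would evict way 1 here (it is the oldest access),
# but the tree only remembers one bit per node and picks way 2.
print(plru.victim())
```

The point of the approximation is cost: three bits per set and a couple of bit flips per access, at the price of occasionally evicting a line that is not the true least-recently-used one.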
Concerning μops, I think the 68060 did that, too.
Also, FLOPs per watt hasn't changed as much lately, so thinking in terms of watts over a few-year time horizon does at least give you a ballpark of how many FLOPs.
If you're talking about a chip, you obviously care more about FLOPs, but even down to the level of individual rack units, wattage is a serious concern. Not every facility is built for these crazy 200 kW racks.
Modern Microprocessors – A 90-Minute Guide (2001-2016) - https://news.ycombinator.com/item?id=27014027 - May 2021 (41 comments)
Modern Microprocessors – A 90-Minute Guide (2001-2016) - https://news.ycombinator.com/item?id=18230383 - Oct 2018 (87 comments)
Modern Microprocessors – A 90 Minute Guide - https://news.ycombinator.com/item?id=11116211 - Feb 2016 (12 comments)
Modern Microprocessors: A 90 Minute Guide - https://news.ycombinator.com/item?id=7174513 - Feb 2014 (37 comments)
Modern Microprocessors: A 90 Minute Guide - https://news.ycombinator.com/item?id=2428403 - April 2011 (30 comments)