A friend of mine is working on a custom memory controller for specialized devices at Google Cloud. The software people are putting tremendous pressure on the hardware people to make sure the memory controller performs well with all common power-of-two strides, as well as random strides.
There are some (very primitive by software standards) "hash functions" implemented in hardware to make that work. IIRC at that time they treat the relevant bits in an address as a vector in GF(2) and multiply by a certain hardcoded invertible matrix. (Of course there are properties for that matrix besides being invertible.) The idea is that if a single bit in the input address changes, all output bits should have equal probability of flipping and not flipping (the avalanche effect).
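A hash like the one described can be sketched in a few lines. This is a hypothetical illustration, not the actual hardware design: each output bit is the parity (XOR) of the input address bits selected by one matrix row, which is exactly a matrix-vector multiply over GF(2).

```python
def gf2_hash(addr_bits: int, rows: list) -> int:
    """Multiply an 8-bit address vector by a matrix over GF(2).

    Each entry of `rows` is one matrix row packed as a bitmask;
    output bit i is the parity of (addr_bits & rows[i]).
    """
    out = 0
    for i, row in enumerate(rows):
        parity = bin(addr_bits & row).count("1") & 1
        out |= parity << i
    return out

# Example 8x8 matrix, rows chosen arbitrarily for illustration; a real
# design would verify invertibility and avalanche quality.
MATRIX = [0b10011010, 0b01101101, 0b11010011, 0b00110110,
          0b10110100, 0b01011011, 0b11100101, 0b00101110]
```

One nice property follows for free: the map is linear over GF(2), so `gf2_hash(a ^ b) == gf2_hash(a) ^ gf2_hash(b)`, and flipping one input bit always XORs the output with the corresponding matrix column.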
DRAM is built on hardware technology with hard physical constraints. Each DRAM bit is addressed by rank, bank, chip, row, and column. Depending on the configuration, you typically read/load a whole cache line from DRAM at once, which is nice and easy. However, AFAIU the chip needs a recovery delay after a read, which means a read of the subsequent cache line would be slow. Therefore, there is a strong reason to lay out subsequent data far away, ideally on different chips.
So you can read cache line #0, which lives on some physical chip #A, and if you then read line #1 it should not be on chip #A but ideally on a "random" other physical die.
This mapping of a physical memory address onto where the data actually lives in the DRAM is very much a hash function. It is usually chosen by the CPU manufacturer, I guess. See the reversing of the Sandy Bridge algo:
Rowhammer is an attack where the attacker targets a single physical DRAM chip and causes it to misbehave by keeping it active in unusual patterns. Rowhammer is basically about reversing that hashing and exploiting the shortcuts DRAM manufacturers took when producing the chips.
But this is about DRAM. I don't know how it works in the cache, but I don't think there is a special hashing algo; it just uses the middle bits of the address to index the cache line, as the parent article says.
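The "middle bits" indexing is simple enough to show directly. A minimal sketch, assuming a common geometry of 64-byte lines and 64 sets (the exact numbers vary by CPU):

```python
LINE_SIZE = 64   # bytes per cache line -> low 6 bits are the byte offset
NUM_SETS  = 64   # 64 sets -> the next 6 "middle" bits are the set index

def set_index(addr: int) -> int:
    # Drop the offset bits, then keep only the set-index bits.
    return (addr // LINE_SIZE) % NUM_SETS

# A stride of LINE_SIZE * NUM_SETS bytes (4096 here) lands in the same
# set on every access, which is the pathological case discussed here.
indices = {set_index(i * LINE_SIZE * NUM_SETS) for i in range(100)}
```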
It is related because a naïve program will access a device (CPU registers, instruction pipelines, indirect cache access, RAM, disk, optical disk) in a linear fashion. RAM no longer has linear or near-linear access. Disk drives of all types dealt with rotation speed, head seek speed, etc. For a long time, disk specs were quoted as linear/sequential access time, full-seek time, and average random access time, until sequential access time dropped below 1 ms and OSes/disk cache programs made average random access time the only latency worth caring about.
In summary, most storage devices have hierarchical performance and it’s worth caring about how to optimize for that.
The POWER7 (2010) already had a very powerful prefetching engine (also programmable) able to detect a multitude of patterns/strides, supporting as many as 12 different streams. Its design, its composability, and not only its automatic detection but its programmability, which lets you drive its behavior or change it across different computation phases, made it a fantastic piece of engineering for its time: https://www.redbooks.ibm.com/redbooks/pdfs/sg248079.pdf
Just pointing this out (there are several generations of POWER with improved designs) since you mentioned memory controllers. The problem in the blog post is different (aliasing), and I guess all I can say is that unless some randomness is introduced, every design will have a "worst" access pattern.
Under the hood, this hashing has existed for a long time, certainly since Haswell at Intel. However, Intel designed it to make sequential access still mostly sequential, and combined it with prefetch logic that is decent at detecting strided access if you poke it the right way.
Also, main memory has a different hash than L3 cache does, and this hash is designed for sequential access.
Can I read more about this somewhere? On Haswell I've observed a 5% speedup (for a whole program that mostly does a bunch of other work) from spacing a bunch of arrays (typically all pretty small, and all accessed consecutively all the way through) so that their start positions differ by some number of cache lines other than a power of 2. So it certainly seemed to me like I was working with a set-associative cache where power-of-2 strides were pathological.
This seems pretty wasteful compared to the alternative of writing software that does not use power-of-two strides and deleting this part of the hardware!
Rewriting software is often significantly more expensive (or, not even possible when dealing with third party code!) whereas with hardware you can fix it once for everyone and see a large speed up across the entire system.
Do specialized devices at Google Cloud run all sorts of customer software? Do the customers typically target other CPUs that have this extra stuff in their cache?
I can't tell you how many large-power-of-two stride loops I've seen. I'm going to propose that CPU (or compiler) makers make sure they aren't 10x slower.
It's not a power-of-two stride loop that's the problem, it is a sparse power-of-two access pattern. Or more accurately, the cache lines congruent modulo (cache size divided by associativity) that you don't access are unused capacity in your cache.
Computers are designed almost exclusively around the idea of spatial and temporal locality of access, so it turns out this kind of access pattern doesn't happen all that much, and when it does it's often suboptimal code that's not worth a CPU designer trying to optimize much for anyway. At least not in the primary caches where indexing delay is critical so anything but power of two is prohibitive. It is good to be aware of though, and occasionally it can be noticeable and come in use.
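The "unused capacity" point can be made concrete with a quick model. The numbers below describe a hypothetical 32 KiB, 8-way cache with 64-byte lines, chosen for illustration:

```python
CACHE_SIZE = 32 * 1024
WAYS       = 8
LINE       = 64
NUM_SETS   = CACHE_SIZE // (WAYS * LINE)   # 64 sets in this model

def sets_touched(stride_bytes: int, n_accesses: int) -> int:
    """How many distinct sets a strided walk actually uses."""
    return len({(i * stride_bytes // LINE) % NUM_SETS
                for i in range(n_accesses)})

pow2 = sets_touched(4096, 1000)       # power-of-two stride: 1 set used
odd  = sets_touched(4096 + 64, 1000)  # one extra line: all 64 sets used
```

With the power-of-two stride, every access is congruent modulo `NUM_SETS * LINE`, so 63/64 of the cache's capacity sits idle; nudging the stride by one line spreads the accesses over every set.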
I remember an issue I found years ago: we had a supercomputer, and the kernel was handing out memory to each process in powers of two, because its memory allocator had a per-CPU magazine that would refill from a larger shared pool in batches, and those batch sizes were powers of two (with power-of-two allocation sizes).
So each process would only be given memory that ever indexed into like 1/2 or 1/4 of the cache on its CPU. Other styles of workloads never noticed because you would have lots of irregularity, things allocating and freeing and context switching and all different activity on different CPUs. But on supercomputers you often have a pretty clean memory space with little other activity, and work kicks off by starting off ~identical processes on all CPUs and each allocate large chunks of memory.
From memory, the fix was making the magazines non-power-of-two plus some very simple page-coloring heuristics. I'm sure it wasn't very optimal and many a PhD could still be had studying allocators, but it got the acceptance testing over the line.
Except in-order memory access is what modern CPUs aim for (and basically every computation device, be it CPU or GPU, for the last 30+ years).
Strided access has always been slow, on CPUs, on GPUs, on everything, for the last 30 years. It's a known issue. "Fixing" strided memory access means breaking other things (such as in-order memory access), and that's just not worth it.
-------
This access pattern (strided memory access) leads to channel conflicts, bank conflicts, and other uneven memory accesses in GPU space.
This entire behavior around associative caches (CPUs) or channel/bank conflicts (GPUs) is the convention, and has been for decades. I don't see any good reason to change.
If it really is a problem for your code, then you can lay out the memory differently in your code. (Ex: GPU programmers lay out their structures to be non-power-of-two, with innocuous padding in their structures to help mitigate this channel-conflict problem.)
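The padding trick is easy to model for the shared-memory bank-conflict case. A sketch assuming 32 banks interleaved at element granularity (typical for CUDA shared memory; numbers are illustrative):

```python
BANKS = 32  # assumed bank count, as on most CUDA-capable GPUs

def banks_touched(row_elems: int, col_accesses: int = 32) -> int:
    """How many distinct banks a column-wise walk of a 2D array hits,
    given the row width in elements."""
    return len({(r * row_elems) % BANKS for r in range(col_accesses)})

conflicting   = banks_touched(32)      # power-of-two row: every access
                                       # hits the same bank
conflict_free = banks_touched(32 + 1)  # one element of padding: the
                                       # walk spreads over all 32 banks
```

This is why you see `__shared__ float tile[32][33]` rather than `[32][32]` in so many GPU kernels: the single padding element costs a little memory but removes the serialization.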
------
At this point, you'd end up breaking all the code optimized over the last 30 years in order to fix slow code (which has always been slow, and therefore probably doesn't need to be optimized). That's just a bad move.
> Except in order memory access is what modern CPUs aim for
> Strided access has always been slow, on CPUs, on GPUs, on everything
Yes and no. It's not as if hardware engineers completely stuck our heads in the sand and just let the performance be shitty; they've actually developed very sophisticated prefetchers (in this case, literally called a "stride prefetcher"). Many modern stride prefetchers can detect and launch concurrent line faults at strides of over a megabyte. Here, we're explicitly exploiting that line faults _do not_ need to happen in order, thanks to the memory ordering rules provided by the ISA.
While you will still get crummy cache utilization due to hammering a single set, if you are legitimately just going once through a loop and operating on single items (i.e. you're streaming and not returning to items once accessed, like in the article), the prefetcher will make a huge difference and get you a decent hit rate.
Not to mention, there are good reasons for using powers of two, such as vectorizability.
If the cache is vulnerable to "bad" access patterns when those patterns are the obvious ones anyone would reach for, then it's the cache's fault. Even just XORing the address with a hardcoded, pseudorandom bit pattern would have alleviated the issue.
> Even just XORing the address with a hardcoded, pseudorandom bit pattern would have alleviated the issue.
Erm, no it wouldn't. Two addresses with identical index bits will still collide after you XOR those identical indexes with the (same) random pattern.
You would need to do some bit mixing. The simplest would be to multiply the non-offset bits by a prime. But a multiply is still a lot slower than physically wiring the index bits to the mux.
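A sketch of that bit-mixing idea, using multiplicative (Fibonacci-style) hashing; the constant is Knuth's well-known 2654435761 (near 2^32 divided by the golden ratio), and the 64-set geometry is an assumption for illustration:

```python
NUM_SETS = 64  # assumed; real caches vary

def mixed_index(line_addr: int) -> int:
    # Multiply by an odd constant, keep 32 bits, take the TOP 6 bits:
    # the high bits of the product mix in every input bit.
    return ((line_addr * 2654435761) & 0xFFFFFFFF) >> (32 - 6)

# A stride of 64 lines aliases every access to set 0 under plain
# modulo indexing, but spreads widely under the mixed index:
plain = {(i * 64) % NUM_SETS for i in range(256)}
mixed = {mixed_index(i * 64) for i in range(256)}
```

The catch is exactly the one stated above: a multiplier in the index path adds latency that is fine for a hash-table lookup in software, but prohibitive for an L1 where the index has to be ready within a fraction of a cycle.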
The problem is a little more nuanced than the article suggests. Notice that you don't actually need any of the data to stay in the cache during the loop—you read and write an address, then never return to it again. So the fact that you overflow your associativity doesn't specifically matter; you're happy to have it evict any or every entry.
I often get these things wrong, but I believe what's happening is that it's getting blocked on the writeback from the cache evictions (because they're now dirty from the fresh write). Which means that the 256 vs 257 distinction is a tiny bit fake; if you have a large enough array, both will eventually slow down to at best the speed of the next lower cache layer. The 257 stride will postpone the reckoning longer if the cache starts out partly clean, since the writes will just pile up in the cache for a while. (Shorter version: the 257 stride will indeed probably go faster, but it's probably better to think of the effective size of the cache rather than thinking of it as just faster vs slower.)
The article carefully sets N to 2^21 to demonstrate the behavior. (The overall conclusion is still correct!)
I suspect that the article is using a hot cache, and that the slowdown wouldn't occur with a cold cache. The article implies that it is using a hot cache by saying "The array stops fitting into the L3 cache":
> Since the cache system uses the lower 6 bits for the offset and the next 12 for the cache line index, we are essentially using just 2^(12−(10−6)) = 2^8 different sets in the L3 cache instead of 2^12, which has the effect of shrinking our L3 cache by a factor of 2^4 = 16. The array stops fitting into the L3 cache (N = 2^21) and spills into the order-of-magnitude slower RAM, which causes the performance to decrease.
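The arithmetic in the quoted passage checks out, and you can verify it mechanically (6 offset bits, 2^12 sets, and a stride of 2^10 bytes, as stated there):

```python
OFFSET_BITS = 6
SET_BITS    = 12
STRIDE      = 2 ** 10   # bytes

# With this stride, the low (10 - 6) = 4 bits of the set index never
# vary, so only 2^(12 - 4) = 2^8 of the 2^12 sets are ever used.
sets_used = len({(i * STRIDE >> OFFSET_BITS) % (1 << SET_BITS)
                 for i in range(1 << SET_BITS)})

shrink_factor = (1 << SET_BITS) // sets_used  # the 16x shrink
```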
> Even just XORing the address with a hardcoded, pseudorandom bit pattern would have alleviated the issue.
Yeah, that's in the same vein as Seznec's "skewed-associative" cache, where each way of the cache indexes its lines with a different hash of the address.
I’m confused by the cache graphs: it looks like the direct-mapped cache has multiple lines drawn from a single cache line to different blocks in main memory.
From what I understand, a single hardware cache line wouldn't hold data from multiple main-memory blocks at the same time.
The direct mapped cache graph is just showing that each cache line is mapped to multiple addresses.
So in that direct mapped example blocks 0, 64, and 256 are all mapped to cache line 0, and you can only have one of them in cache at any given moment. If block 0 is in cache and block 256 gets accessed, block 0 will be evicted from cache because block 256 will replace it in cache line 0.
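The mapping in that example can be written down in one line, assuming a direct-mapped cache with 64 lines (the count that makes blocks 0, 64, and 256 collide):

```python
NUM_LINES = 64  # assumed cache size, in lines, matching the example

def line_of(block: int) -> int:
    # Direct-mapped: the block number modulo the line count picks
    # the one and only line that block may occupy.
    return block % NUM_LINES

# Blocks 0, 64 and 256 all land in line 0, so at most one of them can
# be cached at a time; accessing block 256 evicts block 0.
```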
I see. I was interpreting the line as “is stored in the cache from” when it really means “can be stored in the cache at”. Using directed arrows might help some. Thanks!
A connecting line there means that the data from that memory location can be stored in that cache line. In the fully associative cache, any memory location can use any cache line. In the direct-mapped cache, each memory location has one designated cache line it must use (but each cache line has multiple main memory locations associated with it).
This was very helpful and explained very clearly and directly. Thank you. I'm having trouble keeping up with some of the comments in this thread, but this one has been very illuminating.
Old memory from the 1990s: some high-end (back then) CPU had a large (for the time), fast L2 cache... which was merely 2(?)-way associative. But to mitigate the performance hit from cache thrashing, it also had a smallish, 64(?)-way associative "helper" D-cache.
(It might have been HP PA-RISC, from an article in Byte Magazine - but that's digging deep into the old memory murk.)
I have a(nother) question. If you're iterating through a long array with a stride larger than the cache line size, then every iteration of the loop will cause a cache miss. So it shouldn't matter whether the stride is a power of two or not; every call to a[i]++ will be a cache miss. What gives?
I think it only makes a difference if you ever want to go back to previous elements, like to iterate through the array again.
Then it helps to have it in cache.
Otherwise if you never go back (if you just iterate once), the best strategy should be to not store it in cache at all so that other things you have in cache can stay.