Practically every programmer, and most other computer users as well, knows that the CPU cache is that small but incredibly fast memory inside or right next to the CPU, which keeps copies of frequently accessed data so that the CPU does not have to reach out to the slower main memory as often. Cache is good – the more cache, the better! But how does CPU cache actually work? And what are its effects on different ways of programming?
For many programmers, the CPU cache is magic: It somehow just works and makes our machines faster. We know that its effectiveness depends on our choice of algorithms and data structures, but we typically do not really understand the criteria involved. So we simply code along and hope for the best, maybe doing some trial-and-error optimization if our benchmark results look lousy. That, at least, is how I have dealt with this problem so far. After reading a bit about the workings of CPU cache on Wikipedia, I decided to take a closer look and found several good resources on the Web.
For a programmer who likes to experiment, Igor Ostrovsky’s “Gallery of Processor Cache Effects” is possibly the most entertaining entry into this subject. Ostrovsky starts off with a simple experiment: Two loops iterate over the same large array and perform some trivial manipulation (multiplication by 3) on the array elements. The only difference between the two loops is that in one case all array elements are manipulated, while in the other case only every 16th element is manipulated and the rest are simply skipped over. In theory, that second loop should run about 16 times faster, right? Well, on most modern machines (I just confirmed this on my MacBook), the two loops take almost exactly the same amount of time. The reason, of course, is that in both cases the same amount of memory is swapped in and out of the CPU cache – and memory access is the clear performance bottleneck for this problem.
Next, Ostrovsky modifies his example by varying the step size in the second loop and graphs the results, which show a sharp performance drop-off beyond a step size of 16. This corresponds to an in-memory step size of 64 bytes (he’s looping over 32-bit integers) – exactly the length of a cache line on most x86-family processors. I knew about cache lines before, but never imagined that such a simple experiment could so clearly demonstrate their effect and even be used to accurately determine the cache line length. Further experiments in the article determine the sizes of the L1 and L2 caches, demonstrate the presence of instruction-level parallelism, and explore the concepts of cache associativity and false cache line sharing.
Did this whet your appetite to learn more about CPU cache effects? Well, I found another resource that dives much deeper into this subject. But be warned: this one is heavy stuff. Ulrich Drepper of Red Hat Inc. has written an excellent paper on “What Every Programmer Should Know About Memory”. It covers many memory-related topics, but mostly focuses on CPU cache. At 114 pages, I consider this paper more of a textbook. No, I have not yet read it all… But I will definitely keep that bookmark handy to answer all of my upcoming computer-memory-related questions. While you are checking out this paper, also take a look at the other papers and articles on Drepper’s home page – he has written some really useful stuff.