The first three parts of this series walked through optimizing dot products and matrix-matrix multiplication for the cache. The general guidelines for cache optimization are partitioning, proximity, load speed, and load frequency.
The calculation must be divisible into subcalculations. Both the dot product and matrix-matrix multiplication calculations can be divided. Dot product can be done in groups and matrix-matrix multiplication can be done by multiplying submatrices.
The data must be divisible into blocks. The blocks must be sized such that as many as needed fit into cache simultaneously. Both examples were readily divisible. The dot products were divided into three element blocks. The matrices were divided into 3×3 submatrices. Two three element dot product blocks fit into cache at once, as did two 3×3 submatrices.
Partitioning the data into blocks isn’t enough. When partitioned, the elements necessary for the divided calculation must all lie within the same block. Both examples had good proximity. The three element dot product blocks contained all the elements needed for that part of the dot product. The 3×3 submatrices for matrix-matrix multiplication contained all of the elements necessary for their multiplication.
Moving the data from the blocks into cache must be fast. This can be done in two ways. The data can be stored in RAM in structures designed so the memory manager automatically loads a block into cache. This requires using specialized data structures and may cause problems when moving data from one system to another.
The alternative is to copy a block’s elements from RAM into a special data structure the memory manager automatically loads into cache. Writing a high speed data copy routine is not trivial. Many languages do not have the features necessary to write one. Writing the routine in one language and calling it from another involves overhead. This overhead may slow the call down more than writing it in another language sped it up.
The dot product example had great load speed. The elements were in one-dimensional arrays. The memory manager automatically loaded the elements into cache in three element blocks.
The matrix-matrix multiplication calculation, as written, has horrible load speed. While the code may think of the data as being in submatrices, the elements in RAM are still in a giant, row-major, one-dimensional array. The data structure must be modified to store the elements in actual submatrices or special code will be needed to copy the submatrix elements from RAM into cache. Please see Implementing a High Performance Matrix-Matrix Block Multiply for one way of storing the matrix elements in submatrices.
The final piece is ensuring the blocks as a group are loaded into cache as few times as possible. This requires looking at the big picture. Minimizing loads for a part of the calculation may prevent the calculation as a whole from being as fast as possible.
The dot product calculation automatically had the lowest possible load frequency. Each three element block was only needed once and was only loaded into cache once. Unfortunately, knowing how to optimize one part of a calculation may result in suboptimal performance for the calculation as a whole.
In the first matrix-matrix multiply, using the optimized dot product routine resulted in inefficient cache use. Each block of a row of A was needed multiple times and was loaded multiple times. The second optimization changed the order in which matrix-matrix multiplication was calculated, which resulted in each block of A being loaded only once. The third optimization also did this.
The only way to approach the maximum calculation potential of a CPU is by optimizing the calculation for the cache. The high clock speeds of modern CPUs have made RAM access prohibitively expensive. Even non-SIMD calculations written without cache optimization run more than 10x slower than a cache-optimized version.
Running an inefficient calculation on many CPUs is often considered a better solution than cache optimization. While this may raise throughput, latency remains an issue and increases the cost of mistakes. Unknowingly tainting a 10 minute calculation costs 10 minutes. Unknowingly tainting just one of ten 100 minute calculations costs 100 minutes.
SIMD optimization is predicated on cache optimization. The pinnacle of CPU performance is out of reach for those who do not optimize for the cache. The difference in performance between a SIMD-optimized calculation and one not optimized for the cache can be a factor of 1000. This is the difference between taking over eight hours and taking five minutes.
Following the guidelines above may result in massive speed improvements for a code. If a programmer plans on using SIMD, cache optimization is the first step. However, cache optimization has some perils.
Before doing any cache optimization, a comprehensive test system must be available. If a tested, known-good reference routine is not available, start with the simplest implementation of the calculation. Once that is well-tested, use it as a validator. Cache optimization often involves data structures that create fencepost and/or symmetry problems. Tests must cross over partition boundaries, use non-integer multiples of block sizes, and so forth.
Partitioning schemes must take into account the nature of the data. Does the data naturally partition itself into the chosen block size? Will it continue to do so? Does partitioning happen invisibly to the end-programmer? Or do they have to code for it? All of these questions potentially limit the portability and lifespan of the code.
Cache-optimized calculations will be mathematically identical to the original calculation. But they may not have an identical number of operations. When working with floating point values, this may result in small differences between the two calculations. Additionally, care must be taken to avoid significant digit problems. If the code must match a reference calculation exactly, cache optimizations may be limited or impossible.
Intentionally picking the order of operations in a calculation allows the best optimization the programmer is capable of envisioning. Unfortunately, this increases development time and limits the optimization to what the programmer can conceptualize. Whether this level of optimization is better or worse than what the compiler would produce depends on the compiler and the programmer. Additionally, some compilers may produce better output than a programmer only when “tricked”. Learning how to “trick” the compiler takes time and these tricks may change between minor releases of the compiler. Whether it is better to optimize the code by hand or play compiler games depends on the programmer, the compiler, and the situation.
Different CPUs use different cache sizes. As of 2013, almost all x86-compatible laptop, desktop, and server CPUs use 32 KB L1 data caches. Many tablet and phone CPUs use 16 KB data caches. Some older ones use 8 KB data caches. Whether this is an issue depends on the purpose of the code.
If data partitioning is done via a custom data structure, custom support routines will be needed. Using generic support routines via indexers or mapping objects may leave the system with one amazingly fast calculation and many more extremely slow ones. Writing and testing custom support routines and converters is non-trivial.
This series deals with optimizing for the L1 data cache. When writing code destined for SIMD optimization, this may not be enough. Optimizing beyond the L1 cache is beyond the scope of this article. To get an idea of what is involved, please follow the links from Kazushige Goto’s Wikipedia entry.
Please continue on to the performance comparisons.