Mind the cache (a nod to "mind the gap" on the London Underground)
Read and write performance may be identical as far as the physical memory itself is concerned, but from a program's perspective the two situations are very different, and CPU manufacturers have designed distinct hardware to manage each.
When reading from memory, your program normally needs the data as fast as possible and will often stall the CPU until the read completes. The cache logic is designed to improve overall memory speed by optimizing access to the current working set (a subset of total memory). Intel (and more recently ARM) have designed smart cache logic that can predict read patterns and prefetch memory into the cache in anticipation of it being needed.
When writing to memory, there normally isn't any hurry to complete the write, so the write buffer can collect a few write requests without stalling the CPU. The data works its way through the write buffer and cache logic, and the CPU only stalls when a write occurs while the write buffer is full. As the program author, the question to ask yourself is: does the data need to be written through the cache for immediate use? This is an important factor in selecting the best instructions to use. The default behavior is to write the memory "through" the cache. This means that memory which isn't already resident in the cache has to be read first (a cache line at a time), modified, and then written back to memory while being kept identical (coherent) in the cache. Intel's engineers were smart enough to see that the written data may not be needed until much later, and that writing it through the cache may pollute the cache (evict data that's more important to you). To handle the case where you would rather have the data written straight to memory, skipping the cache, Intel added "non-temporal" write instructions. The Intel SIMD intrinsic for the 'normal' write method:
_mm_storeu_si128
The non-temporal version:
_mm_stream_si128
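To make the difference concrete, here's a minimal sketch (my own illustration, not code from gcc_perf) that writes one 16-byte value both ways; the copy_two_blocks name is mine, and the _mm_sfence() reflects the fact that streaming stores are weakly ordered:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

void copy_two_blocks(uint8_t *dst, const uint8_t *src)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);

    /* 'Normal' store: goes through the cache and stays resident there. */
    _mm_storeu_si128((__m128i *)dst, v);

    /* Non-temporal store: bypasses the cache; requires a 16-byte-aligned address. */
    _mm_stream_si128((__m128i *)(dst + 16), v);

    /* Streaming stores are weakly ordered, so fence before other code relies on the data. */
    _mm_sfence();
}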
So far, I haven't touched on the issue of pointer alignment. Intel has baked in extra logic to handle the case where a read or write straddles more than one cache line. The penalty isn't terrible, but it's best to keep your accesses aligned so that each read or write touches only one cache line.
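One common way to guarantee that is to round the destination pointer up to the next cache-line boundary and handle the leading bytes separately. A quick sketch (the 64-byte line size and the align_up name are my assumptions, not anything from the test code):

#include <stdint.h>

#define CACHE_LINE 64  /* typical cache line size on modern x86 */

/* Round a pointer up to the next cache-line boundary. */
static uint8_t *align_up(uint8_t *p)
{
    uintptr_t a = (uintptr_t)p;
    return (uint8_t *)((a + (CACHE_LINE - 1)) & ~(uintptr_t)(CACHE_LINE - 1));
}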
So how does this work out in the real world? I've added a new test to my gcc_perf project to see the performance impact of three different situations (a sketch of the write loops follows the list):
1) Writing to an unaligned address (occasionally straddles 2 cache lines)
2) Writing to an aligned address (never straddles 2 cache lines)
3) Writing non-temporally to an aligned address
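Roughly, the inner loop in each case looks like the following sketch (a simplification I'm assuming here, with a fill128 helper name of my own; the actual gcc_perf test may differ in its details):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Fill 'len' bytes (a multiple of 16) at 'dst' with 'value'.
   If use_stream is nonzero, use non-temporal stores; dst must then be 16-byte aligned. */
static void fill128(uint8_t *dst, size_t len, uint8_t value, int use_stream)
{
    __m128i v = _mm_set1_epi8((char)value);
    for (size_t i = 0; i < len; i += 16) {
        if (use_stream)
            _mm_stream_si128((__m128i *)(dst + i), v);  /* bypass the cache */
        else
            _mm_storeu_si128((__m128i *)(dst + i), v);  /* write through the cache */
    }
    if (use_stream)
        _mm_sfence();  /* make the streaming stores globally visible */
}

Case 1 would call the normal-store path with a deliberately misaligned pointer (e.g. dst + 1), case 2 with a 16-byte-aligned pointer, and case 3 takes the streaming path with an aligned pointer.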
I tested a simple memset() implementation with a large buffer (16MB) to get accurate measurements. These are the average numbers I got on my MacBook Pro 15 (2.7GHz Core i7, 2016 model):
1) Writing to an unaligned address: 263ms
2) Writing to an aligned address: 244ms
3) Writing non-temporally: 218ms
The speed difference isn't dramatic, but it can be worth it if performance is critical to your application.