Reputation: 21
I have a multi-core system with 4 cores, each with private L1 and L2 caches, plus a shared LLC. The caches are inclusive, meaning that each higher-level cache is a superset of the lower-level caches. Can I flush a block directly from the LLC, or does the flush have to go through the lower levels first?
I am trying to understand the Flush+Reload and Flush+Flush cache side-channel attacks.
Upvotes: 2
Views: 1254
Reputation: 3101
It can't be true that CLFLUSH always evicts from every cache level. I just wrote a little program (C++17) in which flushing a cache line always takes less than 5 ns on my machine (3990X):
#include <iostream>
#include <chrono>
#include <cstring>
#include <cstdint>
#include <cstdlib>
#include <cctype>
#include <limits>
#include <string>
#include <vector>
#include <charconv>
#include <sstream>
#include <cmath>
#if defined(_MSC_VER)
    #include <intrin.h>
#elif defined(__GNUC__)
    #include <x86intrin.h>
#endif

using namespace std;
using namespace chrono;

size_t parseSize( char const *str );
string blockSizeStr( size_t blockSize );

int main( int argc, char **argv )
{
    static size_t const DEFAULT_MAX_BLOCK_SIZE = (size_t)512 * 1024;
    size_t blockSize = argc < 2 ? DEFAULT_MAX_BLOCK_SIZE : parseSize( argv[1] );
    // parseSize returns (size_t)-1 on error
    if( blockSize == (size_t)-1 )
        return EXIT_FAILURE;
    blockSize = blockSize >= 4096 ? blockSize : 4096;
    vector<char> block( blockSize );
    size_t size = 4096;
    static size_t const ITERATIONS_64K = 100;
    do
    {
        uint64_t avg = 0;
        size = size <= blockSize ? size : blockSize;
        // scale the iteration count so that small sizes get more repetitions
        size_t iterations = (size_t)((double)0x10000 / size * ITERATIONS_64K + 0.5);
        iterations += (size_t)!iterations;
        for( size_t it = 0; it != iterations; ++it )
        {
            // make sure the cache lines really are modified by
            // writing a different value in each iteration
            for( size_t i = 0; i != size; ++i )
                block[i] = (i + it) % 0x100;
            // time only the loop that issues one clflush per 64-byte line
            auto start = high_resolution_clock::now();
            for( char *p = &*block.begin(), *end = p + size; p < end; p += 64 )
                _mm_clflush( p );
            avg += duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
        }
        double nsPerCl = ((double)(int64_t)avg / iterations) / (double)(ptrdiff_t)(size / 64);
        cout << blockSizeStr( size ) << " " << nsPerCl << "ns" << endl;
    } while( (size *= 2) <= blockSize );
}

size_t parseSize( char const *str )
{
    double dSize;
    from_chars_result fcr = from_chars( str, str + strlen( str ), dSize, chars_format::general );
    if( fcr.ec != errc() )
        return -1;
    if( !*(str = fcr.ptr) || str[1] )
        return -1;
    static const
    struct suffix_t
    {
        char suffix;
        size_t mult;
    } suffixes[]
    {
        { 'k', 1024 },
        { 'm', (size_t)1024 * 1024 },
        { 'g', (size_t)1024 * 1024 * 1024 }
    };
    char cSuf = tolower( *str );
    for( suffix_t const &suf : suffixes )
        if( suf.suffix == cSuf )
        {
            dSize = trunc( dSize * (ptrdiff_t)suf.mult );
            if( dSize < 1.0 || dSize >= (double)numeric_limits<ptrdiff_t>::max() )
                return -1;
            return (ptrdiff_t)dSize;
        }
    return -1;
}

string blockSizeStr( size_t blockSize )
{
    ostringstream oss;
    double dSize = (double)(ptrdiff_t)blockSize;
    if( dSize < 1024.0 )
        oss << blockSize;
    else if( dSize < 1024.0 * 1024.0 )
        oss << dSize / 1024.0 << "kB";
    else if( blockSize < (size_t)1024 * 1024 * 1024 )
        oss << dSize / (1024.0 * 1024.0) << "MB";
    else
        oss << (double)blockSize / (1024.0 * 1024.0 * 1024.0) << "GB";
    return oss.str();
}
There's no DDR memory of any generation that can handle writing back a single cache line in under 5 ns.
Upvotes: 0
Reputation: 364428
clflush is architecturally required / guaranteed to evict the line from all levels of cache, making it useful for committing data to non-volatile DIMMs (e.g. battery-backed DRAM or 3D XPoint).
The wording in the manual seems pretty clear:
Invalidates from every level of the cache hierarchy in the cache coherence domain ... If that cache line contains modified data at any level of the cache hierarchy, that data is written back to memory
I think if multiple cores have a line in Shared state, clflush / clflushopt on one core has to evict it from the private caches of all cores. (This would happen anyway as part of evicting from an inclusive L3 cache, but Skylake-X changed to a NINE (not-inclusive not-exclusive) L3 cache.)
Can I directly flush a block on the LLC or does it have to go through the lower level first?
It's not clear what you're asking. Are you asking whether you can tell the CPU to flush a block from L3 only, without disturbing L1/L2? You already know L3 is inclusive on most Intel CPUs, so the net effect would be the same as clflush. For cores to talk to L3, they have to go through their own L1d and L2.
clflush still works if the data is only present in L3 but not in the private L1d or L2 of the core executing it. It's not a "hint" like a prefetch, or a local-only thing.
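If you want to check this yourself, here is a minimal sketch of the timing probe that Flush+Reload is built around: time a load while the line is cached, flush it with clflush, then time the load again. It assumes GCC or Clang on x86-64 with an invariant TSC; timed_load, probe_once and ProbeResult are names I made up for illustration, not part of any library.

```cpp
#include <cstdint>
#include <x86intrin.h>   // _mm_clflush, _mm_mfence, _mm_lfence, __rdtsc, __rdtscp (GCC/Clang)

// Hypothetical helper: time a single load of *p in TSC reference cycles.
static inline uint64_t timed_load(const volatile char *p)
{
    unsigned aux;
    _mm_mfence();                  // order earlier stores and clflush before the measurement
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    _mm_lfence();                  // keep the load from starting before the first timestamp
    char v = *p;                   // the load being timed (volatile, so it is not optimized away)
    uint64_t t1 = __rdtscp(&aux);  // rdtscp waits for earlier instructions, including the load
    _mm_lfence();
    (void)v;
    return t1 - t0;
}

// Hypothetical result type: one cached and one post-flush measurement.
struct ProbeResult
{
    uint64_t cached;
    uint64_t flushed;
    char valueAfterFlush;
};

ProbeResult probe_once()
{
    alignas(64) static char line[64];
    line[0] = 42;                        // the line is now modified in this core's cache
    (void)timed_load(line);              // warm-up access
    uint64_t hot = timed_load(line);     // expected to hit in cache
    _mm_clflush(line);                   // write back and invalidate at every cache level
    uint64_t cold = timed_load(line);    // expected to miss all the way to memory
    return { hot, cold, line[0] };       // the stored value survives because it was written back
}
```

Call probe_once() in a loop and compare the two timings. A single sample is noisy (interrupts, frequency changes, prefetchers), so take the minimum or median over many runs before drawing conclusions; the absolute numbers depend heavily on the machine.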
In future Silvermont-family CPUs, there will be a cldemote instruction that lets you flush a block to the LLC, but not all the way to DRAM. (And it's only a hint, so it doesn't force the CPU to obey it if the write-back path is busy with evictions to make room for demand-loads.)
Upvotes: 3