Fp64 props

1/29/2024

I recently did a bunch of tests to see what the "ultimate bottlenecks" are for basic web applications. Think latency to the database and AES256 throughput.

Local latency to SQL Server from ASP.NET is about 150 μs, or about 6,000 synchronous queries per second, max. Even with SR-IOV and Mellanox adapters, that rises to 250 μs if a physical network hop is involved. Typical networks have a latency floor around 500-600 μs, and it's not uncommon to see 1.3 ms VM-to-VM. Now we're down to 800 queries per second!

Similarly, older CPUs struggle to exceed 250 MB/s/core (2 Gbps) for AES256, which is the fundamental limit to HTTPS throughput for a single client. Newer CPUs, e.g. AMD EPYC or any recent Intel Xeon, can do about 1 GB/s/core, but I haven't seen any CPUs that significantly exceed that. If you have a high-spec cloud VM with 40 or 50 Gbps NICs, there is no way a single HTTPS stream can saturate that link. You have to parallelise somehow to get the full throughput (or drop encryption).

HTTPS accelerators such as F5 BIG-IP or Citrix ADC (NetScaler) are actually HTTPS decelerators for individual users, because even hardware models with SSL offload cards can't keep up with the 1 GB/s from a modern CPU. Their SSL cards are designed for improving the aggregate bandwidth of hundreds of simultaneous streams, and don't do well at all for a single stream, or even a couple of concurrent streams. This matters when "end-to-end encryption" is mandated, because back-end connections are often pooled. So you end up with N users being muxed onto just one back-end connection, which then becomes the bottleneck.

The test was to run "SELECT 1" using the low-level ADO.NET database query API in a tight loop. This is the relevant metric, as it represents the performance ceiling: it doesn't matter how fast the packets can get on the wire if the application can't utilise this because of some other bottleneck. Of course, the underlying TCP latency is significantly lower. Using Microsoft's "Latte.exe" testing tool, I saw ~50 μs in Azure with "Accelerated Networking" enabled. As far as I know, they use Mellanox adapters.

Something I found curious is that no matter what I did, the local latency wouldn't go below about 125 μs. Neither shared memory nor named pipes had any benefit. This is on a 4 GHz computer, so in practice this is the "ultimate latency limit" for SQL Server, unless Intel and AMD start up the megahertz war again. It would be an interesting exercise comparing the various database engines to see what their latency overheads are, and what their response time is to trivial queries such as selecting a single row given a key. Unfortunately, due to the DeWitt clauses in EULAs, this would be risky to publish.

High capacity doesn't equal low latency, and scalability to many users doesn't necessarily help one user get better performance! Very often, you'll see n-tier applications where for some reason (typically load-balancers) the requests are muxed into a single TCP stream. In the past, this improved efficiency by eliminating the per-connection overhead. Now, with high core counts and high bandwidths, some parallelism is absolutely required to even begin to approach the performance ceiling.
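To make the "SELECT 1" test above concrete, here is a minimal sketch of that kind of loop. It assumes a local SQL Server instance and the Microsoft.Data.SqlClient package; the connection string and iteration count are illustrative, not taken from the original tests.

```csharp
// Minimal sketch of the "SELECT 1" round-trip test described above.
// Assumes a local SQL Server instance and the Microsoft.Data.SqlClient package;
// connection string and iteration count are illustrative.
using System;
using System.Diagnostics;
using Microsoft.Data.SqlClient;

class LatencyProbe
{
    static void Main()
    {
        const int iterations = 10_000;
        using var connection = new SqlConnection(
            "Server=localhost;Database=master;Integrated Security=true;TrustServerCertificate=true");
        connection.Open();

        using var command = new SqlCommand("SELECT 1", connection);
        command.ExecuteScalar(); // warm up the connection and plan cache

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            command.ExecuteScalar(); // one synchronous round trip per call
        }
        stopwatch.Stop();

        double microsecondsPerQuery = stopwatch.Elapsed.TotalMilliseconds * 1000.0 / iterations;
        double queriesPerSecond = iterations / stopwatch.Elapsed.TotalSeconds;
        Console.WriteLine($"{microsecondsPerQuery:F1} us per query, ~{queriesPerSecond:F0} queries/s");
    }
}
```

At ~150 μs per round trip a loop like this tops out around 6,000-6,500 queries per second on a single thread; at 1.3 ms per round trip it drops to roughly 770, which is where the "down to 800 queries per second" figure comes from.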
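For the AES256 side, a rough single-thread throughput probe looks like the sketch below, using .NET's built-in AesGcm (the .NET 8 constructor overload that takes a tag size); buffer size and repeat count are arbitrary choices. Raw AES-GCM on one core is an upper bound: a real HTTPS stream also pays for TLS record framing, hashing, and copies, so the per-client ceiling described above is lower than what this loop reports.

```csharp
// Rough single-thread AES-256-GCM throughput probe.
// Buffer size and repeat count are arbitrary; nonce reuse is only acceptable
// because this is a throwaway benchmark, never do it with real data.
using System;
using System.Diagnostics;
using System.Security.Cryptography;

class AesThroughputProbe
{
    static void Main()
    {
        byte[] key = RandomNumberGenerator.GetBytes(32);   // AES-256 key
        byte[] nonce = RandomNumberGenerator.GetBytes(12);
        byte[] plaintext = new byte[1 << 20];               // 1 MiB per call
        byte[] ciphertext = new byte[plaintext.Length];
        byte[] tag = new byte[16];

        using var aes = new AesGcm(key, 16); // .NET 8 overload; older runtimes take just the key

        const int repeats = 2000;                           // ~2 GiB total
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < repeats; i++)
        {
            aes.Encrypt(nonce, plaintext, ciphertext, tag);
        }
        stopwatch.Stop();

        double gibibytesPerSecond =
            (double)plaintext.Length * repeats / (1 << 30) / stopwatch.Elapsed.TotalSeconds;
        Console.WriteLine($"~{gibibytesPerSecond:F2} GiB/s on one thread");
    }
}
```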
The 12-core POWER10 will have 120MB of L3 cache and 1000 GB/s main-memory bandwidth with 96 threads. That's unified L3 cache by the way, none of the "16MB/CCX with remote off-chip L3 caches being slower than DDR4 reads" that EPYC has. Intel Xeon Platinum 8180 only has 38.5MB L3 across 28 cores / 56 threads. With 6 memory controllers at 2666 MHz (21 GB/s per stick), that's 126 GB/s bandwidth. AMD EPYC really only has 16MB L3 per CCX (because the other L3 cache is "remote" and slower than DDR4). With 8 memory controllers at 2666 MHz, we're at 170 GB/s bandwidth on 64 threads.

If we're talking about the "best CPU for an in-memory database", it's pretty clear that POWER9 / POWER10 is the winner. You get the fewest cores (a license-cost hack) with the highest L3 and RAM bandwidths, with the most threads supported.

On the other hand, x86 has far superior single-threaded performance, far superior SIMD units, and is generally cheaper. For compute-heavy situations (raytracing, H265 encoding, etc.) the x86 is superior. But as far as being a thin processor supporting as much memory bandwidth as possible (with the lowest per-core licensing costs), POWER9 / POWER10 clearly wins. And again: those SMT8 cores are no slouch. They can handle 4 threads with very little slowdown, and the full 8 threads only has a modest slowdown. They're designed to execute many threads, instead of speeding up a single thread (which happens to be really good for databases anyway, where your CPU will spend large amounts of time waiting on RAM to respond instead of computing).
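For reference, the per-socket bandwidth figures above follow from the DDR4-2666 transfer rate over a 64-bit (8-byte) channel; the ~126 GB/s figure rounds each channel down to 21 GB/s. A quick sketch of the arithmetic:

```latex
\begin{aligned}
\text{one DDR4-2666 channel:} \quad & 2666\,\text{MT/s} \times 8\,\text{B/transfer} \approx 21.3\,\text{GB/s} \\
\text{6 channels (Xeon 8180):} \quad & 6 \times 21.3\,\text{GB/s} \approx 128\,\text{GB/s} \\
\text{8 channels (EPYC):} \quad & 8 \times 21.3\,\text{GB/s} \approx 170\,\text{GB/s}
\end{aligned}
```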