Yesterday, I got the basics going for MySQL Cluster on POWER. Today, I finished up a couple more patches to improve performance and ran some benchmarks.
This is on a 3.7GHz POWER8 machine with non-balanced memory (only 2 of the 4 NUMA nodes have memory, so we have less total memory bandwidth than we could have, plus I’m going to bind ndbmtd to the CPUs in these NUMA nodes).
With a setup of a single replica and two data nodes on the one machine (each bound to a specific NUMA node), running the flexAsync benchmark on MySQL Cluster 7.3.7, I could get around:
- 3.2 million reads/sec
- 2.6 million deletes/sec
- 2.4 million updates/sec
- 2.4 million inserts/sec
So, that’s at least in the right ballpark for a first go.
(I’m running this on a big endian host, on some random kernel I booted on the box, with MySQL Cluster built with gcc 4.8 and whatever build options the MySQL Cluster cmake foo chooses by default.)
Looking at the numbers, I am pretty sure you can go much further with some simple configuration tuning, particularly using ThreadConfig.
I’ve pretty much stuck with threads=20 for data nodes and just taken the defaults, with 5 cores of a 10 core socket (it’s a dual chip module, so there are actually two NUMA nodes per socket) and SMT=4 (POWER8 can have 1 to 8 threads per core, set dynamically at run time). This seems to be better than other combinations I’ve tried (I haven’t been exhaustive though). I’m getting about 16 of the 20 CPU threads being used in that configuration. But as I mentioned, this machine doesn’t have balanced memory, which also means we don’t have as high memory bandwidth as we could have… so I suspect there’s a long way to go yet.
I assume by threads=20 you mean that you set MaxNoOfExecutionThreads=20. This means you have 8 LDMs, 5 TC threads, 2 Send threads and 3 receive threads.
Some double checks: to actually get the 8 LDMs to do real work you also need to set NoOfFragmentLogParts=8.
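As a quick sanity check on that breakdown, here is a minimal Python sketch that tallies the thread counts quoted above. The four counts come straight from the comment; the assumption (not stated in this thread) is that the remaining two threads are one main thread and one rep thread.

```python
# Thread breakdown quoted above for MaxNoOfExecutionThreads=20.
quoted = {"ldm": 8, "tc": 5, "send": 2, "recv": 3}

# Assumption, not from the thread: one main and one rep thread
# account for the remainder of the 20.
assumed = {"main": 1, "rep": 1}


def total_threads(*groups):
    """Sum the thread counts across any number of {type: count} dicts."""
    return sum(sum(group.values()) for group in groups)


print(total_threads(quoted))           # 18
print(total_threads(quoted, assumed))  # 20
```

If that assumption holds, the quoted breakdown adds up to exactly the 20 threads configured.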
My personal experience is that one should try to use as little hyper threading on the LDM threads as possible. 2 threads per core usually gives a bit extra, but I wouldn’t expect any major benefit going higher than that for LDM threads. Other thread types usually do better with more hyper threading (or whatever the term in POWER is :))
I usually also use ThreadConfig with cpubind to be able to use top so that I can see which thread is the bottleneck. I usually try to get the LDM threads to become the bottleneck.
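To make the cpubind idea concrete, here is a small Python sketch that renders a ThreadConfig value for config.ini from a thread layout. The `thread_config` helper and the CPU numbers are hypothetical placeholders, not the layout used on this machine; the real CPU IDs depend on how the kernel numbers the hardware threads in each NUMA node.

```python
def thread_config(layout):
    """Render a ThreadConfig value from (thread_type, count, cpus) tuples.

    A count of None leaves count= out, as is usual for main and rep.
    """
    parts = []
    for thread_type, count, cpus in layout:
        fields = []
        if count is not None:
            fields.append("count=%d" % count)
        fields.append("cpubind=" + ",".join(str(cpu) for cpu in cpus))
        parts.append(thread_type + "={" + ",".join(fields) + "}")
    return ",".join(parts)


# Hypothetical layout: CPU IDs 0-19 are placeholders for illustration only.
layout = [
    ("ldm", 8, range(0, 8)),
    ("tc", 5, range(8, 13)),
    ("send", 2, range(13, 15)),
    ("recv", 3, range(15, 18)),
    ("main", None, [18]),
    ("rep", None, [19]),
]

print("ThreadConfig=" + thread_config(layout))
```

With each thread type pinned to known CPUs, the per-CPU view in top shows directly which thread type is saturated.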
Also, how do the results compare to those of:
1) MySQL running on the same hardware
2) MySQL Cluster running on a “comparable” Xeon machine? (btw what would be a comparable Xeon to this POWER8 chip?)
Compared to MySQL on the same hardware? MySQL Cluster easily does about an order of magnitude more key reads/second.
Compared to Xeon? I’ll let others run that benchmark :)
MaxNoOfExecutionThreads=20, yes. Going either side of that didn’t make any (positive) difference. I’m pretty sure it was NoOfFragmentLogParts that ndbmtd was complaining about not being set to 8 when I started testing. I’ll have to test “properly” at some point rather than just hacking mtr configuration files for some preliminary numbers :)
The general rule seems to be that SMT up to SMT4 “always” helps while SMT8 can be less helpful… but I think the idea of doing some pinning and seeing what’s going on could be good… I’m not convinced that I’m not just out of memory bandwidth currently though.
For benchmarking with MySQL Cluster I recommend using the dbf-0.37.50 trees that I made available on dev.mysql.com. This can be used both for small configs and configs with 100s of servers.