The following sentence is brought to you by IBM Legal: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.
My previous post covered the work needed to get MySQL 5.6.17 running reliably on modern POWER systems. The patch to MySQL 5.6.17 that’s needed is available here.
For those who don’t know, POWER8 is the latest Power Architecture processors from IBM (my employer). These chips will be available in systems from IBM in June 2014 (i.e. Real Soon Now(TM)). There’s some fairly impressive specs and numbers (see Wikipedia and elsewhere) – but what could this mean for actual applications?
Well, it turns out that MySQL is a pretty big thing in some target markets for POWER8, and inspired by Dimitri’s impressive benchmark numbers, I thought we should have a go on POWER8.
Firstly, I focused on MySQL 5.6 as it is the current stable release. MySQL 5.7 will be the subject of a future blog post.
The first step was to ensure that MySQL 5.6 worked correctly on POWER. My previous blog post covered the few bugs I ran into and filed (often with patches). This wasn’t too hard and I’m fairly confident the bug fixes are simple enough to get into MySQL 5.6 – I can’t comment on what would be/could be “officially supported”, that’s a business discussion :)
In order to ensure that my patch was not only correct but performing well, I needed a benchmark. For my initial benchmark. I chose sysbench point selects (i.e. read only key lookups), which should show the theoretical maximum queries per second you could pump through the MySQL Server as well as really stressing the mutex code, helping ensure it was not only correct, but performing well.
A simple comparison of my early patch that used heavyweight memory barriers versus Yasufumi’s patch that used more lightweight ones showed that using heavyweight barriers could be as much as a 50% performance hit – so getting this code right is important.
To add to the fun, the POWER8 processor has a few parameters you can tweak. There is the SMT mode, which dictates how many threads per core there are. This can be changed at runtime. You can be in SMT=off, SMT=2, SMT=4 or SMT=8. Typically, only some workloads can benefit from SMT8 rather than SMT4. There is also DSCR, which is data prefetching. For sysbench point selects, I’ve found we do slightly better (around 10%) when DSCR is set to 1 rather than zero – but YMMV on other benchmarks.
In my experiments, I’ve found that SMT4 or SMT8 seems to be the best bang for buck for MySQL workloads on POWER8. With SMT=2 rather than off, I’ve seen a ~50% performance boost in sysbench point select results. With SMT=4 I’ve seen another 50% boost (i.e. roughly double SMT=off performance). The benefit of SMT8 for MySQL 5.6 (and the 5.6 part is crucial here) may be minimal, especially for this benchmark. This is mostly due to hitting heavy mutex contention inside the MySQL server rather than anything else.
POWER8 systems come in either single or dual socket, with the number of cores being a total of 4, 6, 8, 10, 12, 16, 20 or 24 depending on configuration of the system (go check IBM web site for specifics of what’s available in what model). This means with SMT8, a dual socket, 24 core POWER8 system has 192 hardware threads – the system I was using for these benchmarks.With this number of cores and hardware threads, those familiar with MySQL on multi core systems may already have an inkling that using the full capacity of such a system may be hard for MySQL.
Certainly for old versions of MySQL (such as 5.0 or 5.1) you’re going to get nowhere near full system utilization on POWER8. For MySQL 5.6 (and in the future, 5.7) you have a much better hope.
Before anyone asks, yes, I used jemalloc for most of my benchmarks and it helps by giving a single digit percent performance increase (around 3-4%).
The bottlenecks inside MySQL 5.6 for sysbench point select workload are fairly well documented, so at best we may be striving to equal the performance of other CPU architectures rather than get too much higher simply due to hitting mutex contention in creating read views inside InnoDB. So the maximum performance will be a function of individual core CPU speed and the speed at which a lock can be acquired (i.e. related to how quick you can bounce a cacheline with a lock between cores).
This is exactly what I found on POWER8 with MySQL 5.6 – you hit the same bottleneck on POWER8 as you do everywhere else – creating read views in InnoDB.
That being said, my maximum sysbench point select results on POWER8 was 344kQPS. This not only matches but exceeds the previous record holder by quite a decent amount.
This number was across 8 tables with mysqld bound to a single NUMA node (6 cores) and sysbench bound to another NUMA node (6 cores) on the same socket. For this benchmark, due to the mutex contention, bringing the second socket into play didn’t improve performance. For other benchmarks, (e.g. standard sysbench read only) it seems to scale with more CPU cores much better (no doubt the subject of a future blog post).
Single table sysbench point select was also impressive at 335kQPS – you only got an additional 10kQPS by going to 8 tables! All of these results were with SMT4 and DSCR=1, which seems to be the best configuration for this type of workload.
Up next: MySQL 5.7 on POWER8.