1 million SQL Queries Per Second: MySQL 5.7 on POWER8

I’ve previously covered MySQL 5.6 on POWER (with patch), MySQL 5.6 Performance on POWER8 (spoiler: new performance record) and MySQL 5.7 on POWER.

Of course, The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions. Also, these numbers should be considered preliminary, but trust me – I did get them and it’s not April 1st.

From my last post, you saw that with my preliminary patch for MySQL 5.7 to work on POWER, we could easily match the previous record for sysbench point select queries per second (i.e. key lookups). In fact, we could exceed the published record by a little bit which is kind of nice. At around 630kQPS, one could be rather happy.

But we still had 30-40% idle CPU on POWER8. This led me to file the following bug report:

  • Bug 72829: LOCK_grant is major contention point, leaves 30-40% idle CPU.

What’s going on is that there’s a rwlock in the MySQL Server that ensures that writers don’t collide with readers to the data structures describing the GRANTs (i.e. who has access to what). If you run a GRANT statement, it gets a writer lock, and nobody can read (i.e. check permissions) while everything is being updated. If you run a normal SQL statement, you get a read lock (non-exclusive) and can check permissions appropriately.

It’s been known for a long time that LOCK_grant was a bottleneck. Typically, some people have run with skip-grant-tables to help shorten the time the lock as held (as in MySQL you still take the mutex even though you’ve started the server with skip-grant-tables).

In Drizzle, we fixed that – moving authentication and authorization completely behind plugin APIs and if you didn’t load plugins for them, you executed near enough to zero instructions that it didn’t matter.

In my experiments, enabling skip-grant-tables actually hurt performance rather than helped. More investigation is needed, but it seems that simply the act of acquiring and releasing the rdlock is now a major bottleneck in some benchmarks (such as sysbench point select).

It turns out that this is a well known problem in other pieces of software (e.g. Linux kernel) and is pretty much what RCU (Read Copy Update) is best at. As far back as 2006 I remember attempting to get my head around RCU so that one day we could use it in MySQL or MySQL Cluster.

Another simpler method is simply splitting the mutex, with readers able to acquire any one of N mutexes and writers needing to acquire them all. This penalizes writers, but unless you’re executing a lot of GRANTs, you’re probably safe.

So… what is the theoretical maximum performance if this bottleneck went away?

I wrote a quick patch that just commented out the rdlock acquisition of LOCK_grant in the hot codepath of sysbench point selects. I wasn’t running GRANT statements at runtime so this was “safe”.

This patch is not production ready, it’s merely useful for demonstrating where we could be with MySQL 5.7 on POWER8 if one last bottleneck is fixed.

My results? Slightly over ONE MILLION QUERIES PER SECOND!

This is roughly twice the previous record.

This is with a dual socket 24 core POWER8 with SMT8 and DSCR=1 on 8 tables with sysbench 0.4.8. Sysbench itself is using a non-trivial amount of CPU and I could probably decently beat this number if I rewrote sysbench using the nonblocking API in libdrizzle (back when me made the Drizzle performance regression tests use a libdrizzle-ified sysbench we got double digit percentage improvement in our sysbench numbers).

There’s still around 7-10% idle CPU time… so there’s more room to grow.

Lacking a physical gauntlet to throw down, I’ll just have to submit a conference paper somewhere so that I can do that in person.

I really hope that we’re able to fix this bottleneck in MySQL 5.7 so that MySQL 5.7 will ship being able to do over a million queries per second. From SQL.

MySQL 5.7 on POWER

In a previous post, I covered porting MySQL 5.6 to POWER and subsequently, some new record performance numbers with MySQL 5.6.17 on POWER8.

Well, those following at home will be aware that not only is the next sentence sponsored by IBM Legal, but that MySQL 5.7 alleviates a bunch of the mutex contention that we saw with MySQL 5.6. The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

In looking at MySQL performance on POWER, it’s inevitable that I should look at MySQL 5.7 and what’s coming up in the next stable release of MySQL.

Surprisingly, a bunch of the core code in InnoDB and MySQL dealing with mutexes has changed in MySQL 5.7 when compared to MySQL 5.6. Enough that I actually had to post a few bug reports about the changes that apply to any CPU architecture:

  • Bug 72805: mutex_delay() creating excess memory traffic, GCC mem barrier needed
    • This is now more generic mutex code, so it’s even more important to get it right. There’s a bunch of tricks that have been learned in other places (e.g. Linux kernel) in getting these things right. We need to get them right in MySQL too.
    • One of these tricks is in ensuring that the compiler doesn’t compile down spinloops to nothing.
  • Bug 72806: mutex_delay() missing x86 pause instruction optimization
    • This is actually a regression over 5.6.
    • On x86, there is an instruction (PAUSE) that tells the CPU that you’re in a spin loop and that it should yield resources in the CPU core to other threads (or thread, as HT CPUs only have 2 threads per core).
    • We have a different way of doing things on POWER, and I’ve got a patch for that too.
    • What’s interesting is reading the Intel CPU manual about the PAUSE instruction and how even if you went and benchmarked it, it depends on the CPU on if this is a NO-OP or not.
    • I suspect that with this bug fixed, performance on Hyper Threaded Intel systems will improve.
  • Bug 72807: Set thread priority in my_pthread_fastmutex_lock
    • This is the POWER equivalent of the x86 PAUSE instruction.
    • I’ve found this patch to have a quite decent positive impact on sysbench point select performance.

There were also the bugs I mentioned in my MySQL 5.6 on POWER blog post. Notably, I had to port Yasufumi’s memory barrier patch from 5.6 to 5.7. My port is incomplete (I can still crash mysqld without too much trying) but I’ve deemed it currently “good enough for benchmarking” and it’s attached to bug 47213 (I hope to spend some time fixing it up soon too). I don’t think I’m missing anything that’s going to have a major performance impact – so while not suitable for production use, it’s good enough to poke some benchmarks at.

So… I’m close to the point where I’ll share my patch for MySQL 5.7, but I’m really wanting to solve the last couple of issues before doing so. The majority of patches are attached to bug reports and get 99% of the way.

Amazingly enough, MySQL 5.7 works fairly well on POWER “out of the box”, and with sysbench point selects, I could quite easily get 320kQPS on a 24 core POWER8 with SMT8 mode without changing a single line of code or doing anything special. This alone is an impressive result when compared to the previous record on both POWER and other CPU architectures with MySQL 5.6 that had been optimized for POWER (while out-of-the-box MySQL 5.7 has not)
For my benchmarks, I’m doing the same procedure, workload and basic my.cnf settings that Dimitri has used and written about, so I won’t repeat that here.With my preliminary patch for MySQL 5.7.4-m14 to have it work well on POWER, on the same system I was using for my MySQL 5.6 benchmarks, I could easily match and indeed exceed the previous published maximum sysbench point select results (I got ~630kQPS). Consider this number a bit preliminary as my patch isn’t completely solid, but it does mean that we’re in the right ballpark for MySQL 5.7 performance, which is great news!So, you might just say “Mission Accomplished” and be done with it. Well… there was one issue:  with the maximum numbers I was getting there was still 30-40% idle CPU on the POWER 8 machine.Now… you could just use that idle 30-40% of total CPU to do other things (solving Sudoku in SQL for example) but that’s no fun.

MySQL 5.6 Performance on POWER8

The following sentence is brought to you by IBM Legal: The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

My previous post covered the work needed to get MySQL 5.6.17 running reliably on modern POWER systems. The patch to MySQL 5.6.17 that’s needed is available here.

For those who don’t know, POWER8 is the latest Power Architecture processors from IBM (my employer). These chips will be available in systems from IBM in June 2014 (i.e. Real Soon Now(TM)). There’s some fairly impressive specs and numbers (see Wikipedia and elsewhere) – but what could this mean for actual applications?

Well, it turns out that MySQL is a pretty big thing in some target markets for POWER8, and inspired by Dimitri’s impressive benchmark numbers, I thought we should have a go on POWER8.

Firstly, I focused on MySQL 5.6 as it is the current stable release. MySQL 5.7 will be the subject of a future blog post.

The first step was to ensure that MySQL 5.6 worked correctly on POWER. My previous blog post covered the few bugs I ran into and filed (often  with patches). This wasn’t too hard and I’m fairly confident the bug fixes are simple enough to get into MySQL 5.6 – I can’t comment on what would be/could be “officially supported”, that’s a business discussion :)

In order to ensure that my patch was not only correct but performing well, I needed a benchmark. For my initial benchmark. I chose sysbench point selects (i.e. read only key lookups), which should show the theoretical maximum queries per second you could pump through the MySQL Server as well as really stressing the mutex code, helping ensure it was not only correct, but performing well.

A simple comparison of my early patch that used heavyweight memory barriers versus Yasufumi’s patch that used more lightweight ones showed that using heavyweight barriers could be as much as a 50% performance hit – so getting this code right is important.

To add to the fun, the POWER8 processor has a few parameters you can tweak. There is the SMT mode, which dictates how many threads per core there are. This can be changed at runtime. You can be in SMT=off, SMT=2, SMT=4 or SMT=8. Typically, only some workloads can benefit from SMT8 rather than SMT4. There is also DSCR, which is data prefetching. For sysbench point selects, I’ve found we do slightly better (around 10%) when DSCR is set to 1 rather than zero – but YMMV on other benchmarks.

In my experiments, I’ve found that SMT4 or SMT8 seems to be the best bang for buck for MySQL workloads on POWER8. With SMT=2 rather than off, I’ve seen a ~50% performance boost in sysbench point select results. With SMT=4 I’ve seen another 50% boost (i.e. roughly double SMT=off performance). The benefit of SMT8 for MySQL 5.6 (and the 5.6 part is crucial here) may be minimal, especially for this benchmark. This is mostly due to hitting heavy mutex contention inside the MySQL server rather than anything else.

POWER8 systems come in either single or dual socket, with the number of cores being a total of 4, 6, 8, 10, 12, 16, 20 or 24 depending on configuration of the system (go check IBM web site for specifics of what’s available in what model). This means with SMT8, a dual socket, 24 core POWER8 system has 192 hardware threads – the system I was using for these benchmarks.With this number of cores and hardware threads, those familiar with MySQL on multi core systems may already have an inkling that using the full capacity of such a system may be hard for MySQL.

Certainly for old versions of MySQL (such as 5.0 or 5.1) you’re going to get nowhere near full system utilization on POWER8. For MySQL 5.6 (and in the future, 5.7) you have a much better hope.

Before anyone asks, yes, I used jemalloc for most of my benchmarks and it helps by giving a single digit percent performance increase (around 3-4%).

The bottlenecks inside MySQL 5.6 for sysbench point select workload are fairly well documented, so at best we may be striving to equal the performance of other CPU architectures rather than get too much higher simply due to hitting mutex contention in creating read views inside InnoDB. So the maximum performance will be a function of individual core CPU speed and the speed at which a lock can be acquired (i.e. related to how quick you can bounce a cacheline with a lock between cores).

This is exactly what I found on POWER8 with MySQL 5.6 – you hit the same bottleneck on POWER8 as you do everywhere else – creating read views in InnoDB.

That being said, my maximum sysbench point select results on POWER8 was 344kQPS. This not only matches but exceeds the previous record holder by quite a decent amount.

This number was across 8 tables with mysqld bound to a single NUMA node (6 cores) and sysbench bound to another NUMA node (6 cores) on the same socket. For this benchmark, due to the mutex contention, bringing the second socket into play didn’t improve performance. For other benchmarks, (e.g. standard sysbench read only) it seems to scale with more CPU cores much better (no doubt the subject of a future blog post).

Single table sysbench point select was also impressive at 335kQPS – you only got an additional 10kQPS by going to 8 tables! All of these results were with SMT4 and DSCR=1, which seems to be the best configuration for this type of workload.

Up next: MySQL 5.7 on POWER8.

MySQL 5.6 on POWER (patch available)

The following sentence is brought to you by IBM Legal. The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.

Okay, now that is out of the way….

If you’re the kind of person who follows the MySQL bugs database closely or subscribes to the MySQL Internals mailing list, you may have worked out that I’ve spent a small amount of time poking at MySQL on modern POWER systems.

Unlike Intel CPUs, POWER CPUs require explicit memory barriers to synchronize memory state between different CPUs. This means that when you’re implementing synchronization primitives, you have one extra thing to get right.

Luckily, if you use straight pthread mutexes, this is already taken care of. Unluckily, there are some optimizations in MySQL that don’t use straight pthread mutexes and so may be problematic on non-Intel CPUs. A few of these issues have sneaked into MySQL over the past few years. The most problematic area was around the optimized mutexes in InnoDB (you can use the pthread_mutex fallback code, but it’s less performant).

Luckily, I both knew where to look and there are good asserts throughout InnoDB code to help spot any other areas that I may not have initially thought of to look at. Coding defensively with a good amount of asserts is a good thing.

After not too much work, I have a set of patches that I’m fairly confident is correct and performs near as well as possible. Initially, I had a different patch that used heavyweight memory barriers in a lot of places, but big kudos to Yasufumi for posting a better patch than mine to bug 47213 – using the lighter weight barriers gives a decent performance boost.

One of the key patches is in the InnoDB mutex code to change the thread priority – i.e. a POWER equivalent to the x86 pause instruction. These are hints to the CPU that the thread being executed is in a spinloop and CPU resources should be allocated to other threads to make betterr forward progress.

After dragging Anton in to have a look and a think, this code may have motivated him to have a go at getting kernel support for adaptive mutexes, thus removing the need for this spin/sleep/yield/eep loop in InnoDB (at least on Linux).

So… I’ve spent the appropriate time filing bugs in the MySQL bug tracker for the things I’ve found. Feel free to track them yourself, they are:

  • Bug 72715: character set code endianness dependent on CPU type rather than endianness of CP
    • I don’t think this is an issue for us… or it could be that this is actually just incredibly untested code in the MySQL Server. It’s also not POWER specific, although was caught by the Migration Assistant which is part of the Advanced Toolchain from IBM.
  • Bug 72718: CACHE_LINE_SIZE in innodb should be 128 on POWER
    • I contributed a patch that’s a simple #ifdef for CPU type. Those who care about other CPU architectures should chime in with the correct value for them.
    • There’s other places in InnoDB where there’s some padding that don’t use this define, I need to file a bug for that.
  • Bug 72754: Set thread priority in InnoDB mutex spinloop
    • This makes a big difference when you have mutex contention and SMT (Symmetric Multi-Threading) enabled (on POWER, you can dynamically change SMT levels at runtime).
    • I’ve contributed a preliminary patch that isn’t generic. I should go and fix that.
  • Bug 72755: InnoDB mutex spin loop is missing GCC barrier
    • This also applies to x86 (and indeed all platforms). If GCC gets a bit smarter, the current code could compile down to nothing, which is exactly what you don’t want from a spinloop. The correct thing to do is to have a GCC memory barrier (not CPU one) to ensure that the compiler doesn’t optimize away the spinning.
    • I’ve contributed a patch, may need #ifdef GCC added.
  • Bug 72809: InnoDB Linux native aio setup missing barrier after setup
    • This appears to be a “POWER8 is fast” related bug :)
    • Patch contributed.
  • Bug 72811: Set NUMA mempolicy for optimum mysqld performance
    • Not POWER specific.
    • I’ve contributed a patch that sets NUMA memory allocation policy inside mysqld rather than having to run “numactl” manually
  • Bug 47213: InnoDB mutex/rw_lock should be conscious about memory ordering other than Intel
    • Originally filed by Yasufumi back in 2009.
    • Some good discussion going on here to ensure the patch is correct. This is the kind of patch that requires more review  than it takes to write it.
    • This patch would fix the majority of problems for non-Intel CPU architectures.
    • Thanks to Yasufumi for providing an updated patch, it helped a lot!
  • Bug 72544: Incorrect locking for global_query_id
    • I found a bug. Rather benign and not POWER specific.

Want to run MySQL 5.6.17 on POWER? Get my MySQL 5.6.17 patch here: https://flamingspork.com/mysql/mysql-5.6.17-POWER.patch

My accumulation of 5.6 patches seems fairly reliable. I’d test before putting into production, and I’d certainly love to know any problems you hit.

Get the quilt series of patches here: https://flamingspork.com/mysql/mysql-5.6.17-POWER-patches.tar.gz

I have, of course, done the legal wrangling for the Oracle Contributor Agreement (remarkably painless) and am working on making the patches completely acceptable to be merged into MySQL.

Awesome MySQL 5.7 improvements

Recently, I’ve had reason to poke at MySQL performance on some pretty cool hardware. Comparing MySQL 5.6 to MySQL 5.7 is a pretty interesting thing to do when you have many CPU cores.

The improvements to creating read views in InnoDB is absolutely huge for small statements with large concurrency – MySQL 5.7 completely removes this as a bottleneck – as much as doubling maximum SQL queries per second, which is a pretty impressive improvement.

I haven’t poked at the similar improvements in Percona Server on this hardware setup – so I can only really guess as to the performance characteristics of it… If comparing to older MySQL versions, Percona Server 5.5 is likely to outperform MySQL 5.5 thanks to this optimization.

But I have to say… MySQL 5.7 is impressive in its concurrency improvements.

Efficiently writing to a log file from multiple threads

There’s a pattern I keep seeing in threaded programs (or indeed multiple processes) writing to a common log file. This is more of an antipattern than a pattern, and is often found in code that has existed for years.

Basically, it’s having a mutex to control concurrent writing to the log file. This is something you completely do not need.

The write system call takes care of it all for you. All you have to do is construct a buffer with your log entry in it (in C, malloc a char[] or have one per thread, in C++ std::string may do), open the log file with O_APPEND and then make a single write() syscall with the log entry.

This works for just about all situations you care about. If doing multi megabyte writes (a single log entry with multiple megabytes? ouch) then you may get into trouble on some systems and get partial writes (IIRC it may have been MacOS X and 8MB) and O_APPEND isn’t exactly awesome on NFS.

But, if what you’re wanting to do is implement something like a general query log, a slow query log or something like that, then you probably want to use this trick rather than, say, taking a pthread_mutex lock while you do malloc(), snprintf() and write(2).

When refactoring parts of Drizzle, we found this done the wrong way in a whole bunch of places in the MySQL server, largely explaining why things like the slow query log and general query log were such a huge drain on database server performance.

It’d be neat to see someone fix that.

Caring about stack usage

It may not be surprising that there’s been a few projects over the years that I’ve worked on where we’ve had to care about stack usage (to varying degrees).

For threaded userspace applications (e.g. MySQL, Drizzle) you get a certain amount of stack per thread – and you really don’t want to bust that. For a great many years now, there’s been both a configuration parameter in MySQL to set how much stack each thread (connection) gets as well as various checks in the source code to ensure there’s enough free stack to do a particular operation (IIRC open_table is the most hairy one of this in MySQL).

For the Linux Kernel, stack usage is a relatively (in)famous problem… although by now just about every real problem has been fixed and merely mentioning it is probably just the influence of the odd grey beard hairs I’m pretending not to notice.

In a current project I’m working on, it’s also something we have to care about.

It turns out that GCC has a few nice things to help you prevent unbounded stack usage or runaway stack usage. There’s two warnings you can enable.

There’s -Wstack-usage=len which will throw warnings on unbounded stack usage (e.g. array on stack sized based on an argument to the function), where stack usage is greater than len and when stack usage may exceed len.

There’s also -Wframe-larger-than=len which is based on calculation for a particular stack frame, as opposed to -Wstack-usage=len, which could be based on several stack frames.

Odds are, you may get some warnings in your project if you set this to what you would consider “conservative” values. Now, if this is every going to explode at runtime is something that’s left as an exercise for the reader, but enabling these warnings is pretty easy and a simple way to help find and prevent some issues.

After all, having your software explode for running off the end of the stack is just a tad embarrassing.

More interesting things in the POWER8/OpenPOWER world

There’s been a couple of interesting things published about things I’ve been working on/with recently.

The OpenPOWER Foundation did a press release on initial white box server design for POWER8. With the hardware, there’s a software stack: firmware and operating system stack developed by IBM, Google and Canonical. Basically, there’s going to be 96 threads per socket with 230GB/sec of memory bandwidth in your cloud, yo.

Obviously, there’s a bunch of stuff that can be talked about more in the future, post release of all these things. But it’s pretty cool to be able to see some information heading out the door in the lead up to POWER8 systems being available.

Ghosts of MySQL Past, part 11: Why are you happy about this?

This is part 11 in what’s shaping up to be the best part of a 6 week series (Part 1, 2, 3, 4, 5, 6, 7, 7.1, 8, 8.1, 9 and 10) on various history bits of MySQL, somewhat following my LCA2014 talk (video here).

One of my favorite MySQL stories is the introduction of the Random Query Generator – or randgen. While I’ve talked for a while now about the quality problems that were plaguing MySQL, there were positive steps going on. The introduction of continuous regression testing with pushbuild, increasing test case coverage and a focus on bug fixing were all helping. However… a database server is a complex beast that can accept arbitrary queries and transactions and at the very least is expected not to segfault. How on earth do you test it beyond simple manual testing (or automated manual testing: running the same set of static queries)?

Well, at some point in development, David Axmark (one of the founders of MySQL AB) decided that if they were going to be a real database, they should have a test suite – so he wrote one. It was comprised of two parts: a small binary named mysqltest which basically read a file of SQL, ran it against the server and then displayed the results (and also had a couple of small bits of language such as variables and loops) and the other half was a shell script that started a mysqld and ran mysqltest for each .test file in the mysql-test/t/ directory and diffed the output against the mysql-test/r/*.result files.

This system still exists today, although the shell script has been replaced by a perl script which has been replaced with mysql-test-run.pl version 2 which dealt with replication, starting MySQL Cluster daemons, running tests in parallel and dealt with Windows a lot better. The mysqltest binary is largely unchanged however, and so the now MUCH larger suite of SQL based tests would be familiar a dozen years ago, when the mysql-test suite was in infancy.

But how do you ensure you’re testing enough of the server? Well… you can do code coverage analysis and write more tests manually, but wouldn’t it be much better if you could automatically generate tests? Wouldn’t it be great if you could automatically find a set of queries that caused the server to crash? Or a set of queries where the results differed between versions or other database servers (which would strongly indicate that somebody was buggy)?

Well, it turns out that Microsoft had a similar problem back in the 1990s with Microsoft SQL Server and a system called RAGS was used in the development of Microsoft SQL Server 7.0 (released in 1998).

So, fast forward about ten years and a relatively new hire of MySQL AB, Philip Stoev was working on something called “RQG” or Random Query Generator for MySQL AB. The basic idea was to be able to supply a grammar of what type of queries you would like it to generate, point it at a MySQL server and see what happens.

It turns out that what happened was that after a while, the MySQL server would segfault. To a room of MySQL Server developers, this was a pretty amazing advancement in testing the server – an automated way to find new bugs!

Throwing the Random Query Generator at the in-progress MySQL 6.0 optimizer caused a new innovation: automatic stack trace deduplication and test case minimization. RQG would automatically pair down the list of generated queries to the minimal set required to crash the server, so that you could import that set of queries into the existing mysqltest based regression test suite.

The dataset that RQG would work against was rather fixed, it was about 10 already prepared tables with some easily generated data. In MySQL Cluster, we had a set of C++ classes that ran a multithreaded application that would use integer columns as a checksum for the whole row, so updates would also update the checksum and at the end of your test (which may cause node failures in various scenarios) you can check that each row is intact and hasn’t been corrupted. So, IIRC it may have been me (or Jonas, I honestly cannot remember) who asked if such a feature would be added to RQG and the reply was “when I can no longer crash the server in 4 or 5 queries… I’ll think about it”.

So, developers expressed happiness, and Philip expressed this: “Why are you happy about this? We are about 10 years behind Microsoft in QA, why are you possibly happy about this?”

So with this great new tool, the true state of the current server version, the in-development server version (6.0) and all the optimizer enhancements it was bringing, the new Falcon storage engine and the in progress Maria (now Aria) storage engine would be exposed.

The news was not good.

(although, in the long run, the news was really good, and it’s because of RQG that we now have a really solid MySQL Server)

Further reading:

The joy of Unicode

So, back in late 2008, rather soon after we got to start working on Drizzle full time, someone discovered unicodesnowmanforyou.com, or:

☃

Since we had decided that Drizzle was going to be UTF-8 everywhere,(after seeing for years how hard it was for people to get character sets correct in MySQL) we soon added ☃.test to the tree, which tried a few interesting things:

CREATE TABLE ☃; CREATE DATABASE ☃; etc etc

Because what better to show off UTF-8 than using odd Unicode characters for table names, database names and file names. Well… it turns out we were all good except if you attempted to check out the source tree on Solaris. It was some combination of Python, Bazaar and Solaris that meant you just got python stacktraces and no source tree. So, if you look now it’s actually snowman.test and has been since the end of 2008, because Solaris 10.

A little while later, I was talking to Anthony Baxter at OSDC in Sydney and he mentioned Unicode above 2^16 in UTF-8…. so, we had clef.test (we’d learned since ☃ and we were not going to tall it 𝄢.test).

Fast forward a few years to, well, this week, and I was talking to Jeremy Kerr about petitboot and telling the tail of snowman.test. So out came the crazy Unicode characters:

  • U+1F4A9 PILE OF POO 💩
  • U+1F435 MONKEY FACE 🐵
  • U+1F431 CAT FACE 🐱
  • U+1F602 FACE WITH TEARS OF JOY 😂
  • U+1F639 CAT FACE WITH TEARS OF JOY 😹

But guess what, there is no MONKEY FACE WITH TEARS OF JOY! I know, this is just unacceptable – TEARS OF JOY should be a modifier, because you may need U+1F6B9 MENS SYMBOL 🚹 with a TEARS OF JOY modifier at some point in your life.

Anyway, another place with tests involving odd Unicode characters is good for everyone, but still lacking if you need to boot an Operating System that’s MONKEY FACE WITH TEARS OF JOY.

Ghosts of MySQL past, part 10

At the end of 2007, the first alpha of MySQL 6.0 was released. Alpha is the key word here, 5.0 was in decent shape by this stage, 5.1 was not ready and there were at least rumors of a MySQL 6.1 tree. I’ll go into MySQL 6.0 later, but since you’re no doubt not running it, you can probably guess the end of that story.

The big news of 2008 was that at the last MySQL AB all company meeting, the first and last one to occur while I was with the company, in January 2008 in Orlando, Florida, it was announced that Sun Microsystems was going to acquire MySQL AB for the sum of ONE BILLION DOLLARS. Unfortunately, there were no Dr Evil costumes at the event. However, much to the amusement of the Sun employees who were present at the announcement, there was Vodka.

So, while the storage engine project that was to be the savior of and the future of the company, Falcon, had the same name as a Swedish beer (that I’ve actually not seen outside Sweden), the code name for the Sun acquisition was Heineken, and there were a few going around in celebration after the acquisition was announced.

IMG_20140310_111449I cannot remember where I got this badge (or button if you speak American), but they were going around for a little while, and it was certainly around the Sun era, possibly either at a User Conference or around the Sun acquisition.

Sun was very interested in Falcon. With the new T1000 SPARC processor, with eleventy billion threads over several cores and squarely aimed at internet/web companies, it was also thought there could be some.. err.. synergy there.

So what happened with the Sun acquisition?

Well, we soon discovered that although Sun made a lot of noise as to its commitment to free and open source software, its employment contract, IP agreements and processes were sorely not what we expected. There is, in fact, probably a good book in the process of MySQL AB engineers joining Sun, but the end result ended up being rather good. We changed Sun and made it a whole lot easier for anybody inside Sun to contribute to open source projects.

There was some great people inside Sun that we were able to work with, yet there was also a good dose of WTF every so often as well. There was a great “benchmark” that showed what kind of performance you could get out of a T2000 with MySQL once – you just had to run 32 instances of mysqld for it to scale!

One thing Sun had worked out how to do was release software that worked relatively well. Solaris was known for many things: Slowaris, ancient desktop environment, a userspace more suited to 1994 than 2004 but one thing it was not known for was awful quality of releases.

Solaris required code to work before it was merged into the main branch and if code wasn’t ready to be released, it wouldn’t be merged, missing the train. While this is something that would seem obvious today, it was kind of foreign to MySQL – and it would take a while for MySQL to change its development process.

The biggest problem with Sun was that it was not making money. If you want to know what was most wrong with the company, talk to one of the MySQL sales people who went across to Sun. It was basically near impossible to actually sell something to someone. Even once you had someone saying “I want to give you money for these goods and/or services”, actually getting it through the process Sun had was insane. At the same time I ran a little experiment: how easy was it to buy support for Red Hat Enterprise Linux versus Solaris… you could guess which one had a “buy” button on redhat.com and which one was impossible to find on sun.com.

We (of course) got extra tools, resources and motivation to have MySQL work well on Solaris, especially OpenSolaris which was going to be the next big thing in Solaris land.

I should note, the acquisition did not make millionaires out of all the developers. It did make some VCs some money, as well as executives – but if you look around all the developers at MySQL AB, not a single one has subsequently called in rich.

Ghosts of MySQL Past, Part 9: BEST. Team. Name. EVER.

(This is part 9 in a series, part 8 is here – because reverse chronological order totally makes sense here)

So, back around 2007, somebody noticed that an awful lot of the downloads of MySQL and associated utilities from mysql.com were for Windows. Of course, it’s then immediately pointed out that the vast majority of Linux users will not be heading to mysql.com to download MySQL, instead using the packages from their distribution.

However, the number of people working on MySQL who had ever even attempted to compile the MySQL server on Windows when given the number of mysql users on Windows was… well…. rather embarrassing. This is very common with free and open source software – especially historically.

If you look back 10 years, Linux on the desktop was “next year will be the year of Linux on the desktop”, you know, just like how it is now. Except that while today, everything “just works” on modern Linux distros and it is Windows that is an absolute arse to get all the drivers for your hardware going, it was not that long ago that it was the other way around. Also, back then it was much more common for it to be hard to get FOSS into companies… because of a variety of reasons that weren’t valid, but were something we had to spend time explaining back then.

I’m trying to remember the MySQL developers at the time who I knew attempted to do all their MySQL Server development on Windows…. I can think of Reggie along with Vlad and Iggy (both of who joined later). It was quite rightly pointed out that the MySQL experience on Windows was very much one of “UNIX application ported to Windows” rather than “Windows server application, that also comes in a UNIX version.”

MySQL basically did absolutely nothing the way you’d expect a piece of Windows server software to do it. So, a team was put together. Well, more of a task force.

The Windows Task Force (yes, WTF) was born. It is also the best team name in the history of team names and gives me a good chuckle to this day.

Many things came out of this, including (IIRC) an Installer that we actually could find the source code for and didn’t require Delphi to build, the ability to build the server on Windows from a bzr source tree (and not need Linux/UNIX around to build some parts of the server) and making the server behave a bit more like what you’d expect if you were a Windows administrator. There were also a bunch of limits lifted due to the way that MySQL was ported to Windows, which I won’t go into here.

Ghosts of MySQL past, part 8.1: Five Years

With many apologies to David Bowie, come 2009 it was my 5 year anniversary with Sun (well, MySQL AB and then Sun). Companies tend to like to talk about how they like to retain employees and that many employees stay with the company for a long time. It is very, very expensive to hire the right people, so this is largely a good plan.

In 2009, it was five years since I joined MySQL AB – something I didn’t quite originally expect (let’s face it, in the modern tech industry you’re always surprised when you’re somewhere for more than a few years).

The whole process of having been with sun for 5 years seemed rather impersonal… it largely felt like a form letter automatically sent out with a certificate and small badge (below). You also got to go onto a web site and choose from a variety of gifts. So, what did I choose? The one thing that would be useful at Burning Man: a camelbak.

5 year badge for Sun Microsystems

5 year badge for Sun Microsystems

Ghosts of MySQL Past, Part 8: The First Fork.

This is the 8th installment in the rather long series that started with Part 1 about a month ago.

Back in 2006, we were in the situation where MySQL 5.0 had taken forever, and the first “GA” release was not suitable for production. Looking towards MySQL 5.1, it was also unlikely to be out any time soon. The MySQL Cluster team had customers that needed new features in a stable release. The majority of users didn’t use the MySQL server at all, they directly used the C++ NDB API for the vast majority of queries – so the vast majority of release blocker bugs in the MySQL server would not affect the production readiness of MySQL Cluster for these customers.

So, the decision was wisely made to do separate releases from a separate tree for MySQL Cluster. This was named MySQL Cluster: Carrier Grade Edition and exists to this day.

The main use case for MySQL Cluster at this time was running telephone networks, specifically the Home Location Registry databases of GSM phone networks. Basically, you need to keep a database of which tower each subscriber is associated with so when you go to make a phone call (or SMS) the network can properly route the call. This means there’s some realtime response requirements and hardcore availabilty which demands a special type of database.

NDB has a long history (some of it detailed in a previous post), but for those kind of interested in internals, I’ll quote Frazer Clement (now a long time MySQL Cluster developer although I completely forget which year he joined the team, which is just slightly embarrassing):

…Erlang and Ndb Cluster share some Plex heritage, which can still be seen in their architectures today. Since Plex, Erlang has mated with Prolog, and Ndb Cluster was involved in a car crash with C++.

The customers for MySQL Cluster were not the ones who bought the $5000 support option… Typically, paying for the addition of a relatively major feature was considered quite okay and even “normal”. After all, the effort that goes into constructing the entire software stack of a large cell phone network is rather large, and MySQL Cluster would end up being a very small part of that.

Many major features went into MySQL Cluster first, sometimes years before they made the general MySQL Server: row based replication, circular replication with conflict resolution and online DDL (add/drop index and column). In fact, it kind of incredibly frustrates me that we solved online add column in NDB so many years ago and you still can’t add a column to an InnoDB table without some serious planning.

The key thing about MySQL Cluster releases? They happened. They were also regular, addressing customer issues and new major versions brought features that worked for the use cases of those who needed them.

Interestingly enough, you may recognize the person who ran the MySQL Cluster team as the same person who has overseen the now regular release cycle of the MySQL Server itself (and has now been in the MySQL world for over ten years).

Ghosts of MySQL Past, Part 7.1: Hey look, I found an old business card

Digging through piles of old stuff in the house, I came across a nice little artifact of MySQL history, an old business card. One benefit of a small company is that you can tend to get your first name @company.com, which of course, I had.

IMG_20140310_111500I wonder which other MySQL AB employees still have a business card laying around somewhere?

Ghosts of MySQL Past, Part 7: PBXT

Recently, I’ve been writing based on my linux.conf.au 2014 talk, which you can watch the recording of. Also see Part 1, Part 2, Part 3, Part 4, Part 5 and Part 6. My feed feel off Planet MySQL for a bit so you may have missed those posts – so feel free to go on a trip down memory lane before returning here for, well, more of a trip down memory lane.

At the start of 2005, Paul McCullagh started working on a new transactional storage engine for MySQL. He announced it in early 2006, so the majority of initial development was done before Oracle bought InnoBase Oy.

It’s at this point I should correct myself from my Part 5 where I was talking about when Maria started – it was actually at the start of 2005 when the project started, about 10 months before InnoDB Friday. I think this was mostly small scale work at first, before months later having a larger focus on it.

But anyway, let’s talk about PBXT! It had a different architecture than InnoDB and thus could have an interesting set of performance characteristics. The shortest way to describe the original architecture is “kind of log based”. Rows are written to a log file and there’s a handle file that points to where the rows are in the log.

Also, the design had both a fixed size and a variable sized part of the row. The fixed size part was stored alongside the handle in the record data file, so it was incredibly quick at getting the fixed size part of the row.

Not having to write data twice and having really quick access to some columns gave PBXT a quite decent performance advantage for many work loads.

The BLOB Streaming feature that was also worked on (originally just for PBXT and then for any engine) was quite ahead of its time. Think HandlerSocket but think of it done more correctly. HandlerSocket is an awful binary protocol that supports very limited operations and uses two TCP ports (one for reads, one for writes). The blob streaming used HTTP, something that anyone could use and manipulate and proxy and whatever you’d want.

If we look at Paul’s Top 5 Wishes for storage engines, an engine test suite and sane APIs (or at least documented) would clearly have helped. Having to do API tracing to discover when you should start a transaction is probably not the best way to attract developers – the amount of time that could have been saved given better interfaces in the server was immense (and not just for PBXT developers).

So, why aren’t we all using PBXT now? Well… if MySQL had shipped with PBXT, it would possibly have been a different story. There was (and indeed is) a very uneven playing field for MySQL storage engines. If you’re in the main MySQL tree then you’re everywhere. If not, then you’ve got a heck of a lot of work on packaging and keeping up to date with various API changes (not to mention ABI, which since we’re talking C++ and since we’re talking MySQL, changed a lot). With InnoDB being “good enough” for many and actually being forced to improve due to challengers such as PBXT (as well as being in the tree) it just continued to have the majority of the users… and it’s not easy making money as a storage engine vendor – and developing a storage engine does cost money.

With no more than a couple of people working on it at any one time, it’s amazing that PBXT had such a big influence – and we can likely credit many InnoDB performance improvement to PBXT being so much better in some areas.

An interesting side story: I was sending some build fixes/complier warning fixes and other minimal patches to Paul when MySQL belonged to Sun and he mentioned that perhaps it was time he had a contributor agreement. I pointed out the Sun contributor agreement as an example of one that could be used, and maybe he could just replace “Sun Microsystems” with “PrimeBase” and I could submit that to the Sun system – at the very least, it would be amusing. Paul said that sounded like a great idea and I submitted Sun’s contributor agreement (which it expected outside contributors to agree to) to Sun for them to agree to.

Well… a few weeks later I got a response, and Sun wasn’t exactly keen on it – until I pointed out a few times that it was in fact their agreement and then finally it was all okay.

If your company wants people to agree to a contributor agreement: see if they’d agree to it if it was for someone else. It was an interesting exercise.

Further reading:

Ghosts of MySQL Past, Part 6: The engine revs

This week I’ve been writing based on my linux.conf.au 2014 talk, which you can watch the recording of. Also see Part 1, Part 2, Part 3, Part 4 and Part 5. My feed feel off Planet MySQL for a bit so you may have missed those posts.

Netfrastructure was many other things apart from just a database engine, but the one part that was interesting to MySQL was the storage engine. One interesting thing about Netfrastructure was the inbuilt Java VM, which meant you could do all sorts of neat tricks right near the data and with Jim and Ann coming from much more feature complete databases, some parts of MySQL were going to be a little bit of a shock.

There was still work to be done on the Falcon engine itself, it wasn’t completely finished and ready to just drop in, even though you may have heard things like “to be completed later this year”.

There was also another experiment to see if MySQL could get a storage engine that wasn’t owned by somebody else: resurrect Gemini. Remember, it was GPL licensed now, and it could conceivably be updated to the latest MySQL versions. You can go and see the code for the Amira project on launchpad. Ultimately, this went nowhere, but it could have been an interesting story if this project got more resources. Ultimately, it was Falcon that got all the attention and effort.

This was going to be the ultimate triumph of the Pluggable Storage Engine Architecture, something that had been touted in presentations and marketing material for years. What really happened? Well, it took NDB (now MySQL Cluster) somewhere between two and four years before you could really say it was stable and free of many “obvious” bugs/unintended deviations in behavior. Would Falcon be any different?

In reality, the Falcon project unearthed all of the warts of the storage engine interface. There were many exchanges of “Is it really like that?”, “Yes, yes it is, sorry.”

The APIs (and I use the term loosely) between a storage engine and the database server were originally used to plug in ISAM and then MyISAM, and the first transactional store (InnoDB) required a bunch of hacks to work and the API was never cleaned up. When the third transactional engine came along (MySQL Cluster) – which was also a distributed engine – there were again no real fixes in the API, just more of the same work-arounds and oddness. So by the time Falcon came along, if you were lucky the API was merely just not what you’d expect. There were many, many situations where not only was the official documentation inaccurate, but you’d have to commit several gross layering violations just to get something rather fundamental to a transactional database working.

Arguably the worst thing was foreign keys, a requirement for the Falcon project. You had to implement your own SQL parser to get any information on foreign keys. Let’s not even get started on auto_increment – something which the correct behavior is pretty much completely undocumented.

Next, we’ll look at PBXT.

Ghosts of MySQL Past Part 5: The Era of Acquisitions

This week I’ve been writing based on my linux.conf.au 2014 talk, which you can watch the recording of.

Also see Part 1, Part 2, Part 3 and Part 4. My feed feel off Planet MySQL for a bit so you may have missed those posts.

Now we head into the era of acquisitions… there have been a few in MySQL history, and in 2005 came the second (the first was MySQL AB acquiring Alzato for NDB). In what was to be known as “InnoDB Friday”, the makers of InnoDB – Innobase Oy – was acquired by Oracle. That very same month….

MySQL 5.0 GA. The first GA release of MySQL 5.0 is infamous. It was nowhere near ready and everybody who tried to use 5.0 in the early GA days has a story about something obvious that was broken. Basically, the majority of the new features simply didn’t work. It took many point releases before people would consider 5.0 ready.

The real measure of 5.0 quality was that it took MySQL AB over a year before we started to use it for our support database.

At the end of 2005, the Maria project was started: a project to create a transactional storage engine. This should not be confused with MariaDB, which would come years later. This is Maria, now called Aria. The basic idea was to fork MyISAM and work on adding features. In hindsight, it’s easy to see that when you have a quality problem with your main product, you should probably not take a bunch of senior engineers and have them work on a different project. IIRC there was some initial estimate of a GA by the end of 2007. It’s now eight years since the project started and there’s still no stable release.

There were other efforts to get a transactional storage engine not owned by Oracle, and in 2006 MySQL AB acquired Netfrastructure and along with it Jim Starkey and Ann Harrison came to work for MySQL AB.

Originally named JSTAR, this would become known as Falcon (probably something to do with the Swedish beer by the same name).

Ghosts of MySQL Past: Part 4, A million features for Enterprise

Continuing on from Part 3….

SAP is all about Enterprises and as such, used all the Enterprise features of databases. This is a much different application than every user of MySQL so far (which were often web applications, and increasingly being used at scale). This was “MySQL focuses on Enterprise”.

This is 2003: before VIEWS, before stored procedures, before triggers, before cursors, before prepared statements, before precision math, before SQL access to logs, before any real deadlock info, before you could use a / in a table name, before any performance statistics, before full unicode and before partitioning, row level binary logs, fractional seconds in temporal types and before EXPLAIN for INSERT/UPDATE/DELETE (which we still don’t have). There was a long way to go before MySQL would fit with SAP. A long, long way.

But how were we going to run our cell phone networks on MySQL? There was an answer to that as MySQL AB acquired an Ericsson spinoff, Alzato for it’s clustered database server, NDB, which would become MySQL Cluster. This is separate from replication and originally designed for GSM HLR databases. Early versions had two types of availability: five nines (99.999%) or zero nines (0.000%) – it either worked or exploded. Taking a database engine that wasn’t fully mature and integrating it with MySQL is not a task for the faint of heart. It wouldn’t be the last time this was attempted and it would prove to be the most successful.

In October of 2004, MySQL 4.1 went GA – the first version with MySQL Cluster and incidentally, just before I joined MySQL AB on the MySQL Cluster team.

By this time, MySQL AB had hired just about all outside contributors, as, well, what better job interview than a patch? The struggle to have a good relationship with outside contributors started a long time ago, it’s certainly nothing new.

2005 was the year of MySQL 5.0 instability. At the time, 5.0 was the development version and it was plagued by instability. It was really hard for development teams (such as the SAP team) to work on deliverables when every time they’d rebase their work on a new 5.0 trunk version, the server would explode in new and interesting ways.

In fact, the situation was so bad that Continuous Integration was re-invented. Basically, it was feature++; stability–. The fact that at the same time even less stable development trees existed did not bode well.