New CREATE TABLE performance record!

4 min 20 sec

So next time somebody complains about NDB taking a long time in CREATE TABLE, you’re welcome to point them to this :)

  • A single CREATE TABLE statement
  • It had ONE column
  • It was an ENUM column.
  • With 70,000 possible values.
  • It was 605kb of SQL.
  • It ran on Drizzle

This was to test if you could create an ENUM column with more than 2^16 possible values (you’re not supposed to be able to) – bug 589031 has been filed.

How does it compare to MySQL? Well… there are other problems (Bug 54194, “ENUM limit of 65535 elements isn’t true”, has been filed). Since we don’t have any limitations in Drizzle due to the FRM file format, we actually get to execute the CREATE TABLE statement.

Still, why did this take over four minutes? Luckily, I managed to run poor man’s profiler during query execution. I very easily found out that I had a thread constantly running check_duplicates_in_interval(), which does a stupid linear search for duplicates. It turns out that, for 70,000 items, this takes approximately four minutes and 19.5 seconds. Bug 589055, “CREATE TABLE with ENUM fields with large elements takes forever (where forever is defined as a bit over four minutes)”, has been filed.

So I replaced check_duplicates_in_interval() with an implementation using a hash table (boost::unordered_set actually), as I wasn’t quite immediately in the mood for ripping out all of TYPELIB from the server. I can now run the CREATE TABLE statement in less than half a second.
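
For illustration, here’s roughly what a hash-based check looks like (a sketch, not the actual patch): it assumes the server’s TYPELIB structure (count, type_names and type_lengths) and does a plain byte-wise comparison, whereas the real check has to be collation aware.

#include <string>
#include <boost/unordered_set.hpp>

/* Sketch of a duplicate check over a TYPELIB's values using a hash set,
   so the whole check is O(n) expected instead of O(n^2). */
static bool has_duplicate_enum_values(const TYPELIB *interval)
{
  boost::unordered_set<std::string> seen;

  for (unsigned int i= 0; i < interval->count; i++)
  {
    std::string value(interval->type_names[i], interval->type_lengths[i]);
    if (! seen.insert(value).second)  /* already present: duplicate */
      return true;
  }
  return false;
}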

So now, I can run my test case in much less time and indeed check for correct behaviour rather quickly.

I do have an urge to find out how big I can get a valid table definition file to be, though… it should be over 32MB…

BLOBS in the Drizzle/MySQL Storage Engine API

Another (AFAIK) undocumented part of the Storage Engine API:

We all know what a normal row looks like in the Drizzle/MySQL row format: a NULL bitmap and then the column data.

Nothing that special. It’s a fixed sized buffer, Field objects reference into it, you read out of it and write the values into your engine. However, when you get to BLOBs, we can’t use a fixed sized buffer as BLOBs may be quite large. So with BLOBs, the first part of the in-row portion is the length of the BLOB (1, 2, 3 or 4 bytes – in Drizzle it’s only 3 or 4 bytes now and soon only 4 bytes once we fix a bug that isn’t interesting to discuss here). The second part of the in-row portion is a pointer to the location in memory where the BLOB data is stored. So the in-row part of a BLOB column is just those two pieces, sitting inside the fixed sized row buffer.

The size of the pointer is (of course) platform dependent. On 32bit machines it’s 4 bytes and on 64bit machines it’s 8 bytes.
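
To make that concrete, here’s a small sketch (my own, not server code) of pulling the length and the pointer out of the in-row part of a BLOB. It assumes a little-endian host; the server has helpers for the length, but the pointer really is just copied in and out of the row unaligned.

#include <cstring>
#include <stdint.h>

/* blob_part_of_row points at the BLOB's portion of the row buffer:
   'length_bytes' bytes of length, then a raw pointer to the data. */
static void read_blob_from_row(const unsigned char *blob_part_of_row,
                               unsigned int length_bytes,  /* 1..4; 3 or 4 in Drizzle */
                               uint32_t *blob_length,
                               const unsigned char **blob_data)
{
  uint32_t length= 0;
  memcpy(&length, blob_part_of_row, length_bytes);  /* little-endian host assumed */
  *blob_length= length;

  unsigned char *data;                              /* 4 bytes on 32bit, 8 on 64bit */
  memcpy(&data, blob_part_of_row + length_bytes, sizeof(data));
  *blob_data= data;
}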

Now, if I were any other source of documentation, I’d stop right here.

But I’m not. I’m a programmer writing a Storage Engine who now has the crucial question of memory management.

When your engine is given the row from the upper layer (such as doInsertRecord()/write_row()) you don’t have to worry: for the duration of the call, the memory will be there (don’t count on it being there afterwards though, so if you’re not going to immediately splat it somewhere, make your own copy).

For reading, you are expected to provide a pointer to a location in memory that is valid until the next call to your Cursor. For example, an rnd_next() call reads a BLOB field and your engine provides a pointer. At the subsequent rnd_next() call, your engine can free that pointer (or it can do so at doStopTableScan()/rnd_end()).

HOWEVER, there is an exception: index_read_idx_map(), which in the default implementation in the Cursor (handler) base class ends up doing a doStartIndexScan(), index_read(), doEndIndexScan() sequence. This means that if a BLOB was read, the engine could have (quite rightly) freed that memory already. In this case, you must keep the memory around until either a reset() or extra(HA_EXTRA_FLUSH) call.
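
One way an engine can satisfy that (a hypothetical sketch; the member and method names here are illustrative, not part of the API) is to track every buffer it hands out for BLOB reads and only free them when the kernel signals it is done:

#include <cstdlib>
#include <vector>

class BlobTrackingCursorBits
{
  std::vector<unsigned char *> blob_buffers;

public:
  /* Allocate memory for a BLOB value being returned to the kernel and
     remember it so it stays valid after this Cursor call returns. */
  unsigned char *allocate_blob_buffer(size_t length)
  {
    unsigned char *buffer= static_cast<unsigned char *>(malloc(length));
    blob_buffers.push_back(buffer);
    return buffer;
  }

  /* Call this from reset() / extra(HA_EXTRA_FLUSH): only now is it safe
     to release the BLOB memory handed out earlier. */
  void free_blob_buffers()
  {
    for (size_t i= 0; i < blob_buffers.size(); i++)
      free(blob_buffers[i]);
    blob_buffers.clear();
  }
};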

This exception is tested (by accident) by a whole single query in type_blob.test – a monster of a query that’s about a seven way join with a group by and an order by. It would be quite possible to write a fairly functional engine and completely miss this.

Good luck.

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Using the row buffer in Drizzle (and MySQL)

Here’s another bit of the API you may need to use in your storage engine (it also seems to be rather unknown; I believe the only place where this has really been documented is ha_ndbcluster.cc), so here goes….

Drizzle (through inheritance from MySQL) has its own (in memory) row format (it could be said that it has several, but we’ll ignore that for the moment for sanity). This is used inside the server for a number of things. When writing a Storage Engine all you really need to know is that you’re expected to write these into your engine and return them from your engine.

The row buffer format itself is kind-of documented (in that it’s mentioned in the MySQL Internals documentation) but everywhere that’s ever pointed to makes the (big) assumption that you’re going to be implementing an engine that just uses a more compact variant of the in-memory row format. The notable exception is the CSV engine, which only ever cares about textual representations of data (calling val_str() on a Field is pretty simple).

The basic layout is a NULL bitmap followed by the data for each non-NULL column.

Except that the NULL bitmap is byte aligned: with four nullable columns, it would actually be padded out to one full byte.

Each column is stored in a type-specific way.
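
As a quick sketch of the arithmetic (the exact bit ordering within a byte is a server implementation detail, so treat this as illustrative):

#include <cstddef>

/* One bit per nullable column, rounded up to whole bytes:
   four nullable columns still occupy one full byte. */
static size_t null_bitmap_bytes(size_t nullable_columns)
{
  return (nullable_columns + 7) / 8;
}

/* Test the n-th nullable column's NULL bit (0-based), low bit first. */
static bool column_is_null(const unsigned char *row, size_t n)
{
  return (row[n / 8] >> (n % 8)) & 1;
}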

Each Table (an instance of an open table which a Cursor is used to iterate over parts of) has two row buffers in it: record[0] and record[1]. For the most part, the Cursor implementation for your Storage Engine only ever has to deal with record[0]. However, sometimes you may be asked to read a row into record[1], so your engine must deal with that too.

A Row (no, there’s no object for that… you just get a pointer to somewhere in memory) is made up of Fields (as in Field objects). It’s really made up of lots of things, but if you’re dealing with the row format, a row is made up of fields. The Field objects let you get the value out of a row in a number of ways. For an integer column, you can call Field::val_int() to get the value as an integer, or you can call val_str() to get it as a string (this is what the CSV engine does, just calls val_str() on each Field).

The Field objects are not part of a row in any way. They instead have a pointer to record[0] stored in them. This doesn’t help you if you need to access record[1] (because that can be passed into your Cursor methods). Although the buffer passed into various Cursor methods is usually record[0], it is not always record[0]. How do you use the Field objects to access fields in the row buffer then? The answer is the Field::move_field_offset(ptrdiff_t) method. Here is how you can use it in your code:

ptrdiff_t row_offset= buf - table->record[0];  /* offset from record[0] to the buffer we were given */
(**field).move_field_offset(row_offset);       /* point the Field into that buffer */
/* ... read or write the field value here (val_int(), val_str(), store(), ...) ... */
(**field).move_field_offset(-row_offset);      /* always move it back when you're done */
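
Putting it together, here’s a sketch of reading every column of a row that may not be in record[0] (for example the buffer passed to doInsertRecord()). It assumes the usual null-terminated table->field array and the Field::is_null()/val_str() accessors – check the signatures against your tree before copying this anywhere.

ptrdiff_t row_offset= buf - table->record[0];

for (Field **field= table->field; *field; field++)
{
  (*field)->move_field_offset(row_offset);      /* point the Field at buf */

  if (! (*field)->is_null())
  {
    char scratch[1024];
    String value(scratch, sizeof(scratch), &my_charset_bin);
    (*field)->val_str(&value);                  /* textual value, CSV-style */
    /* ... hand value.ptr() / value.length() to the engine ... */
  }

  (*field)->move_field_offset(-row_offset);     /* always restore */
}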

Yes, this API completely sucks and is very easy to misuse and abuse – especially in error handling cases. We’re currently discussing some alternatives for Drizzle.

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

The rotating blades database benchmark

(and before you ask, yes “rotating blades” comes from “become a fan”)

I’m forming the ideas here first and then we can go and implement it. Feedback is much appreciated.

Two tables.

Table one looks like this:

CREATE TABLE fan_of (
user_id BIGINT,
item_id BIGINT,
PRIMARY KEY (user_id, item_id),
INDEX (item_id)
);

That is, two columns, both 64bit integers. The primary key covers both columns (a user cannot be a fan of something more than once) and can be used to look up all things the user is a fan of. There is also an index over item_id so that you can find out which users are a fan of an item.

The second table looks like this:

CREATE TABLE fan_count (
item_id BIGINT PRIMARY KEY,
fans BIGINT
);

Both tables start empty.

You will have 1000, 2000, 4000 and 8000 concurrent clients attempting to run the queries. These concurrent clients must behave as if they could be coming from a web server. The spirit of the benchmark is to have 8000 threads (or processes) talk to the database server independently of each other.

The following set of queries will be run a total of 23,000,000 (twenty three million) times. The my_user_id below is an incrementing ID per connection, allocated by partitioning 23,000,000 evenly between all the concurrent clients (e.g. for 1000 connections each connection gets 23,000 sequential ids).

You must run the following queries.

  • How many fans are there of item 12345678 (e.g. SELECT fans FROM fan_count WHERE item_id=12345678)
  • Is my_user_id already a fan of item 12345678 (e.g. SELECT user_id FROM fan_of WHERE user_id=my_user_id AND item_id=12345678)
  • The next two queries MUST be in the same transaction:
    • my_user_id becomes a fan of item 12345678 (e.g. INSERT INTO fan_of (user_id, item_id) VALUES (my_user_id, 12345678))
    • increment count of fans (e.g. UPDATE fan_count SET fans=fans+1 WHERE item_id=12345678)

For the first query you are allowed to use a caching layer (such as memcached) but the expiry time must be 5 seconds or less.

You do not have to use SQL. You must however obey the transaction boundary above. The insert and the update must be part of the same transaction.
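
For what it’s worth, here’s roughly what one client iteration could look like using the plain MySQL C API (you’re free to use any protocol or language; this is just a sketch with error handling and the retry loop trimmed down):

#include <mysql.h>
#include <cstdio>

/* One benchmark iteration. Returns 0 on success, 1 if the transaction
   must be retried from the first query, -1 on hard error. */
static int run_iteration(MYSQL *conn, long long my_user_id)
{
  char query[256];
  MYSQL_RES *result;

  /* Query 1: how many fans (may instead be served from a cache <= 5s old) */
  if (mysql_query(conn, "SELECT fans FROM fan_count WHERE item_id=12345678"))
    return -1;
  if ((result= mysql_store_result(conn)) != NULL)
    mysql_free_result(result);

  /* Query 2: is my_user_id already a fan? */
  snprintf(query, sizeof(query),
           "SELECT user_id FROM fan_of WHERE user_id=%lld AND item_id=12345678",
           my_user_id);
  if (mysql_query(conn, query))
    return -1;
  if ((result= mysql_store_result(conn)) != NULL)
    mysql_free_result(result);

  /* Queries 3 and 4 MUST share a transaction. */
  if (mysql_query(conn, "START TRANSACTION"))
    return -1;
  snprintf(query, sizeof(query),
           "INSERT INTO fan_of (user_id, item_id) VALUES (%lld, 12345678)",
           my_user_id);
  if (mysql_query(conn, query) ||
      mysql_query(conn, "UPDATE fan_count SET fans=fans+1 WHERE item_id=12345678"))
  {
    mysql_query(conn, "ROLLBACK");
    return 1;   /* deadlock or lock wait timeout: retry from query 1 */
  }
  return mysql_query(conn, "COMMIT") ? -1 : 0;
}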

Results should include: min, avg, max response time for each query as well as the total time to execute the benchmark.

Data must be durable to a machine being switched off and must still be available with that machine switched off. If committing to local disk, you must also replicate to another machine. If running asynchronous replication, the clock does not stop until all changes have been applied on the slave, and you must also record the replication delay throughout the entire test.

In the event of timeout or deadlock in doing the insert and update part, you must go back to the first query (how many fans) and retry. Having to retry does not count towards the 23,000,000 runs.

At the end of the benchmark, the query SELECT fans FROM fan_count WHERE item_id=12345678 should return 23,000,000.

Yes, this is a very evil benchmark. It seems to be a bit indicative of the kind of peak load that can be experienced by a bunch of Web 2.0 sites that have “like” or “become a fan” style buttons. I fully expect the following:

  • Pretty much all systems will nosedive in performance after 1000 concurrent clients
  • A lot of transaction rollbacks due to deadlock detection or lock wait timeouts.
  • Many existing systems and setups failing to complete it in reasonable time.
  • A solution using Scale Stack to be an early winner (backed by MySQL or Drizzle)
  • Somebody influenced by Domas turning InnoDB deadlock detection off very quickly.
  • Somebody to call this benchmark “stupid” (that person will have a system that fails dismally at this benchmark)
  • Somebody who actually has any knowledge of modern large scale web apps to suggest improvements
  • Nobody even attempting to benchmark the Oracle database
  • Somebody submitting results with MySQL without waiting until the replication stream has finished applying.
  • Some NoSQL systems to suck considerably more than their SQL counterparts.

AlsoSQL

So there’s a bit of a swelling around the idea of NoSQL. That is, databases that don’t have an SQL interface in front of them – with the promise of better performance. With a well designed backend, this is no doubt the case.

A flexible query language is rather useful though. I think we’ll see the rise of AlsoSQL: that is, systems that present a fast and simple protocol along with an SQL interface.

This hybrid system has seen use for many years. MySQL Cluster is one such example. SQL through MySQL Server, NoSQL through NDB API.

With Drizzle, I feel we’ll be in a pretty good position to offer non-SQL based protocols and access methods to existing storage engines.

The Drizzle (and MySQL) Key tuple format

Here’s something that’s not really documented anywhere (unless you count ha_innodb.cc as a source of server documentation). You may have some idea about the MySQL/Drizzle row buffer format. This is passed around the storage engine interface: in for write_row and update_row and out for the various scan and index read methods.

If you want to see the docs for it that exist in the code, check out store_key_val_for_row in ha_innodb.cc.

However, there is another format that is passed to your engine (and that your engine is expected to understand) and for lack of a better name, I’m going to call it the key tuple format. The first place you’ll probably see this is when implementing the index_read function for a Cursor (or handler in MySQL speak).

You get two things: a pointer to the buffer and the length of the buffer. Since a key can be made up of multiple parts, some of which can be NULL and some of which can be of variable length, this buffer is not (usually) a simple value. If you are starting out in your engine development, you can use this buffer blindly as a single value for non-nullable indexes with only 1 column.

The basic format is this (a small parsing sketch follows the list):

  • The buffer is in the order of the index. The first column in the index is first in the buffer, the second column second, and so on.
  • The buffer must be zero-filled. The server kernel will use memcmp to compare two key values.
  • If the column is NULLable, then the first byte is set to 1 if the column is null. Else, 0 means not-null.
  • From ha_innodb.cc (for BLOBs, which I haven’t put in embedded_innodb yet): If the column is of a BLOB type (it must be a column prefix field in this case), then we put the length of the data in the field to the next 2 bytes, in the little-endian format. If the field is SQL NULL, then these 2 bytes are set to 0. Note that the length of data in the field is <= column prefix length.
  • For fixed length fields (such as INT), the field’s value occupies the next N bytes, where N is the maximum length of the field.
  • For VARCHAR, there is always a 2 byte (in little endian) length. This is different to the row format, which may have 1 or 2 bytes. In the key tuple format it is ALWAYS two bytes.
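
As an illustration (mine, so double-check against your server version), here’s how an engine might walk a two-part key tuple, a nullable INT followed by a VARCHAR, laid out as described above. Endianness handling is simplified to a little-endian host.

#include <cstring>
#include <stdint.h>

struct ExampleKeyParts
{
  bool     int_is_null;
  int32_t  int_value;                /* only meaningful if not NULL */
  uint16_t varchar_length;
  const unsigned char *varchar_data;
};

static void parse_key_tuple(const unsigned char *key, ExampleKeyParts *out)
{
  const unsigned char *pos= key;

  out->int_is_null= (*pos++ == 1);           /* NULL byte: this column is nullable */
  memcpy(&out->int_value, pos, sizeof(int32_t));
  pos+= sizeof(int32_t);                     /* fixed-length part: max field length */

  memcpy(&out->varchar_length, pos, 2);      /* VARCHAR: always a 2-byte length */
  pos+= 2;
  out->varchar_data= pos;                    /* value bytes (zero-filled buffer) */
}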

I’ll discuss the use of this for rnd_pos() and position() in a later post…

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Thoughts on Thoughts on Drizzle :)

Mark has some good thoughts on Drizzle. I think they’re all valid… and I have some extra thoughts too:

“I have problems to solve today”. This is (of course) an active concern in my brain… If we don’t have something out that solves some set of problems with reasonable stability and reliability (and soon), then we are failing. I feel we’re getting there, and will have a solid foundation to build upon.

Drizzle replication, MySQL replication: “I can’t compare the two until Drizzle replication is running in production.“. Completely agree. We need to only say replication is stable and reliable when it really is. Realistic test suites are needed. Very defensive programming of the replication system is needed (you want to know when something has gone wrong). We also need to have it constantly be verifying the right thing is going on. We want our problems to be user visible, not silent and invisible. Having high standards will hopefully pay off when people start running it in production….

3 byte int: “Does this mean that some of my tables will grow from 3GB to 4GB on disk?” I think we’re moving the responsibility down to the engines. The 3 byte int type says two things: use less storage space, limit the maximum value. Often you want the former, not the latter. There are many ways to more efficiently pack integers for storage when they are usually smaller than the maximum you want. The protobuf library does a good job of it.
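
For the curious, this is the kind of packing protobuf does (a from-memory sketch of its varint encoding): seven payload bits per byte with the high bit marking “more bytes follow”, so small values cost one byte no matter how wide the column is declared.

#include <stddef.h>
#include <stdint.h>

/* Encode 'value' as a protobuf-style varint into 'out' (needs up to 10 bytes).
   Returns the number of bytes written; e.g. 300 encodes to 2 bytes. */
static size_t encode_varint(uint64_t value, unsigned char *out)
{
  size_t length= 0;
  do
  {
    unsigned char byte= value & 0x7F;   /* low seven bits */
    value>>= 7;
    if (value != 0)
      byte|= 0x80;                      /* continuation bit */
    out[length++]= byte;
  } while (value != 0);
  return length;
}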

I think it is the job of storage engines to do better here. Once you’re in memory, 3 byte numbers are horrible to work with: copy out of a row buffer, convert into a 32bit number and then do foo. Modern CPUs favor 32 or 64bit alignment of data a *lot*. 3-byte numbers do not align to 32 or 64 bits very well… making things much slower for the common case of using cached data.

“I need stored procedures. They are required for high-performance OLTP as they minimize transaction duration for multi-statement transactions.” The reduction of network round trips is always crucial. I think a lot of round trips could go away if you could issue multiple statements at once (not via semicolon separating them, by protocol awesomeness).

There should be a way to send a set of statements that should be executed. There should also be a way to specify that if no error occurred, commit. This could then be (in the common case) a single round trip to the database. You then only have to make round-trips when what statement to issue next depends on the result of a previous one. The next step being to reduce these round trips… which can either be solved by executing something inside the database server (e.g. stored procedures) or something closer to the database server so that the round trips aren’t as large. This would be where Gearman enters.

I’m interested to see where these two approaches (issuing in batches and executing closer to the DB server) fall down… I know that latency may not be as good… but throughput should be a lot better.

I take heart with “I have yet to use them in MySQL” though. I have my own theories as to why this is… my biggest thought is that it’s because the many, many programmers writing SQL that Mark sees aren’t SQL Stored Procedure programmers. They spend their days in a couple of languages (maybe Perl, Python, PHP, Java, C, C++) and never programmed SQL:2003 Stored Procedures and it just doesn’t come as quickly (or as bug free) as writing code in the languages you use every day.

“Long Running insert, update and delete statements consume too many resources in InnoDB.” I wonder if this desire for MyISAM could be filled by PBXT or BlitzDB? The main reason that MyISAM is currently a temporary table only engine is that MyISAM and the server core were never that well separated.

My ultimate wish is that all engine authors take the approach that there is an API to their engine and that the Storage Engine is merely glue between the database server and that API.

The BlitzDB engine has this, Innobase partially does (and my Embedded InnoDB work goes the whole way) and MySQL Cluster is likely the oldest example.

As a side note, the BlitzDB plugin should go into the main Drizzle tree fairly soon. One of the joys of having an optional plugin that doesn’t touch the core of the server is that we can do this without much worry at all.

“Does Drizzle build on Windows?” Well… no. Funnily enough though, it is increasingly easy to make a Windows port. All the platform specific things are increasingly just plugins. The build system is a sticking point… and no, we’re not going to switch to CMake. The C stands for something, and it’s something that even I may not print here… (I had never thought that being able to open up automake-generated Makefiles and look at them would be a feature).

This next Drizzle milestone release should be exciting though…

I look forward to having Drizzle widely deployed and relied upon… I think we’ll do well..

Writing A Storage Engine for Drizzle, Part 2: CREATE TABLE

The DDL code paths in Drizzle are increasingly different from MySQL. For example, the embedded_innodb StorageEngine CREATE TABLE code path is completely different from what it would have to be for MySQL. There are a number of reasons for this, the primary one being that Drizzle uses a protobuf message to describe the table format instead of several data structures and an FRM file.

We are pretty close to having the table protobuf message format be final (there are a few bits left to clean up, but expect them done Real Soon Now (TM)). You can see the definition (which is pretty simple to follow) in drizzled/message/table.proto. Also check out my series of blog posts on the table message (more posts coming, I promise!).

Drizzle allows either your StorageEngine or the Drizzle kernel to take care of storage of table metadata. You tell the Drizzle kernel that your engine will take care of metadata itself by specifying HTON_HAS_DATA_DICTIONARY to the StorageEngine constructor. If you don’t specify HTON_HAS_DATA_DICTIONARY, the Drizzle kernel stores the serialized Table protobuf message in a “table_name.dfe” file in a directory named after the database. If you have specified that you have a data dictionary, you’ll also have to implement some other methods in your StorageEngine. We’ll cover these in a later post.

If you have ever dealt with creating a table in MySQL, you may recognize this method:

virtual int create(const char *name, TABLE *form, HA_CREATE_INFO *info)=0;

This is not how we do things in Drizzle. We now have this function in StorageEngine that you have to implement:

int doCreateTable(Session* session, const char *path,
                  Table& table_obj,
                  drizzled::message::Table& table_message);

The existence of the Table parameter is largely historic and at some point will go away. In the Embedded InnoDB engine, we don’t use the Table parameter at all. Shortly we’ll also get rid of the path parameter, instead having the table schema in the Table message and helper functions to construct path names.

Methods named “doFoo” (such as doCreateTable) mean that there is a method named foo() (such as createTable()) in the base class. It does some base work (such as making sure the table_message is filled out and handling any errors) while the “real” work is done by your StorageEngine in the doCreateTable() method.

The Embedded InnoDB engine goes through the table message and constructs a data structure for the Embedded InnoDB library to create a table. The ARCHIVE storage engine is much simpler, and it pretty much just creates the header of the ARZ file, mostly ignoring the format of the table. The best bet is to look at the code from one of these engines, depending on what type of engine you’re working on. This code, along with the table message definition should be more than enough.
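
As a rough sketch of the ARCHIVE-style approach (this is not the real ARCHIVE or embedded_innodb code, and the side-file name is made up; SerializeToString() is the standard protobuf call):

#include <fstream>
#include <string>
#include <drizzled/message/table.pb.h>   /* generated from table.proto */

static int sketch_doCreateTable(const char *path,
                                drizzled::message::Table &table_message)
{
  std::string serialized;
  if (! table_message.SerializeToString(&serialized))
    return 1;                            /* map to an engine error code */

  /* Keep the table message with the engine's own data (here, a side file). */
  std::ofstream out((std::string(path) + ".mymeta").c_str(),
                    std::ios::binary | std::ios::trunc);
  out.write(serialized.data(), serialized.size());
  return out.good() ? 0 : 1;
}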

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Continuing the journey

A couple of months ago (December 1st for those playing along at home) it marked five years to the day that I started at MySQL AB (now Sun, now Oracle). A good part of me is really surprised it was for that long and other parts surprised it wasn’t longer. Through MySQL and Sun, I met some pretty amazing people, worked with some really smart ones and formed really solid and awesome friendships. Of course, not everything was perfect (sometimes not even close), but we did have some fun.

Up until November 2008 (that’s 3 years and 11 months for those playing at home) I worked on MySQL Cluster. Still love the product and love how much better we’re making Drizzle so it’ll be the best SQL interface to NDB :)

The ideas behind Drizzle had been talked about for a while… and with my experience with internals of the MySQL server, I thought that some change and dramatic improvement was sorely needed.

Then, in 2008, Brian created a tree. I was soon sending in patches at nights, we announced to the whole world at OSCON and it captured a lot of attention.

Since November 2008 I’ve been working on Drizzle full time. It was absolutely awesome that I had the opportunity to spend all my days hacking on Drizzle – both directly with fantastic people and for fantastic people.

But… the Sun set… which was exciting and sad at the same time.

Never to fear! There were plenty of places wanting Drizzle hackers (and MySQL hackers). For me, it came down to this: “real artists ship”. While there were other places where I would no doubt be happy and work on something really cool, the only way I could end up working out where I should really be was: what is the best way to have Drizzle make a stable release that we’d see be suitable for deployment? So, Where Am I Now?

Rackspace.

Where I’ll again be spending all my time hacking Drizzle.

Writing A Storage Engine for Drizzle, Part 1: Plugin basics

So, you’ve decided to write a Storage Engine for Drizzle. This is excellent news! The API is continually being improved and if you’ve worked on a Storage Engine for MySQL, you’ll notice quite a few differences in some areas.

The first step is to create a skeleton StorageEngine plugin.

You can see my skeleton embedded_innodb StorageEngine plugin in its merge request.

The important steps are:

1. Create the plugin directory

e.g. mkdir plugin/embedded_innodb

2. Create the plugin.ini file describing the plugin

Create the plugin.ini file in the plugin directory (so it’s plugin/plugin_name/plugin.ini).
An example plugin.ini for embedded_innodb is:

[plugin]
title=InnoDB Storage Engine using the Embedded InnoDB library
description=Work in progress engine using libinnodb instead of including it in tree.
sources=embedded_innodb_engine.cc
headers=embedded_innodb_engine.h

This gives us a title and description, along with telling the build system what sources to compile and what headers to make sure to include in any source distribution.

3. Add plugin dependencies

Your plugin may require extra libraries on the system. For example, the embedded_innodb plugin uses the Embedded InnoDB library (libinnodb).

Other examples include the MD5 function requiring either openssl or gnutls, the gearman related plugins requiring gearman libraries, the UUID() function requiring libuuid and BlitzDB requiring Tokyo Cabinet libraries.

For embedded_innodb, pandora-build has a macro for finding libinnodb on the system. We want to run this configure check, so we create a plugin.ac file in the plugin directory (i.e. plugin/plugin_name/plugin.ac) and add the check to it.

For embedded_innodb, the plugin.ac file just contains this one line:

PANDORA_HAVE_LIBINNODB

We also want to add two things to plugin.ini; one to tell the build system only to build our plugin if libinnodb was found and the other to link our plugin with libinnodb. For embedded_innodb, it’s these two lines:

build_conditional="x${ac_cv_libinnodb}" = "xyes"
ldflags=${LTLIBINNODB}
Not too hard at all! This should look relatively familiar for those who have seen autoconf and automake in the past.

Some plugins (such as the md5 function) have a bit more custom auto-foo in plugin.ini and plugin.ac (as one of two libraries can be used). You can do pretty much anything with the plugin system, but you’re a lot more likely to keep it simple like we have here.

4. Add skeleton source code for your StorageEngine

While this will change a little bit over time (and is a little long to just paste into here), you can see what I did for embedded_innodb in the skeleton-embedded-innodb-engine tree.

5. Build!

You will need to re-run ./config/autorun.sh so the build system picks up your new plugin. When you run ./configure --help afterwards, you should see options for building with/without your new plugin.

6. Add a test

You will probably want to add a test to see that your plugin loads successfully. When your plugin is built, the test suite automatically picks up any tests you have in the plugin/plugin_name/tests directory. This is in the same format as general MySQL and Drizzle tests: tests go in a t/ directory, expected results in a r/ directory.

Since we are loading a plugin, we will also need some server options to make sure that plugin is loaded. These are stored in the rather inappropriately named test-master.opt file (that’s the test name with “-master.opt” appended to the end instead of “.test“). For the embedded_innodb plugin_load test, we have a plugin/embedded_innodb/tests/t/plugin_load-master.opt file with the following content:

--plugin_add=embedded_innodb

You can have pretty much anything in the plugin_load.test file… if you’re fancy, you’ll have a SELECT query on data_dictionary.plugins to check that the plugin really is there. Be sure to also add an r/plugin_load.result file (my preferred method is to just create an empty result file, run the test suite and examine the rejected output before renaming the .reject file to .result).

Once you’ve added your test, you can run it either by just typing “make test” (which will run the whole test suite), or you can go into the main tests/ directory and run ./test-run.pl --suite=plugin_name (which will just run the tests for your plugin).

7. Check the code in, feel good about self

and you’re done. Well… the start of a Storage Engine plugin is done :)

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

NDB$INFO with SQL hits beta

Bernhard blogged over at http://ocklin.blogspot.com/2010/02/mysql-cluster-711-is-there.html that MySQL Cluster 7.1.1 Beta has been released. The big feature (from my point of view) is the SQL interface on top of NDB$INFO. This means there is now full infrastructure from the NDB data nodes right out to SQL in the MySQL Server for adding monitoring to any bit of the internals of the data nodes.

Shocked and Stunned (that code exists and does work)

#define READ_ALL		1	/* openfrm: Read all parameters */
#define EXTRA_RECORD		8	/* Reservera plats f|r extra record */

and later on….

  if (prgflag & (READ_ALL+EXTRA_RECORD))
    records++;

Feel free to think about that for a second.

(I have an urge to add this to questions asked in a job interview…)

Drizzle FRM replacement: the table proto

Drizzle originally inherited the FRM file from MySQL (which inherited it from UNIREG). The FRM file stores metadata about a table; what columns it has, what type those columns are, what indexes, any default values, comments etc are all stored in the FRM. In the days of MyISAM, this worked relatively well. The row data was stored in table.MYD, indexes on top of it in table.MYI and information about the format of the row was in table.FRM. Since MyISAM itself wasn’t crash safe, it didn’t really matter whether creating/deleting the FRM file along with the table was crash safe either.

As more sophisticated engines were introduced (e.g. InnoDB) that had their own data dictionary, there started to be more of a problem. There were now two places storing information about a table: the FRM file and the data dictionary specific to the engine. Even if the data dictionary of the storage engine was crash safe, the FRM file was not plugged into that, so you could end up in a situation where the storage engine recovered from a crash okay, but the FRM was incorrect for what the engine recovered to. This would always require manual intervention to find out what went wrong and then fix it (in some rather unusual ways).

When the MySQL Cluster (NDB) engine was introduced, a new set of problems arose. Now the MySQL server was connecting to an existing database, where tables could be created on other nodes connected to the cluster. You now not only had the problems of crash recovery, but the problems of keeping the FRM files in sync across many nodes, requiring all sorts of interesting solutions that, for the most part, do work.

The “obvious” solution to some of these problems would be for an engine to write out an FRM file itself. This is much easier said than done. The file format was never created to be read and written by multiple pieces of software, the code that did the reading and writing inside the server was not reusable elsewhere and the only documentation (that wasn’t a decent chunk of the MySQL source tree) is the rather incomplete definition in the MySQL Internals wiki (http://forge.mysql.com/wiki/MySQL_Internals_File_Formats) – not nearly enough to write a correct FRM file as the specifics are very, very odd.

Our goals for reworking the metadata system in Drizzle were: to allow engines to own their own metadata (removing any opportunity to have inconsistencies between the engine and the ‘FRM’) and for engines without their own data dictionary, to replace the FRM file format with something simple and well documented.

One option was to use SQL as the standard storage format, but it is rather non-trivial and expensive to parse – especially if we were to use it as the preferred way of talking table definitions with storage engines. We had been looking at the protobuf library (http://code.google.com/p/protobuf/) ever since its first release and it has a number of very nice characteristics: a description language of a data structure that is then used to generate APIs for reading and writing it in a number of programming languages and a standard (documented) way to serialize the data structure.

After a bit of discussion, we arrived at a good outline for the table definition proto. The current one can always be found in the Drizzle source tree at drizzled/message/table.proto. The current format is very close to final (i.e. one that we’ll support upgrades from).

The process of modifying the Drizzle code base so that it would write (and read) a file format different to the FRM isn’t worth going too much into here although there were some interesting hurdles to overcome. An interesting one was the FRM file contains a binary image of the default row for the table (which is in the row format that the server uses); we now store the default value for each column in the proto and generate the default row when we read the proto. Another interesting one was removing and refactoring “pack_flag” – the details of which should only be extracted from Jay or Stewart with a liberal application of fine ale.

The end result is that we now have storage engines that are completely responsible for their own metadata. One example is the ARCHIVE engine. In the CREATE TABLE code path, the ARCHIVE storage engine gets the table definition in an object that represents the table proto. It can examine the parameters it needs to and then either store the proto directly, or convert it into its own format. Since ARCHIVE is simple, it just serialises the table proto (using a standard function provided by the protobuf library) and stores it in the .ARZ data file for the table. This instantly makes the ARCHIVE storage engine crash safe for CREATE and DROP TABLE, as there is only one file on disk, so there are no two files to get out of sync.

If an engine does not have its own data dictionary, it can still use the default implementation which just stores the serialised table proto in a file on disk.
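
Reading one of those files back is just as simple. Here is a sketch (the generated header name is assumed) of what a tool along the lines of table_raw_reader does:

#include <fstream>
#include <iostream>
#include <drizzled/message/table.pb.h>   /* generated from table.proto */

static int dump_table_definition(const char *dfe_path)
{
  std::ifstream input(dfe_path, std::ios::binary);
  drizzled::message::Table table;

  if (! table.ParseFromIstream(&input))  /* standard protobuf API */
  {
    std::cerr << "Could not parse " << dfe_path << std::endl;
    return 1;
  }
  std::cout << table.DebugString();      /* human-readable dump of the proto */
  return 0;
}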

We can also now use this interface to move INFORMATION_SCHEMA into its own storage engine. This means we can remove a lot of special case code throughout the server for INFORMATION_SCHEMA and instead just have an INFORMATION_SCHEMA storage engine that tells the kernel which tables it provides in the INFORMATION_SCHEMA database. Because the table definition is now in a documented format with a standard API, this becomes a relatively trivial exercise.

What we’re all looking forward to is when the InnoDB data dictionary is linked into the new interface and we can have a truly crash safe database server.

Another wonderful side effect is that since we now have a standard data structure for representing a table definition, we can integrate this with the replication system. In the “near” future, we can represent a CREATE TABLE in the replication stream as a table proto and not as raw SQL. If you wanted to apply the replication stream to a different database server, you would then only have to write a table proto to SQL converter. If the target database system doesn’t do SQL at all, you could generate API calls to create the table.

So we now have a rather flexible system in place, with the code implementing it being increasingly simple and possible to be “obviously correct”.

Things that easily fall out of this work that people have written about:
– CREATE TABLE LIKE with ENGINE clause
http://krow.livejournal.com/671235.html
– table_raw_reader – looking at the raw representation of table metadata
http://www.flamingspork.com/blog/2009/10/01/table_raw_reader-reading-the-table-proto-from-disk-and-examining-everything/
– Table discovery
http://www.flamingspork.com/blog/2009/07/29/table-discovery-for-drizzle-take-2-now-merged/

Some more info:
http://krow.livejournal.com/642329.html

Bazaar importmbox plugin

Releasing and announcing software is win! I’ve had this bumming around for a bit, and for me (and I think others hacking on MySQL) it’s been rather useful. Simple plugin that takes each email in an mbox, applies the patch and commits it with the correct author to a bzr repo. Very useful if you use quilt and bzr together (“quilt mail --mbox” and then “bzr importmbox”).

I finally published it up at:

http://launchpad.net/bzr-importmbox

enjoy.

Return of the “Top 5 MySQL Wishlist” and looking at Drizzle

It’s coming up on a year since I started working full time on Drizzle. So, I got a bit reflective…

Have we done things that I (and others) really wanted done? Back in 2007, I wrote my top 5 wishlist for the MySQL Server.

I am not going to pretend I speak for the MySQL development team; I’m just trying to evaluate how Drizzle is doing against some wishlists that (to me) embodied some of the reasons we started Drizzle.

Please think of this as “database server wishlists” and comparing them against Drizzle….

My wishlist was:

5. Six-monthly release cycles

Done. Not only does Drizzle have milestone releases, but we’re also dropping tarballs every two weeks (currently for the bell milestone). We’re also doing a decent job of keeping trunk free of massive breakage.

4. Much more in depth automated testing

Done (and in progress). We have drizzle-automation running things all the time. Hudson (and buildbot) test across many platforms (what pushbuild did when we were inside MySQL, and more) before code hits trunk. We also have regular performance benchmarks that we compare across versions, crash-me, the random query generator, as well as checking that we don’t regress in code size (via sloccount).

3. Sane build system

(slightly distorting my original words). Well, we’re not quite at the “ready for packaging in distributions” stage, but a debug or non-debug build is just -g and optimization level for the compiler, and plugins are using autofoo to work out if they can be built… so yeah, this is pretty sane.

We also are building with -Werror (and more!) which increases code quality no end.

So, mostly done.

3.5. (yes, I have a 3.5): Kill HPUX

Done.

2. Increased liberal use of asserts

An in-progress thing, but the better compiler warnings have won us a lot.

1. Pluggable data dictionary

Not only that, but done away with FRM totally. Really happy with this.

What about other people’s wishlists?

Kostya had one:

1. Remove excessive fuss.

i.e. “just do it”.

I think we’re doing really well with this for Drizzle. Plugins are pretty easy to get merged, and if your patches to the kernel are good, they’re also easy. Big changes can be harder, but in the end it has turned out well.

2. Open the development process.

Done. There is no internal wiki, there are no “committers” versus “non-committers”.. everything is judged on merit of the idea/code. Sometimes the most valuable contribution is somebody telling you their real world experience.

3. Get to a normal release schedule.

Done.

4. Establish productive relationship with the majority of users.

I think the drizzle-discuss mailing list is doing quite well in this regard. It is quite active with discussion.

5. Find a way to do incompatible changes with minimal pain for users.

We’ll see how we go :)

Ronald also had a list:

1. Real time Query Monitoring

With gearman logging and my recent experimentation with using CPU performance counters I think we’ll end up somewhere rather awesome.

If you’re looking for MySQL monitoring though, the MySQL Enterprise query monitoring stuff looks pretty good to me.

2. Consistent Release Cycles

We’re doing pretty well so far!

3. INFORMATION_SCHEMA Extensions

We’ve inherited the architecture from MySQL 5.1 (and 6.0) of being able to pretty easily add INFORMATION_SCHEMA tables and improved it. It’s currently pretty easy to add them. We also have ongoing work having an INFORMATION_SCHEMA storage engine which means that you won’t have to have the I_S tables be materialized every time you query them.

4. Online table maintenance

All progress has been due to Storage Engine authors. With the data dictionary work though, this gets easier and saner to do.

5. Published benchmarks

We’re encouraging others who will be more objective :) Although we also do regular performance regression tests as part of our standard development process.

Dormando also had a list (complete with “there is no five”):

1) Logical separation of connections from threads

We have this in Drizzle through plugins. Interesting ones are pool_of_threads (fixed number of threads), multithread (thread per connection) and single_thread (one thread).

2) A more modular core

We’re very much doing well here. It’s a long process, but I’m quite impressed by our progress.

3) Better replication (better replication management/protocol?)

The work being done on Drizzle replication is really exciting. I love the fact that modularity is encouraged and the ability to replace any bit you want easily, as well as read the replication stream in about any language you want.

4) Better test suite

I may never be 100% happy with a test suite, but we’re doing good…

PeterZ‘s view is always interesting, and he had one too:

1. Be Pluggable

Check. (and also, of course, in progress)

2. Be Scalable

We’ve done a lot of work scaling on many CPUs and many connections. Really, 8 concurrent database connections just isn’t interesting. We even run up to 2048 concurrent connections as part of our regression suite.

3. Be Distributed

With the new protocol work to have built in sharding, plugins for logging and replication via Gearman, we’re getting better.

4. Be Solid

This will be a test for us. I think we should end up pretty good because of a number of reasons:

    • clearer, easier to understand code without nasty side effects or really odd things (e.g. relying on a bool storing the value 2)
    • Better modularity (a module you don’t use and don’t load cannot screw you up)
    • Smaller core and removal of problematic features.
    • All the testing stuff I’ve previously mentioned.

So I hope we’re going to be okay here.

5. Don’t forget about the roots

The group of us working at Sun on Drizzle have said we want to focus on being awesome for large scale Web apps while enabling others to make Drizzle good for other things. I think this is the right approach to not forget our roots (and target users) while allowing it to be awesome for any use somebody wants to have of it.

From a Storage Engine author’s PoV, Paul had some insights while thinking about PBXT:

1. A generic engine test suite

We’re doing pretty well… the whole Drizzle test suite runs with InnoDB, and doesn’t require much change to get going with another transactional engine. The proof is in the Drizzle PBXT branch! But I also think we could do better and have a test suite more directed at each part of an engine (including error cases!).

2. Internal APIs

Paul mentions FRMs, which are gone :) In their place is a simple interface that engines can implement for ever increased functionality (i.e. they own their own metadata). We’re getting better in other places too.

3. Customizable table and column attributes

MariaDB has this now, and we have space in our table definition proto, but not at the parser level (yet).

4. Push-down restrict and join conditions.

Not yet, and not for a little while.

5. Custom data types

It’ll be great when we rework the type system even more so that this really is as easy as it should be – not only from a SQL level but also for adding new types as server modules.

Finally, Antony’s wish list:

1. Modular Architecture

Covered.

2. libmysys as a separate project

We’ve removed it where we can, and are using gnulib where we can. It very much improves the situation when you ditch weird-ass platforms and assume some level of POSIX.

3. New/modular parser

We’re getting close to a stage where you could load a different parser… not there, but relatively close. It would still be messy, but a lot better than even 6 months ago.

4. Unit tests for server components

With our move towards modularity, this is actually getting possible!

5. Aggregate Stored functions and External Stored Procedures

We don’t currently have either, we have decent thoughts though.

Antony also cheated and added a few more:

A new Recursive Descent parser

The required work to be able to replace the parser is in progress.

Abstract Syntax Trees

See above.. getting the pre-work done.

SCTP and/or link aggregation

We’ll see improvements around this with the new drizzle protocol.

Parsing within the client

We’ve had some very good discussion on the drizzle-discuss list. We’ll no doubt have something to help remove more of the cycles used in executing a query.

Integrated Federation

That’s the game plan :)

Elimination of FRM Files

I never get tired of saying this is done :)

Elimination of errmsg.sys

We’re now just using gettext, like every other free software project on the planet. Although I think we could take a few steps in making errors more easily parsable by code.

So how do we stack up?

I think we’re doing pretty well. There’s still a lot of work to get where we want to be, but it’s amazing how much progress we’ve made in the short time we’ve been around.

I also just realized I missed Jay’s list… but we’re doing pretty well there too.

MySQL Storage Engine SLOCCount over releases

For a bit more info: what about the various storage engines over MySQL releases? Have they changed much? Here we’re looking at the storage/X/ directory for code, so for some engines this excludes the handler that interfaces with the MySQL Server.

You can view the data on the spreadsheet.