Shocked and Stunned (that code exists and does work)

#define READ_ALL		1	/* openfrm: Read all parameters */
#define EXTRA_RECORD		8	/* Reservera plats f|r extra record */

and later on….

  if (prgflag & (READ_ALL+EXTRA_RECORD))
    records++;

Feel free to think about that for a second.

(I have an urge to add this to questions asked in a job interview…)

Bullshit Tiger. bullshit.

Screenshot-3

It also took them over an hour to make the announcement that the flight had been delayed. The screens just flashed “Boarding” for an hour (and they still are).

Delays are fine. Complete and total lack of any communication whatsoever is not.

Simply jumping on a microphone and saying “we’re experiencing a delay, we’ll have more information for you in half an hour” would be better. We seriously heard nothing for a whole hour after “Boarding” had started.

Just a closed door.

Somehow I feel this will be the last time I fly this airline.

The Table protobuf message format

If you’ve ever opened up drizzled/message/table.proto in the Drizzle source tree you will have seen what’s in the table message: the structure that describes a database table in Drizzle. Previously I’ve talked about the Table message more generally, giving a fair bit of history of the FRM file and how we’ve replaced it with both the Table protobuf message and an infrastructure inside Drizzle so that Storage Engines own their own metadata.

Yesterday I talked about the Schema protobuf message format in more detail, and this time I’m talking about the Table protobuf message in a similar amount.

The first time we were loading (then only part of) the table definition out of a protobuf message was way back in January 2009 (I blogged about it too). It was an adventure untangling all sorts of things to get to a much nicer place (where we are now). The code in the server is not perfect… I’ll be the first to admit that some of it is rather strange, but that’s mostly all behind the scenes for people interested in the protobuf Table message!

The Table message has several embedded messages in it too. We need to have information on the Storage Engine, Fields and Indexes (and each of those can have other properties). It is much more complex than the simple Schema message.

Let’s have a look at the basic structure of the Table message:

message Table {

    /* *SNIP* (Here goes the definitions for TableType, StorageEngine, Field, Index, ForeignKeyConstrain, TableOptions and TableStats) */

  required string name = 1;
  required TableType type = 5;
  required StorageEngine engine = 2;
  repeated Field field = 3;
  repeated Index indexes = 4;

  repeated ForeignKeyConstraint fk_constraint = 8;
  optional TableOptions options = 9;
  optional TableStats stats = 10;
}

(We’ve skipped the definitions for the embedded messages for now)

This seems all pretty logical; a table has a name, a type, is in a Storage Engine, has Fields, may have Indexes, may have foreign key constraints, it has some options and statistics (the statistics may go away at some point “soon”).

Let’s have a look at the TableType message definition:

  enum TableType {
    STANDARD = 0;
    TEMPORARY = 1;
    INTERNAL = 2;
  }

It’s pretty simple, the table type is either a standard table (what you get from CREATE TABLE), a temporary table (what you get from CREATE TEMPORARY TABLE) or an INTERNAL table (what you get when Drizzle uses a temporary table during query execution).

Next, the StorageEngine message:

  message StorageEngine {

    message EngineOption {
      enum EngineOptionType {
        BOOL = 0;
        INTEGER = 1;
        STRING = 2;
      }

      required string option_name = 1;
      required string option_value = 2;
      required EngineOptionType option_type = 3;
    }

    required string name = 1;
    repeated EngineOption option = 2;
  }

The main part is the “name” member, which is just the name of the storage engine (e.g. “PBXT”,  “INNODB”, “ARCHIVE”). We do however have support specified in the StorageEngine message for engine specific options (in key value form). Expect these to be used more in the near future.

Specifying Fields is probably the most complex part of the table message. The Field message looks like this (with many embedded messages):

message Field {
    required string name = 1;
    required FieldType type = 2;
    optional FieldFormatType format = 3;
    optional FieldOptions options = 4;
    optional FieldConstraints constraints = 5;
    optional NumericFieldOptions numeric_options = 6;
    optional StringFieldOptions string_options = 7;

    optional string comment = 16; /* Reserve 0-15 for frequently accessed attributes */
    optional SetFieldOptions set_options = 17;
    optional TimestampFieldOptions timestamp_options = 18;
}

So… what does this all mean? Well, Fields have a type, they’re stored in a format, there’s options attached to them, there may be constraints as well as field type specific options.

The different field types should be fairly familiar by now:

    enum FieldType {
      DOUBLE = 0;
      VARCHAR = 1;
      BLOB = 2;
      ENUM = 3;
      INTEGER = 4;
      BIGINT = 5;
      DECIMAL = 6;
      DATE = 7;
      TIME = 8;
      TIMESTAMP = 9;
      DATETIME = 10;
    }

We also allow fields in different formats. Currently, these are default, fixed and dynamic. The idea is you can tell the engine (or the engine can tell you) how it’s storing the field. This is currently here as a nicety and the users for this are few and far between.

    enum FieldFormatType {
      DefaultFormat= 0;
      FixedFormat= 1;
      DynamicFormat= 2;
    }

The FieldOptions get interesting though:

    message FieldOptions {
      optional string default_value = 1;
      optional string update_value = 2;
      optional bool default_null = 3 [default = false];
      optional bytes default_bin_value = 4;
    }

You’ll no doubt be intrigued by the existence of both “default_value” and “default_bin_value”. Ordinarily, using a string to contain a textual representation of the default value (e.g. “foo” or “42”) is fine. However, for BLOB columns, you can have defaults that aren’t representable in a text string, you need binary data (e.g. the default value contains ‘\0’).

For TIMESTAMP columns, we continue to support DEFAULT NOW() and the ability to update the timestamp column on UPDATE. How is this represented in the table message? Well… default_value will be “NOW()” and update_value will be “NOW()”. It is intended that in the future it will be possible to have arbitrary SQL expressions for these. This does, of course, require support in the Drizzle server.

The default_null bool should be rather obvious :)

Well… that’s enough for today. Next time: more of the Field message!

The Schema protobuf message: Drizzle’s metadata on a schema

I’ve previously talked about table metadata in Drizzle and how we use the table protobuf message to describe a table (see Drizzle FRM Replacement and others). The model in Drizzle is that the engine is responsible for its metadata. For schemas (you may be thinking ‘database’ but we’re moving to the Schema terminology in Drizzle) we also have a small amount of metadata.

The protobuf message is specified in drizzled/message/schema.proto and is incredibly short. In fact, here it is in its entirety:

1
package drizzled.message;
2
option optimize_for = SPEED;
3
4
message Schema {
5
  required string name = 1;
6
  optional string collation = 2;
7
}

We don’t keep an awful lot of metadata about schemas. A Schema has a name and it has a default collation.

You can also read the db.opt file directly using the provided (and very simple) schema_reader utility.

In the near future, we could have CREATE DATABASE and CREATE SCHEMA replicated via this protobuf message. This would make it extremely easy to parse for utilities parsing the replication stream.

We’ll also (rather shortly) have key,value pairs for options to CREATE SCHEMA/CREATE DATABASE. More on that later :)

Photos of Burning Man: Getting to Black Rock City

This year was my first burn. More amazing than I could have imagined. I think it was day two when Brian caught me saying “so, next year what we’re going to do…”

Due to the harsh environmental conditions, I wasn’t too keen on the idea of taking my digital gear (it ain’t cheap) and had the idea of handing cameras to people and having a kind of communal photo album (planning for a larger scale implementation of this next year). So… I went purely film. Several older and smaller compact 35mm cameras that I picked up either for nothing or next to (no loss if lost or dead!) plus a Ricoh SLR was my arsenal.

Now… that means I need film.

I mainly shot the new Kodak Ektar 100 and a Kodak Ektachrome E100VS. For the smaller, cheap 35mm ones, I just used some Fuji Superia.

trees in Portland

Trees in Portland

Firstly though, there was a stop in Portland to a) recover from jetlag and b) hang out with Eric, Wendy and their dogs. I do like Portland, quite a lot actually. While there, managed to get some work done, fiddle with some SPARC hardware that Eric has, enjoy excellent vegan ice cream, enjoy awesome vegan food (both at home and out) and walk around both downtown and up in the hills. Portland (and Oregon) is certainly pretty.

trees in seattle

Trees in Seattle

Before heading to Burning Man, I was in Seattle, where Leah joined me to prep for (and then go to) Burning man.

Then Leah and I were in San Francisco for a day. This is when I started shooting exclusively film for the first time in… well.. Since 2002 (I got my first digital camera for linux.conf.au in 2003)

05

Leah enjoying going around San Francisco

We actually did some touristy things… so I saw a bit more of San Francisco than I have before. One issue with mostly being around San Francisco just before/after the MySQL Users Conference is a severe lack of time/energy to go for much exploring. Preparing is exhausting, and afterwards I just want to really chill out – usually heading out to some forest or down to Santa Cruz or just hanging out with cool people.

Across the Golden gate

Going over the Golden Gate Bridge

One of the most surprising things was running into David while just walking down the street. Although knew he was in town, and we’d planned to all go down to Burning Man together, actually running into somebody in the street always surprises me.

I’m pretty sure this was the first time that I went across the Golden Gate Bridge. Seen it, taken photos of it, used said photos as my desktop background, but this was the first time going across it and looking back on the city. Manual focus, moving bus: epic amounts of fun… I’d kinda forgotten how much fun this could be.

Sun going down in San Francisco

Sun going down in San Francisco

There are things I like about San Francisco, but if I had to call somewhere in the US home, Seattle and Portland are both much higher on the list. Maybe it’s because of the wonderful coffee of Seattle, or the laid backness and awesome vegan food of Portland or if I’m just delusional and think it’d be possible to catch a Nirvana gig in Seattle.

In the evening in San Francisco we met up with David again and went down to the beach. It was pretty. Somehow, I deluded myself into thinking “ISO 100 Film, no tripod, cold, sunset…. Photo time!” I did get one I quite like too:

The Sun setting over the water, San Francisco

The Sun setting over the water, San Francisco

After what can only be described as a “I love side impact airbags” car crash on the way back to the Hotel (everybody okay: shaken, not stirred. also not our fault), headed back for a stiff drink, some sleep and eagerly awaiting the drive to Reno and then Black Rock City.

Waiting in line to get into Black Rock City (for many, many hours)

Waiting in line to get into Black Rock City (for many, many hours)

It was not a short wait once we got to the gate. We did, however, not too long after sunrise, make it to camp. I have no idea where I shot this from… but it was before we got to camp (or at least the first photo of us helping to set up):

People arriving at Black Rock City: first thing in the morning

People arriving at Black Rock City: first thing in the morning

More to come… including setup of Pi Camp!

Drizzle FRM replacement: the table proto

Drizzle originally inherited the FRM file from MySQL (which inherited it from UNIREG). The FRM file stores metadata about a table; what columns it has, what type those columns are, what indexes, any default values, comments etc are all stored in the FRM. In the days of MyISAM, this worked relatively well. The row data was stored in table.MYD, indexes on top of it in table.MYI and information about the format of the row was
in table.FRM. Since MyISAM itself wasn’t crash safe, it didn’t really matter if creating/deleting the FRM file along with the table was either.

As more sophisticated engines were introduced (e.g. InnoDB) that had their own data dictionary, there started to be more of a problem. There were now two places storing information about a table: the FRM file and the data dictionary specific to the engine. Even if the data dictionary of the storage engine was crash safe, the FRM file was not plugged into that, so you could end up in a situation where the storage engine
recovered from a crash okay, but the FRM was incorrect for what the engine recovered to. This would always require manual intervention to find out what went wrong and then fix it (in some rather unusual ways).

When the MySQL Cluster (NDB) engine was introduced, a new set of problems arose. Now the MySQL server was connecting to an existing database, where tables could be created on other nodes connected to the cluster. You now not only had the problems of crash recovery, but the problems of keeping the FRM files in sync across many nodes, requiring
all sorts of interesting solutions that, for the most part, do work.

The “obvious” solution to some of these problems would be for an engine to write out an FRM file itself. This is much easier said than done. The file format was never created to be read and written by multiple pieces of software, the code that did the reading and writing inside the server was not reusable elsewhere and the only documentation (that
wasn’t a decent chunk of the MySQL source tree) is the rather incomplete definition in the MySQL Internals wiki (http://forge.mysql.com/wiki/MySQL_Internals_File_Formats) – not nearly enough to write a correct FRM file as the specifics are very, very odd.

Our goals for reworking the metadata system in Drizzle were: to allow engines to own their own metadata (removing any opportunity to have inconsistencies between the engine and the ‘FRM’) and for engines without their own data dictionary, to replace the FRM file format with something simple and well documented.

One option was to use SQL as the standard storage format, but it is rather non-trivial and expensive to parse – especially if we were to use it as the preferred way of talking table definitions with storage engines. We had been looking at the protobuf library
(http://code.google.com/p/protobuf/) ever since its first release and it has a number of very nice characteristics: a description language of a data structure that is then used to generate APIs for reading and writing it in a number of programming languages and a standard (documented) way to serialize the data structure.

After a bit of discussion, we arrived at a good outline for the table definition proto. The current one can always be found in the Drizzle source tree at drizzled/message/table.proto. The current format is very close to final (i.e. one that we’ll suppport upgrades from).

The process of modifying the Drizzle code base so that it would write (and read) a file format different to the FRM isn’t worth going too much into here although there were some interesting hurdles to overcome. An interesting one was the FRM file contains a binary image of the default row for the table (which is in the row format that the server uses); we now store the default value for each column in the proto and generate the default row when we read the proto. Another interesting one was removing and refactoring “pack_flag” – the details of which should only be extracted from Jay or Stewart with a liberal application of fine ale.

The end result is that we now have storage engines that are completely responsible for their own metadata. One example is the ARCHIVE engine. In the CREATE TABLE code path, the ARCHIVE storage engine gets the table definition in an object that represents the table proto. It can examine the parameters it needs to and then either store the proto directly, or convert it into its own format. Since ARCHIVE is simple, it just stores
the table proto in a serialised form (using a standard function provided by the protobuf library) and stores it in the .ARZ data file for the table. This instantly makes the ARCHIVE storage engine crash safe for CREATE and DROP table as there is only 1 file on disk, so no two files to get out of sync.

If an engine does not have its own data dictionary, it can still use the default implementation which just stores the serialised table proto in a file on disk.

We can also now use this interface to move INFORMATION_SCHEMA into its own storage engine. This means we can remove a lot of special case code throughout the server for INFORMATION_SCHEMA and instead just have a INFORMATION_SCHEMA storage engine that says it has the following tables in the INFORMATION_SCHEMA database. Because the table definition is now in a documented format with a standard API, this becomes a relatively trivial exercise.

What we’re all looking forward to is when the InnoDB data dictionary is linked into the new interface and we can have a truly crash safe database server.

Another wonderful side effect is since we now have a standard data structure for representing a table definition, we can integrate this with the replication system. In the “near” future, we can represent a CREATE TABLE in the replication stream as a table proto and not the raw SQL. If you were wanting to apply the replication stream to a different database server, you then only have to write a table proto to SQL
converter. If the target database system doesn’t do SQL at all, you could generate API calls to create the table.

So we now have a rather flexible system in place, with the code implementing it being increasingly simple and possible to be “obviously correct”.

Things that easily fall out of this work that people have written about:
– CREATE TABLE LIKE with ENGINE clause
http://krow.livejournal.com/671235.html
– table_raw_reader – looking at the raw representation of table metadata
http://www.flamingspork.com/blog/2009/10/01/table_raw_reader-reading-the-table-proto-from-disk-and-examining-everything/
– Table discovery
http://www.flamingspork.com/blog/2009/07/29/table-discovery-for-drizzle-take-2-now-merged/

Some more info:
http://krow.livejournal.com/642329.html

Bazaar importmbox plugin

Releasing and announcing software is win! I’ve had this bumming around for a bit, and for me (and I think others hacking on MySQL) it’s been rather useful. Simple plugin that takes each email in an mbox, applies the patch and commits it with the correct author to a bzr repo. Very useful if you use quilt and bzr together (“quilt mail –mbox” and then “bzr importmbox”).

I finally published it up at:

http://launchpad.net/bzr-importmbox

enjoy.

Learning Drizzle

Wanting to learn Drizzle? My friend Kent is doing just that, and documenting everything as he goes. It’s a great time to learn, as we head up to the bell milestone (which will be awesome) and big user visible changes are certainly slowing.

So, head on over to Learning Drizzle and learn Drizzle with Kent. Should be awesome.

First flesh wound in create_tmp_table()

If you have needed a good reason to drink heavily and forget, may I suggest taking a look at create_tmp_table() and those who call it. It’s probably one of the best illustrations of rot and awful, incomprehensible APIs in the server (Drizzle inherited it from MySQL).

In the normal paths for CREATE TABLE, you construct a data structure describing the table you want, create it and then open it. Opening a table gives you objects you can use to access it.

create_tmp_table() instead constructs these objects directly and then does some direct calls into specific storage engines (ever wondered why you can’t use your own storage engine for temporary tables created during query execution? this is why). So instead of having one place to construct TableShare and Table, we have create_tmp_table() doing its own thing.

I struck a blow against it today. Using the standard interfaces to create and open a table, I got the temporary tables created for INFORMATION_SCHEMA tables to use them and not create_tmp_table().

When I finish converting the other 3 or 4 places that use create_tmp_table, I’ll be able to cut out at least 700 odd lines of code, cut out the horrible special case in the HEAP engine for “create table on open() even if create() was never called” and have an easy time making the transition to fully using the table proto as the specification for table structure.

Return of the “Top 5 MySQL Wishlist” and looking at Drizzle

It’s coming up on a year since I started working full time on Drizzle. So, I got a bit reflective…

Have we done things that I (and others) really wanted done? Back in 2007, I wrote my top 5 wishlist for the MySQL Server.

I am not going to pretend I speak for the MySQL development team; I’m just trying to evaluate how Drizzle is doing against some wishlists that (to me) embodied some of the reasons we started Drizzle.

Please think of this as “database server wishlists” and comparing them against Drizzle….

My wishlist was:

5. Six-monthly release cycles

Done. Not only does Drizzle have milestone releases, but we’re also dropping tarballs every two weeks (currently for the bell milestone). We’re also doing a decent job of keeping trunk free of massive breakage.

4. Much more in depth automated testing

Done (and in progress). We have drizzle-automation running things all the time. Hudson (and buildbot) test across many platforms (what pushbuild did and more when inside MySQL) before code hits trunk. We also have regular performance benchmarks that we compare across versions, crash-me, the random query generator as well as checking that we don’t regress in code size (via sloccount).

3. Sane build system

(slightly distorting my original words). Well, we’re not quite at the ready for packaging in distributions stage, but a debug or non debug build is just -g and optimization level for the compiler, plugins are using autofoo to work otu if they can be built… so yeah, this is pretty sane.

We also are building with -Werror (and more!) which increases code quality no end.

So, mostly done.

3.5. (yes, i have a 3.5): Kill HPUX

Done.

2. Increased liberal use of asserts

An in-progress thing, but the better compiler warnings have won us a lot.

1. Pluggable data dictionary

Not only that, but done away with FRM totally. Really happy with this.

What about other peoples wishlist?

Kostya had one:

1. Remove excessive fuss.

i.e. “just do it”.

I think we’re doing really well with this for Drizzle. Plugins are pretty easy to get merged, and if your patches to the kernel are good, they’re also easy. Big changes can be harder, but in the end it has turned out well.

2. Open the development process.

Done. There is no internal wiki, there are no “committers” versus “non-committers”.. everything is judged on merit of the idea/code. Sometimes the most valuable contribution is somebody telling you their real world experience.

3. Get to a normal release schedule.

Done.

4. Establish productive relationship with the majority of users.

I think drizzle-discuss mailing list is doing quite well in this regard. Is quite active with discussion.

5. Find a way to do incompatible changes with minimal pain for users.

We’ll see how we go :)

Ronald also had a list:

1. Real time Query Monitoring

With gearman logging and my recent experimentation with using CPU performance counters I think we’ll end up somewhere rather awesome.

If you’re looking for MySQL monitoring though, the MySQl Enterprise query monitoring stuff looks pretty good to me.

2. Consistent Release Cycles

We’re doing pretty well so far!

3. INFORMATION_SCHEMA Extensions

We’ve inherited the architecture from MySQL 5.1 (and 6.0) of being able to pretty easily add INFORMATION_SCHEMA tables and improved it. It’s currently pretty easy to add them. We also have ongoing work having an INFORMATION_SCHEMA storage engine which means that you won’t have to have the I_S tables be materialized every time you query them.

4. Online table maintenance

All progress has been due to Storage Engine authors. With the data dictionary work though, this gets easier and saner to do.

5. Published benchmarks

We’re encouraging others who will be more objective :) Although we also do regular performance regression tests as part of our standard development process.

Dormando also had a list (complete with “there is no five”):

1) Logical separation of connections from threads

We have this in Drizzle through plugins. Interesting ones are pool_of_threads (fixed number of threads), multithread (thread per connection) and single_thread (one thread).

2) A more modular core

We’re very much doing well here. It’s a long process, but I’m quite impressed by our progress.

3) Better replication (better replication management/protocol?)

The work being done on Drizzle replication is really exciting. I love the fact that modularity is encouraged and the ability to replace any bit you want easily, as well as read the replication stream in about any language you want.

4) Better test suite

I may never be 100% happy with a test suite, but we’re doing good…

PeterZ‘s view is always interesting, and he had one too:

1. Be Pluggable

Check. (and also, of course, in progress)

2. Be Scalable

We’ve done a lot of work scaling on many CPUs and many connections. Really, 8 concurrent database connections just isn’t interesting. We ever run as part of our regression suite up to 2048 concurrent connections.

3. Be Distributed

With the new protocol work to have built in sharding, plugins for logging and replication via Gearman, we’re getting better.

4. Be Solid

This will be a test for us. I think we should end up pretty good because of a number of reasons:

    • clearer, easier to understand code without nasty side effects or really odd things (e.g. relying on a bool storing the value 2)
    • Better modularity (a module you don’t use and don’t load cannot screw you up)
    • Smaller core and removal of problematic features.
    • All the testing stuff I’ve previously mentioned.

So I hope we’re going to be okay here.

5. Don’t forget about the roots

The group of us working at Sun on Drizzle have said we want to focus on being awesome for large scale Web apps while enabling others to make Drizzle good for other things. I think this is the right approach to not forget our roots (and target users) while allowing it to be awesome for any use somebody wants to have of it.

From a Storage Engine author PoV, Paul had some insights while thinking about PBXT:

1. A generic engine test suite

We’re doing pretty well… the whole Drizzle test suite runs with InnoDB, and doesn’t require much change to get going with another transactional engine. The proof is in the Drizzle PBXT branch! But I also think we could do better and have a test suite more directed at each part of an engine (including error cases!).

2. Internal APIs

Paul mentions FRMs, which are gone :) In their place is a simple interface that engines can implement for ever increased functionality (i.e. they own their own metadata). We’re getting better in other places too.

3. Customizable table and column attributes

MariaDB has this now, and we have space in our table definition proto, but not at the parser level (yet).

4. Push-down restrict and join conditions.

Not yet, and not for a little while.

5. Custom data types

It’ll be great when we rework the type system even more so that this really is as easy as it should be – not only from a SQL level but also for adding new types as server modules.

Finally, Antony’s wish list:

1. Modular Architecture

Covered.

2. libmysys as a separate project

We’ve removed it where we can, and are using gnulib where we can. It very much improves the situation when you ditch weird-ass platforms and assume some level of POSIX.

3. New/modular parser

We’re getting close to a stage where you could load a different parser… not there, but relatively close. It would still be messy, but a lot better than even 6 months ago.

4. Unit tests for server components

With our move towards modularity, this is actually getting possible!

5. Aggregate Stored functions and External Stored Procedures

We don’t currently have either, we have decent thoughts though.

Antony also cheated and added a few more:

A new Recursive Descent parser

The required work to be able to replace the parser is in progress.

Abstract Syntax Trees

See above.. getting the pre-work done.

SCTP and/or link aggregation

We’ll see improvements around this with the new drizzle protocol.

Parsing within the client

We’ve had some very good discussion on the drizzle-discuss list. We’ll no doubt have something to help remove more of the cycles used in executing a query.

Integrated Federation

That’s the game plan :)

Elimination of FRM Files

I never get tired of saying this is done :)

Elimination of errmsg.sys

We’re now just using gettext, like every other free software project on the planet. Although I think we could take a few steps in making errors more easily parsable by code.

So how do we stack up?

I think we’re doing pretty well. There’s still a lot of work to get where we want to be, but it’s amazing how much progress we’ve made in the short time we’ve been around.

I also just realized I missed Jay’s list… but we’re doing pretty well there too.

Random Query Generator added to Drizzle Automation

As Lee announced, we have the Random Query Generator added to Drizzle Automation. It always amazed me that we were lacking such a fundamental testing tool for MySQL for all that time. I always found the similar (NDB API) tools for MySQL Cluster (NDB) to be really, really useful when wanting to make sure your code changes, well, worked.

I’m really looking forward into this being developed further as a cross-database testing tool and framework.

Also, upstream maintainers++ Good example of how even small FOSS projects should work.

How many CPU cycles does a SQL query take? (or pagefaults caused… or L2 cache misses… or CPU migrations…)

I like profilers. I use them when trying to make software (such as Drizzle) faster. Many profilers suck – and pretty much all of them are impossible to attach to a running system. Two notable exceptions are oprofile and dtrace (for Linux and Solaris respectively). The downside of oprofile is that it is non trivial to configure and get running and is pretty much all or nothing. Dtrace has the major disadvantage of that it is Solaris specific, so is only available to a minority of our users (and developers).

The new Linux Performance Events interface (perf_event) presents to userspace a nice abstraction of the hardware Performance Monitoring Unit inside the CPU. Typically these are processor specific (i.e. the one in a Core is different than the one in a Core 2) and can only be used by one thing at a time. The perf_events interface lets multiple applications/threads use the PMU (switching state at context switch as needed), even giving us ratios of how much time we got events for so we can do realistic estimates. It also provides some common defines to ask for things like CPU Cycles (the value you program into the PMU differs per CPU architecture for that, so an abstraction is rather welcome: it means we don’t need to have #ifdef __powerpc__ in our userspace code to support PowerPC, just a kernel that does).

Since perf_events gives us an interface to only get counts for our thread, we can map this onto connected sessions to Drizzle (and if we were using a pool_of_threads type scheduler in Drizzle, we’d need a bit of extra code to get things right, but with a thread per connection scheduler, we get it for free). A simple use of this could be to answer the question “How many CPU cycles does this SQL query take?” with the condition that you do not want how many CPU cycles were spent executing other things (e.g. the 1,000 other SQL queries currently being executed on your database server).

Many of you will now point out the RDTSC instruction for the x86 and ask why I’m just not using it. With RDTSC you’re only getting “how many cycles since reboot”. So using two values from the TSC and finding the difference only tells you how many cycles were between the two reads, not how many cycles were spent executing your thread/process. So the value of “cycles executed” gathered with RDTSC varies between a loaded and non-loaded system. With perf_events, it does not.

So…. after talking to paulus about perf_events, I decided to see how I could plug this into Drizzle to start getting interesting numbers out. Sure enough, a little bit of work later and I have my first proof of concept implementation over in lp:~stewart-flamingspork/drizzle/perf-events. That tree has a perf_udf() function that is like BENCHMARK except it returns the number of CPU cycles spent executing the expression N times. For example, how many CPU cycles does it take for the SQL expression MD5(‘pants’) to be evaluated?

drizzle> select perf_udf(1, MD5(‘pants’));
+—————————+
| perf_udf(1, MD5(‘pants’)) |
+—————————+
| 43540 |
+—————————+
1 row in set (0 sec)

So on my laptop, it’s about 40,000 cycles (over multiple runs I saw it vary between about 39,000 to 44,000). The really neat thing about using the perf_events interface is that if you run this on a different CPU architecture that has perf_events available in the kernel you’re currently running, it “just works”. e.g. if I ran this on a POWER5 box, I’d find out how many cycles it took there! No special code in Drizzle, yay!

The really neat next thing I tried was to run the same perf_udf query while running drizzleslap on the same database server, completely hammering it. I get exactly the same result (within normal variance)!

That isn’t the best part though. The best part is the other bits of information we can get out of the PMU:

  • PERF_COUNT_HW_CPU_CYCLES
  • PERF_COUNT_HW_INSTRUCTIONS
  • PERF_COUNT_HW_CACHE_REFERENCES
  • PERF_COUNT_HW_CACHE_MISSES
  • PERF_COUNT_HW_BRANCH_INSTRUCTIONS
  • PERF_COUNT_HW_BRANCH_MISSES
  • PERF_COUNT_HW_BUS_CYCLES

So the same way we can use the new ‘perf’ utility to see what a process is doing to the machine, we can enable people to do exactly the same thing with specific SQL queries (and through a bit of extra code, you could aggregate for users/applications). Not only that, but we could write a plugin for Drizzle to occasionally sample queries running through the database server and build up a quite complete profile of what’s going on over time.

We can also get software events out of perf_events such as:

  • PERF_COUNT_SW_CPU_CLOCK
  • PERF_COUNT_SW_TASK_CLOCK
  • PERF_COUNT_SW_PAGE_FAULTS
  • PERF_COUNT_SW_CONTEXT_SWITCHES
  • PERF_COUNT_SW_CPU_MIGRATIONS
  • PERF_COUNT_SW_PAGE_FAULTS_MIN
  • PERF_COUNT_SW_PAGE_FAULTS_MAJ

So for engines that do memory mapping of files on disk, we can find out which of your SQL queries are causing the page faults! We should also be able to find out if the operating system kernel is bouncing your execution threads around CPUs a lot.

The biggest possibility for awesomeness comes from the perf_event ability to get periodic call traces (you specify how often) including down into the kernel. This means that we could, on demand and only when we enable it, profile where time is being spent for that specific SQL query. The most important thing to note is that when this gathering is not enabled, the overhead is zero. Even when enabled for one query, this should have minimal impact on other things currently running (you’re going to use some extra CPU doing the profile, but we’re only profiling that one SQL query, not the 1000 other ones executing at the same time). So we could tell you that your query spends 9/10ths of its time in filesort() without ever adding any extra instructions to the filesort() codepath (and anywhere else where you may want to find out how much CPU time was spent). We could even tell you how much time was being spent in the kernel doing specific IO operations!

So I’m thinking that this could be used to make something pretty useful that could easily be used on production systems due to the zero overhead when not enabled and the small overhead when enabled.

table_raw_reader – reading the table proto from disk and examining everything

Ever wondered exactly *what* was in a Drizzle Table proto? Well, wonder no more. A while back this little utility called table_raw_reader hit the drizzle codebase. It’s a simple command line utility that takes a .proto file as an argument, reads it off disk and then prints out a text representation using the TextFormat class of the protobuf library.

An example:

stewart@willster:~/drizzle/jay-and-stewart-remove-pack_flag$ ./drizzled/message/table_raw_reader ./tests/var/master-data/test/t1.dfe
name: "t1"
engine {
  name: "InnoDB"
}
field {
  name: "id"
  type: INTEGER
  format: DefaultFormat
  options {
    length: 11
  }
}
field {
  name: "padding"
  type: VARCHAR
  format: DefaultFormat
  options {
    length: 800
  }
  string_options {
    length: 200
    collation_id: 45
    collation: "utf8_general_ci"
  }
}
indexes {
  name: "PRIMARY"
  is_primary: true
  is_unique: true
  type: UNKNOWN_INDEX
  key_length: 4
  index_part {
    fieldnr: 0
    compare_length: 4
    key_type: 8
  }
  options {
  }
}
type: STANDARD
options {
  collation: "utf8_general_ci"
  collation_id: 45
  pack_record: true
  row_type: ROW_TYPE_DEFAULT
}

This can help a lot when debugging, it already has helped me a lot. It’s left as an exercise for the reader to create a utility that lets you modify the table proto in this text based format.

pack_flag is dead!

(At least in Drizzle)

Brian merged my jay-and-stewart-remove-pack_flag branch and it’s now in trunk. Removing pack_flag was a task for the bell milestone and in true collaborative effort, it took more than one person to unravel its dark mysteries. Hats go off to Jay who bravely ripped more of it out with the aid of excellent Seattle coffee. I finished it off with removing the last remnants and fixing an outstanding bug (and at the same time making us look at the DECIMAL code and shuddering in fear of what it may mean).

With pack_flag gone, we can now do insane things like remove the last bits of TINYINT from the code, as previous to removing pack_flag, if you did that, things broke.

This also means we are one more step towards a table proto that is maintainable long term and doesn’t have incomprehensible magic.

SAGE-AU

Before heading to the US for Shizzlin’ the Drizzle (okay, it really is just a meeting) I headed to Surfers for SAGE-AU 2009. I previously spoke at a SAGE-AU in Canberra a couple of years ago (Wikipedia tells me that it must have been 2006). This time, I was there speaking on Drizzle. I could only make the conference for a limited amount of time due to having to get to the US for our Drizzle team meeting (I even skipped out on the dinner early as I had to be up very early for catching flights).

The conference was a bit smaller than I would have hoped (I wonder if economic downturn is part of this, or are people finding that they are getting enough benefit from just conferences such as linux.conf.au and/or OSDC).

The dinner was pretty cool though, at Dreamworld and got to ride tower of terror for an hour :) There were also tigers:

1024DSC_4475(apologies for tremendously high ISO and really bad lighting conditions). They can also climb trees:

1024DSC_4520

Drizzle Meeting Photos

I didn’t take many photos at the Drizzle Meeting, although I did take a couple at the end at the Hopvine (just down the road from Brian’s place).

A good read is Brian’s wrap up of the meeting.

But we have (courtesy of Brian):

and a couple I took at the Hopvine:

DSC_4574DSC_4575DSC_4577DSC_4581DSC_4582DSC_4583

The last one there of Lee looks almost scary… strange light, moving subject, all part of the fun :)

It was really good to get a number of people together and chat. In future, we’ll no doubt have larger gatherings that are really inclusive.