Performance Schema: Show me the code

For such a long worked on feature, with such potential – I find the resistence to publishing a source tree curious (my comments on the topic have been moderated away but others have asked too). I could go and grep through the commits list searching for things (hint: look for mysql-6.0-perf), and then start to re-construct a tree; but I have more important things to do (yes, Brian, like FRM patches :)

Instead of re-inventing the wheel in Drizzle for a performance schema like interface, it’d be great to go with existing work. Evaluating the code as it’s coming along is important.

I also have concerns about the code itself:

  • Mutex instrumentation:
    • how expensive is this in the common case of not instrumenting.
    • Is this yet-another wrapper around pthread_mutex_t?
    • Could this be done in another, more generic way?
    • How can engine devs use this? Do you have to completely be integrated with the MySQL server way of doing things (and give up being able to be a sep piece of software) if you’re going to use this. If there’s a null header, what license is it under?
    • Can we use some of the code in (for example) the ndbd and then pass back mutex data from remote systems to the SQL server in a usable mannar? If we do this via NDB$INFO, could this show up in the performance schema?
    • Is the code clean or littered with ugly #ifdef?
    • Could this be done without having a special mutex type?
  • memory instrumentation
    • is it there at all?
    • how are they doing it?
    • MEM_ROOT based? what about new/delete malloc/free and various buffers and how this all ties to session etc?
      • In drizzle I’ve been seriously looking at talloc to help with instrumentation of memory usage.
  • IO instrumentation
    • mmap?
  • how are the performance schema tables being generated (constructing table shares, running CREATE TABLE manually by the user, CREATE TABLE in a helper thread, similar to I_S or some black magic?)
    • for NDB$INFO, we generate SQL CREATE TABLE statements and run them to generate a FRM file (I architected and wrote the base kernel code, Martin wrote the MySQL code) as other approaches were considered too hairy and likely to produce bugs.
    • For Drizzle, I’m completely removing the FRM files in favour of a discovery based interface that’ll let the engines be in charge of metadata. It’s all in protobufs, so a standard and easy to read data format. So I’m one of the few people on the planet that know about the related data structures. I would like our proto based code to work for performance schema as well.
  • Is the instrumentation always-in, or using d-trace style no-op funkyness?
  • Is it better to hook into d-trace (or similar) on platforms that have it instead of providing custom code?
  • Could performance be gained by using LD_PRELOAD or similar linker foo to only enable some instrumentation but allow others to be selectable on server startup?

So stuck waiting for some code to look at to answer the questions (random commits on the commits list doesn’t really do it like the finished product – i really hate commit lists).

So I can’t currently comment on the performance schema work much at all, nor if it’s useful to Drizzle. Hopefully will soon.

row id in MySQL and Drizzle (and the engines)

Some database engines have a fundamental concept of a row id. The row id is everything you need to know to locate a row. Common uses include secondary indexes (key is what’s indexed, value is rowid which you then use to lookup the row).

One design is the InnoDB method of having secondary indexes have the value in the index be the primary key of the row. Another is to store the rowid instead. Usually (or often… or sometimes…) rowid is much smaller than the pkey of the row. This is how innodb can answer some queries just out of the index. If it used rowid, it may involve more IO to answer the query. All this is irrelevant if you never want just the primary key from a secondary index.

Some engines are designed from the start to have rowid, others it’s added later (e.g. NDB).

Anyway… all beside the point. Did you know you can do this in mysql or drizzle:

drizzle> create table t1 (a int primary key);
Query OK, 0 rows affected (0.02 sec)

drizzle> insert into t1 (a) values (1);
Query OK, 1 row affected (0.01 sec)

drizzle> select _rowid from t1;
+--------+
| _rowid |
+--------+
|      1 |
+--------+
1 row in set (0.00 sec)

Is that the rowid from the engine? No (although at least NDB will let you select the real ROWID through a pseudo column through NDBAPI). Quoting from the MySQL manual:

If a PRIMARY KEY or UNIQUE index consists of only one column that has an integer type, you can also refer to the column as _rowid in SELECT statements.

Unfortunately, this isn’t correct… as this lovely bit of “oh my, what an excellent way to obfuscate my database app!” shows:

drizzle> create table t1 (a int primary key, b varchar(100));
Query OK, 0 rows affected (0.02 sec)

drizzle> insert into t1 values (1,”foo”);
Query OK, 1 row affected (0.00 sec)

drizzle> update t1 set b=”foobar!” where _rowid=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

drizzle> select * from t1;
+—+———+
| a | b |
+—+———+
| 1 | foobar! |
+—+———+
1 row in set (0.00 sec)

So how is this implemented? In two places: in sql_base.cc find_field_in_table() and in table.cc during FRM parsing (this is how I found it). We can even do things Oracle can’t (insert, update and delete):

drizzle> update t1 set a=2 where _rowid=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

drizzle> select * from t1;
+---+---------+
| a | b       |
+---+---------+
| 2 | foobar! |
+---+---------+
1 row in set (0.00 sec)

drizzle> update t1 set _rowid=3 where _rowid=2;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

drizzle> select * from t1;
+---+---------+
| a | b       |
+---+---------+
| 3 | foobar! |
+---+---------+
1 row in set (0.00 sec)

SQLite also has something similar (see the autoinc docs).

I do wonder if anybody uses this functionality. It’s even tested (I was quite shocked at this) in the auto_increment and heap_auto_increment tests.

Progress in nofrm branch

“Ban FRM Now!” branch in Launchpad

Now we’re reading part of the table information out of the proto file on disk instead of the frm.

Not everything (yet) but a bit. Good first steps. Had to fix bugs along the way as well (and find weirdness in FRM file format…).

Progress is being made.

People on IRC as some measure of a project

#mysql isn’t too fair to include, as it’s really about users, not dev. #mysql-ndb is there because i heart ndb.

Oh, and linux.conf.au is there because it’s *awesome* and you should go.

Totally unscientific due to i’m only taking a sample once and whatever… but it kinda interests me…

magic number super fun happy time

umm…..

int Field_timestamp::store(double nr)
{
  int error= 0;
  if (nr < 0 || nr > 99991231235959.0)
  {
    set_datetime_warning(DRIZZLE_ERROR::WARN_LEVEL_WARN,
                         ER_WARN_DATA_OUT_OF_RANGE,
                         nr, DRIZZLE_TIMESTAMP_DATETIME);
    nr= 0;					// Avoid overflow on buff
    error= 1;
  }
  error|= Field_timestamp::store((int64_t) rint(nr), false);
  return error;
}

(likely the same in mysql as well… haven’t checked though). these date and time things scare me.

Drizzle progress… (testing can be good)

We’ve been working on fixing up the remaining test cases so that they run with Drizzle. We’ve found: bugs in Drizzle, bugs in MySQL (one that seems to have been there for at least 10 years), bugs in the tests, tests that no longer apply and occationally, something like this:


/* Please god, will someone rewrite this to be readable :( */
if (to->pack_length() == from->pack_length() &&
!(to->flags & UNSIGNED_FLAG && !(from->flags & UNSIGNED_FLAG)) &&
to->real_type() != DRIZZLE_TYPE_ENUM &&
(to->real_type() != DRIZZLE_TYPE_NEWDECIMAL || (to->field_length == from->field_length && (((Field_num*)to)->dec == ((Field_num*)from)->dec))) &&
from->charset() == to->charset() &&
to->table->s->db_low_byte_first == from->table->s->db_low_byte_first &&
(!(to->table->in_use->variables.sql_mode & (MODE_NO_ZERO_DATE | MODE_INVALID_DATES)) || (to->type() != DRIZZLE_TYPE_DATE && to->type() != DRIZZLE_TYPE_DATETIME)) &&
(from->real_type() != DRIZZLE_TYPE_VARCHAR || ((Field_varstring*)from)->length_bytes == ((Field_varstring*)to)->length_bytes))
{ // Identical fields
/* This may happen if one does 'UPDATE ... SET x=x' */
if (to->ptr != from->ptr)
memcpy(to->ptr,from->ptr,to->pack_length());
return 0;
}

and no, I haven’t really changed the formatting.

Speaker: MySQL Conference & Expo 2009 – O’Reilly Conferences, April 20 – 23, 2009, Santa Clara, CA

Yes, I’m speaking at  the upcoming MySQL Conference & Expo 2009 – on April 20 – 23 (and yes, it’s in Santa Clara again).

I have three sessions:

MySQL Cluster Tutorial: this time with 6.4 feature goodness. Very hands-on, very interactive.

MySQL Cluster on Windows:  (insert witty text about hating operating system freedom here)

Memory Management in MySQL and Drizzle: not magic setting of buffer variables, but memory allocation and management inside the server, a bunch of malloc() discussion and hopefully some interesting numbers.

VirtualBox 2.1.0 (and OpenSolaris 2008.11)

Upgraded VirtualBox and booted up my OpenSolaris VM. VirtualBox 2.1.0 finally fixes the bug where if 127.0.0.1 was in resolv.conf on the host – no DNS for you in the guest (unless in the guest you were running a DNS server).

Haven’t tried it yet… but OpenGL Accelleration makes at least a checkbox appearance in VirtualBox 2.1…. so that could be rather awesome.

Going a lot better with OpenSolaris 2008.11 than previous releases.. It looks like it might be quite easy to get to the stage of building Drizzle on it.

Just figured out how to change to Dvorak! Yay, I can type again! (Go to Input Methods preference panel and add US/DVORAK as a language, move it to the top, and enable the input method application and do it that way).

Currently installing sunstudioexpress. Why not gcc? I’m pretty sure the version in OpenSolaris is still ancient (so won’t build drizzle) and Sun Studio does produce different warnings (which indicate real bugs in a bunch of cases).

Things I wish were packaged: latest protobufs, latest bzr, gcc 4.x

fixing drizzle on linux sparc

Since I got fed up with Solaris the other day, the T1000 is running Debian. This means that “I’ll care about Drizzle on Linux Sparc”.

OMG were things broken in the most “trivial” ways.

A good quick intro to the issues is Memory alignment on SPARC, or a 300x speedup!

It all comes down to memory alignment.

So I pulled the MySQL 6.0 bzr tree onto the box to try it too… I haven’t seen so many compiler warnings in ages (okay… since I last built MySQL.. drizzle is warning-clean and it makes it hard to remember a time before that). I think it works purely by accident.

So I’m gradually getting all of Drizzle working on Linux Sparc (a few things fixed already).

It’d be great if the T1k had faster disk though (make -j30 is fun… but IO isn’t on a single 160GB 7200rpm disk)… anybody wanting to donate an SSD?

Debian about 1234533 times easier to install than Solaris

After what many hours trying to netboot the T1000 to install Solaris Express, I wondered “how hard is it for debian?”.

Easy. get the sparc64 boot.img, put it on TFTP server, add “filename “boot.img”;” or similar to dhcp, boot the T1000 from the service console something like this “bootmode bootscript=”boot net:dhcp”\n restart -c” and install away!

As for Solaris?

Well… dhcpd.conf:

option space SUNW;
option SUNW.root-mount-options code 1 = text;
option SUNW.root-server-ip-address code 2 = ip-address;
option SUNW.root-server-hostname code 3 = text;
option SUNW.root-path-name code 4 = text;
option SUNW.swap-server-ip-address code 5 = ip-address;
option SUNW.swap-file-path code 6 = text;
option SUNW.boot-file-path code 7 = text;
option SUNW.posix-timezone-string code 8 = text;
option SUNW.boot-read-size code 9 = unsigned integer 16;
option SUNW.install-server-ip-address code 10 = ip-address;
option SUNW.install-server-hostname code 11 = text;
option SUNW.install-path code 12 = text;
option SUNW.sysid-config-file-server code 13 = text;
option SUNW.JumpStart-server code 14 = text;
option SUNW.terminal-name code 15 = text;
option SUNW.SbootURI code 16 = text;

host hurricane {
hardware ethernet 0:14:4f:1e:28:e;
fixed-address 192.168.1.19;
option host-name “hurricane”;
filename “sparc64-etch-boot.img”;
#       filename “sol-nv-b103-sparc”;
#       option SUNW.install-server-ip-address 192.168.1.1;
#       option SUNW.install-server-hostname “saturn”;
#       option SUNW.install-path “/mnt/sol-nv-b103-sparc/”;
#       option SUNW.root-server-ip-address 192.168.1.1;
#       option SUNW.root-server-hostname “saturn”;
#       option SUNW.root-path-name “/mnt/sol-nv-b103-sparc/Solaris_11/Tools/Boot”;

}

(obviously changing the comments around) and having the Solaris Express DVD mounted and NFS exported…. it *still* doesn’t work. It goes “unable to mount filesystem” with no further hints (even when tcpdumping the network).

Documentation for doing the simple thing of using $dhcp_server and $nfs_server to network boot a Solaris install on a Sparc box is *COMPLETELY* missing.

Now, I’m a smart guy (and if you don’t believe that, at least believe I’m not stupid). If I can’t get it to boot the installer, what chance do others have?

I’ll try OpenSolaris out when it’s on SPARC (and please oh please oh please just have an easy way to net boot the installer using a linux host). Please take the debian way (just a single file on tftp).

So now it’s goodbye Solaris (I’m not going to have something I can’t  re-install, upgrade or security patch) and it’s hello Debian (and sanity).

Yes, this does mean I’ll care about Drizzle on Linux Sparc.

What constitutes an identifier for a table?

Well… there’s:

  • database
  • table name

(both of these are quite obvious).

But then you have:

  • temporary tables

Well… two types of temporary tables:

  • those created in the course of query execution, typically in /tmp/
  • those created during ALTER TABLE, typically in the database directory

You may have seen these “#sql-foo.frm” etc files around.

but you can also CREATE TABLE `#sql-foo` (you know, because that’s a good string to use in your app). In fact, you can (and shouldn’t) create a table with the exact same name as the temporary #sql table and use it in your app.

So really the primary key for a table is: string database name, string table name, bool is_tmp. Oh, and the /tmp/ temporary tables just to screw with your braiiiinn.

In drizzle, this is what the storage engine API is ending up looking like. It’s the responsibility of the engine to encode the temporary table names so that they don’t clash with the non-temporary table names. This could be done by putting them in a different set of files, marking them with som eflag, or just passing the is_tmp flag to build_table_filename (or make_table_key) and being done with it.

I think this ends up being the least-hacky approach and it does mean we can remove the bass ackwards if(table_name[0]==’#’) checks from some engines.

Temporary tables in any engine….

Well… nearly any engine.

I have a plan forming in my head to add some hooks to engines to help with creating temporary tables (the ones created while executing a query, not ones created during ALTER TABLE).

Currently, if you ALTER TABLE and we require a temporary table, it’s still database.table but we generate a table name that’s small, unique and begins with “#sql”.

I’ve changed some of the handler interface to accept two strings (database name, table name) instead of one “path” that may (or may not) end in “.FRM” and may (or may not) begin with “./” and may (or may not) use the “/” separator between database and table name (hint: on win32, it’s sometimes “”).

So ha_delete_table is now: ha_delete_table(db, table_name). Sanity!

(this has the downside of being a incompatible change that doesn’t break the build as there seems to be no way in C++ to say “derived classes cannot implement a function of this name”).

This is annoying for temporary tables however.

They don’t reside in a database… they’re off in la-la land (otherwise known as opt_drizzle_tmpdir).

Now… in MySQL 5.1 the ability for multiple temporary directories was added, and the MySQL server will cycle through them. The clusterfuck part of this was that a mutex was added… so every time the code goes to get the name of a (or for 99.99999999% of cases, the) temporary directory, it has to grab a mutex. With modern systems being able to have *many* simultaneous IO operations (e.g. create files) this is just dumb. It’s gone.

If you want to use multiple spindles for temporary tables it’s called RAID people (or buy an SSD, mkfs.ext2 it and just wipe it on reboot. simple.)

Anyway… back to temporary tables:

they’re in /tmp or something. So an API that’s foo(const char *db, const char *table_name) doesn’t work as well.

Also, for a bunch of engines, it’s good to know that you’re using a temporary table. You probably want to store these somewhere that never needs fsync() or anything like that. If the server goes away, these tables are *gone*. So optimise for that.

Heck, it may even be good to store temporary tables in your distributed engine (if that engine is memory based) as it’s often faster to access remote memory than local disk (although with SSD this is a whole different ball game… in fact, I’m not sure if it’s even still a ball game.. it’s possibly frisbee)

But currently there’s hardcoded a mi_create (MyISAM create) call in the server and in the 6.0-maria tree, an #ifdef around if it’s mi_create or maria_create.

So not any engine yet… but one can dream. Yes, I dare to dream.

(although why I was dreaming of a small board with a PowerPC 603, 8MB RAM and a mini DVI port the other night is quite beyond me)

I don’t read code comments

They are wrong.

Misleading at best.

Reworking parts of Drizzle (which came directly from MySQL) it can get painfully obvious. Things like “afaiu” and “???” appear in more than one place (that is if the comment isn’t just obviously wrong).

A comment merely states what one person thought the code did at some point in the past. It has no relation to what the code actually does now.

Hold me, I’m scared….

From ha_myisam.cc:

/*
TODO: switch from protocol to push_warning here. The main reason we didn’t
it yet is parallel repair. Due to following trace:
mi_check_print_msg/push_warning/sql_alloc/my_pthread_getspecific_ptr.

Also we likely need to lock mutex here (in both cases with protocol and
push_warning).
*/
protocol->prepare_for_resend();
protocol->store(name, length, system_charset_info);
protocol->store(param->op_name, system_charset_info);
protocol->store(msg_type, system_charset_info);
protocol->store(msgbuf, msg_length, system_charset_info);
if (protocol->write())
sql_print_error(“Failed on my_net_write, writing to stderr instead: %s\n”,
msgbuf);
return;

Hopefully this will serve as a good TODO list to go and fix at some point.