Progress in nofrm branch

“Ban FRM Now!” branch in Launchpad

Now we’re reading part of the table information out of the proto file on disk instead of the frm.

Not everything (yet) but a bit. Good first steps. Had to fix bugs along the way as well (and find weirdness in FRM file format…).

Progress is being made.

People on IRC as some measure of a project

#mysql isn’t too fair to include, as it’s really about users, not dev. #mysql-ndb is there because i heart ndb.

Oh, and linux.conf.au is there because it’s *awesome* and you should go.

Totally unscientific due to i’m only taking a sample once and whatever… but it kinda interests me…

magic number super fun happy time

umm…..

int Field_timestamp::store(double nr)
{
  int error= 0;
  if (nr < 0 || nr > 99991231235959.0)
  {
    set_datetime_warning(DRIZZLE_ERROR::WARN_LEVEL_WARN,
                         ER_WARN_DATA_OUT_OF_RANGE,
                         nr, DRIZZLE_TIMESTAMP_DATETIME);
    nr= 0;					// Avoid overflow on buff
    error= 1;
  }
  error|= Field_timestamp::store((int64_t) rint(nr), false);
  return error;
}

(likely the same in mysql as well… haven’t checked though). these date and time things scare me.

Drizzle progress… (testing can be good)

We’ve been working on fixing up the remaining test cases so that they run with Drizzle. We’ve found: bugs in Drizzle, bugs in MySQL (one that seems to have been there for at least 10 years), bugs in the tests, tests that no longer apply and occationally, something like this:


/* Please god, will someone rewrite this to be readable :( */
if (to->pack_length() == from->pack_length() &&
!(to->flags & UNSIGNED_FLAG && !(from->flags & UNSIGNED_FLAG)) &&
to->real_type() != DRIZZLE_TYPE_ENUM &&
(to->real_type() != DRIZZLE_TYPE_NEWDECIMAL || (to->field_length == from->field_length && (((Field_num*)to)->dec == ((Field_num*)from)->dec))) &&
from->charset() == to->charset() &&
to->table->s->db_low_byte_first == from->table->s->db_low_byte_first &&
(!(to->table->in_use->variables.sql_mode & (MODE_NO_ZERO_DATE | MODE_INVALID_DATES)) || (to->type() != DRIZZLE_TYPE_DATE && to->type() != DRIZZLE_TYPE_DATETIME)) &&
(from->real_type() != DRIZZLE_TYPE_VARCHAR || ((Field_varstring*)from)->length_bytes == ((Field_varstring*)to)->length_bytes))
{ // Identical fields
/* This may happen if one does 'UPDATE ... SET x=x' */
if (to->ptr != from->ptr)
memcpy(to->ptr,from->ptr,to->pack_length());
return 0;
}

and no, I haven’t really changed the formatting.

Speaker: MySQL Conference & Expo 2009 – O’Reilly Conferences, April 20 – 23, 2009, Santa Clara, CA

Yes, I’m speaking at  the upcoming MySQL Conference & Expo 2009 – on April 20 – 23 (and yes, it’s in Santa Clara again).

I have three sessions:

MySQL Cluster Tutorial: this time with 6.4 feature goodness. Very hands-on, very interactive.

MySQL Cluster on Windows:  (insert witty text about hating operating system freedom here)

Memory Management in MySQL and Drizzle: not magic setting of buffer variables, but memory allocation and management inside the server, a bunch of malloc() discussion and hopefully some interesting numbers.

fixing drizzle on linux sparc

Since I got fed up with Solaris the other day, the T1000 is running Debian. This means that “I’ll care about Drizzle on Linux Sparc”.

OMG were things broken in the most “trivial” ways.

A good quick intro to the issues is Memory alignment on SPARC, or a 300x speedup!

It all comes down to memory alignment.

So I pulled the MySQL 6.0 bzr tree onto the box to try it too… I haven’t seen so many compiler warnings in ages (okay… since I last built MySQL.. drizzle is warning-clean and it makes it hard to remember a time before that). I think it works purely by accident.

So I’m gradually getting all of Drizzle working on Linux Sparc (a few things fixed already).

It’d be great if the T1k had faster disk though (make -j30 is fun… but IO isn’t on a single 160GB 7200rpm disk)… anybody wanting to donate an SSD?

Stewart learns SQL oddities…

What would you expect the following to fail with?

CREATE TABLE t1 (a int, b int);
insert into t1 values (100,100);
CREATE TEMPORARY TABLE t2 (a int, b int, primary key (a));
BEGIN;
INSERT INTO t2 values(100,100);
CREATE TEMPORARY TABLE IF NOT EXISTS t2 (primary key (a)) select * from t1;

If you answered ER_DUP_ENTRY then you are correct.

From the manual:

Note

If you use IF NOT EXISTS in a CREATE TABLE ... SELECT statement, any rows selected by the SELECT part are inserted regardless of whether the table already exists.

Does anybody else find this behaviour “interesting”?

What constitutes an identifier for a table?

Well… there’s:

  • database
  • table name

(both of these are quite obvious).

But then you have:

  • temporary tables

Well… two types of temporary tables:

  • those created in the course of query execution, typically in /tmp/
  • those created during ALTER TABLE, typically in the database directory

You may have seen these “#sql-foo.frm” etc files around.

but you can also CREATE TABLE `#sql-foo` (you know, because that’s a good string to use in your app). In fact, you can (and shouldn’t) create a table with the exact same name as the temporary #sql table and use it in your app.

So really the primary key for a table is: string database name, string table name, bool is_tmp. Oh, and the /tmp/ temporary tables just to screw with your braiiiinn.

In drizzle, this is what the storage engine API is ending up looking like. It’s the responsibility of the engine to encode the temporary table names so that they don’t clash with the non-temporary table names. This could be done by putting them in a different set of files, marking them with som eflag, or just passing the is_tmp flag to build_table_filename (or make_table_key) and being done with it.

I think this ends up being the least-hacky approach and it does mean we can remove the bass ackwards if(table_name[0]==’#’) checks from some engines.

Temporary tables in any engine….

Well… nearly any engine.

I have a plan forming in my head to add some hooks to engines to help with creating temporary tables (the ones created while executing a query, not ones created during ALTER TABLE).

Currently, if you ALTER TABLE and we require a temporary table, it’s still database.table but we generate a table name that’s small, unique and begins with “#sql”.

I’ve changed some of the handler interface to accept two strings (database name, table name) instead of one “path” that may (or may not) end in “.FRM” and may (or may not) begin with “./” and may (or may not) use the “/” separator between database and table name (hint: on win32, it’s sometimes “”).

So ha_delete_table is now: ha_delete_table(db, table_name). Sanity!

(this has the downside of being a incompatible change that doesn’t break the build as there seems to be no way in C++ to say “derived classes cannot implement a function of this name”).

This is annoying for temporary tables however.

They don’t reside in a database… they’re off in la-la land (otherwise known as opt_drizzle_tmpdir).

Now… in MySQL 5.1 the ability for multiple temporary directories was added, and the MySQL server will cycle through them. The clusterfuck part of this was that a mutex was added… so every time the code goes to get the name of a (or for 99.99999999% of cases, the) temporary directory, it has to grab a mutex. With modern systems being able to have *many* simultaneous IO operations (e.g. create files) this is just dumb. It’s gone.

If you want to use multiple spindles for temporary tables it’s called RAID people (or buy an SSD, mkfs.ext2 it and just wipe it on reboot. simple.)

Anyway… back to temporary tables:

they’re in /tmp or something. So an API that’s foo(const char *db, const char *table_name) doesn’t work as well.

Also, for a bunch of engines, it’s good to know that you’re using a temporary table. You probably want to store these somewhere that never needs fsync() or anything like that. If the server goes away, these tables are *gone*. So optimise for that.

Heck, it may even be good to store temporary tables in your distributed engine (if that engine is memory based) as it’s often faster to access remote memory than local disk (although with SSD this is a whole different ball game… in fact, I’m not sure if it’s even still a ball game.. it’s possibly frisbee)

But currently there’s hardcoded a mi_create (MyISAM create) call in the server and in the 6.0-maria tree, an #ifdef around if it’s mi_create or maria_create.

So not any engine yet… but one can dream. Yes, I dare to dream.

(although why I was dreaming of a small board with a PowerPC 603, 8MB RAM and a mini DVI port the other night is quite beyond me)

I don’t read code comments

They are wrong.

Misleading at best.

Reworking parts of Drizzle (which came directly from MySQL) it can get painfully obvious. Things like “afaiu” and “???” appear in more than one place (that is if the comment isn’t just obviously wrong).

A comment merely states what one person thought the code did at some point in the past. It has no relation to what the code actually does now.

The Drizzle Snowman – PlanetMySQL fail

If you went “wtf” at The Drizzle Snowman – WIN! – Jay Pipes on PlanetMySQL suddenly ending at “create table”, you should click through and see the unicode character for a snowman.

In other fun, we’ve also created tables with the name of cloud symbol, umbrella symbol and umbrella with rain drops symbol. All of these seem rather appropriate for Drizzle.

What VERSION in INFORMATION_SCHEMA.TABLES means (hint: not what you think)

It’s the FRM file format version number.

It’s not the version of the table as one might expect (i.e. after CREATE it’s 1. Then, if you ALTER, it’s 2. Alter again 3 etc).

In Drizzle, we now return 0.

In future, I plan that Drizzle will allow the engine to say what version it is (where 0 is “dunno”).

This’ll be a good step towards being able to cope with multiple versions of a table in use at once (and making sense of this to the user).

drop table fail (on the road to removing the FRM)

So… in removing the FRM file in Drizzle, I found a bit of a nugget on how drop table works (currently in the MySQL server and now “did” in Drizzle).

If you DROP TABLE t1; this is what happens

  • open the .frm file
  • read first 10bytes (oh, and if you get EIO there, in a SELECT * FROM INFORMATION_SCHEMA.TABLES you’ll get “Error” instead of “Base Table”)
  • if (header[0] != (unsigned char) 254 || header[1] != 1 ||
    (header[2] != FRM_VER && header[2] != FRM_VER+1 &&
    (header[2] < FRM_VER+3 || header[2] > FRM_VER+4)))
    return true;
    Which means that you probably (well, should have) set your enum legacy_db_type to DB_TYPE_UNKNOWN in the caller of bool mysql_frm_type(Session *, char *path, enum legacy_db_type *dbt) otherwise you end up in some form of pain.
  • Else, *dbt= (enum legacy_db_type) (uint) *(header + 3);
    return true;                   // Is probably a .frm table

I do like the “probably”.

Oh, and on a “storage engine api” front, some places seem to expect handler::delete_table(const char* name) to return ENOENT on table not existing. In reality however:

  • int ha_heap::delete_table(const char *name)
    {
    -  int error= heap_delete_table(name);
    -  return error == ENOENT ? 0 : error;
    +  return heap_delete_table(name);
    }
  • InnoDB (note the behaviour of returning DB_TABLE_NOT_FOUND… which isn’t ENOENT)
    err = DB_TABLE_NOT_FOUND;
    ut_print_timestamp(stderr);

    fputs(“  InnoDB: Error: table “, stderr);
    ut_print_name(stderr, trx, TRUE, name);
    fputs(” does not exist in the InnoDB internaln”
    “InnoDB: data dictionary though MySQL is”
    ” trying to drop it.n”
    “InnoDB: Have you copied the .frm file”
    ” of the table to then”
    “InnoDB: MySQL database directory”
    ” from another database?n”
    “InnoDB: You can look for further help fromn”
    “InnoDB: http://dev.mysql.com/doc/refman/5.1/en/”
    “innodb-troubleshooting.htmln”,
    stderr);

  • and MyISAM would generate the error message itself, but that’s fixed with:
    -  if (my_delete_with_symlink(from, MYF(MY_WME)))
    +  if (my_delete_with_symlink(from, 0))
    return(my_errno);

and just to add to the fun, elsewhere in the code, a access(2) call on the frm file name is used to determine if the table exists or not.

The road to removing the FRM has all sorts of these weird-ass things along it. Kinda nice to be able to replace this with something better (and, hopefully – good).

But let me sum up with sql_table.cc:

“This code is wrong and will be removed, please do not copy.”

Goodbye FRM (or at least the steps to it)

Since before MySQL was MySQL, there has been the .FRM file. Of course, what it really wanted to be was “.form” -  a file that stored how to display a form on your (green) CRT. Jan blogged earlier in the year on this still being there, even in MySQL 5.1 (albeit not in any useful form).

So why do we want it to die?

Well… it’s not exactly very useful anymore.

There are a few things it’s used for….

If database/table.frm exists, the table exists (or, on Windows, you may also get databasetable.frm). This is tested in a few bits in the code by a call to access(2).

Most engines have their own data dictionaries (Innodb, PBXT, NDB, Falcon). Keeping these in sync with the FRMs can be problematic at best. This is especially true with distributed engines such as NDB.

The current solution is that on the SQL node that is creating the table, we create the FRM file, gzip it, and store it in the cluster. Then, other nodes, if they go “err… no local frm” first call ha_create_table_from_engine() which NDB will go and see if the table exists in the cluster. If so, it copies the FRM from the cluster to local disk and then the SQL server continues on its way with the standard way of opening a table (through the FRM). If you do DDL through the NDB-API (and not via SQL) then well… you get to keep both pieces.

As for if you crash during a table rename (with any engine with its own data dictionary.. e.g. InnoDB)… you again get to keep both pieces. (There is a bit of discussion on this over here)

Having FRM files also doesn’t especially lead to having multiple versions of table metadata co-existing in the server.

The fun part of reading a frm is open_binary_frm in table.cc. It reads in the frm into a TABLE_SHARE. If we only had some other way of filling out a TABLE_SHARE… one from the engine itself…

But what about any metadata that the engine data dictionary doesn’t have? For example, many server types may map to 1 engine type. An example of this is the GIS types in MySQL. For most engines, these just map to BLOBs. The engine itself has no knowledge about that, but we should fill out the table definition correctly…. so for this type of thing the engine may need to store some additional metadata. This is pretty easy for transactional engines: put it in a table! (although you then have your own problem about keeping this synchronised with any DDL). For engines that don’t have their own data dictionary, we can just provide a set of routines to store/read a frm type file (based on protobufs no doubt).

There also seems to be some entanglement with LOCK_open. Ahhh LOCK_open, the lock that nobody can possibly understand.

The tricky thing will be not rewriting every little bit from scratch all at once but rather go for the incremental bits….

Singing in the Rain

The past 3 years, 11 months I have worked full time on NDB (MySQL Cluster). It’s been awesome. Love the product and people. In the time I’ve been on the Cluster team, we’ve gone from a small group that would easily fit in the (old old) Stockholm office to one that requires large rooms to house us all in. It’s also been all about smart people (you have to be to work on a distributed database).

With MySQL Cluster 6.4 we’re getting in a bunch of features that have been on the “wide adoption” wishlist. With each release of NDB we’ve gained a wedge of applications that can be used with it – and 6.4 is no exception.

One of the biggest things that’s been worked on is multithreaded data nodes. If you check out Jonas‘ recent posts on 500,000 reads/sec and then a massive 700,000 reads/sec.

We’ve also got a Microsoft Windows port coming up, which a number of people have asked for over the years. Mostly I think this is a “I want to try it out” thing and not a deployment thing. (can any sane person deploy a HA app on Win32?)

I’ve used “NDB$INFO” as the ultimate answer to any problem for a while now. It’s been the much-wanted monitoring interface. We have a lot of info inside NDB that currently isn’t easily user accessible (or only accessible through the magic DUMP interface or by gathering up many events in the cluster log). We have the start of NDB$INFO in 6.4 now and Martin will be continuing my work in making it truly awesome.

So go and grab the 6.4 tree and have a look – things are looking sweet.

What next for me?

Well… a while ago I started hacking on Drizzle. Why? Well… I thought we could move the database server in a new direction and make it more modular, leaner, meaner query machine.

And now, I’m starting to work on it full time.

It’s exciting, and I’ll be blogging on the first TODO which is remove the FRM file and switch to a full discovery method shortly.

UPDATE: Yes, I’m working full time on Drizzle for Sun Microsystems (in the CTO group). While not spending work time on NDB anymore, no doubt you’ll still see fun-time patches.