The FRM file format

It’s fortunate that I’m watching Veronica Mars again with a mate; a more-than-you-think amount of detective work is required to understand the relationship (and format) of the TABLE_SHARE, the FRM file and HA_CREATE_INFO. Oh, also you’ll need drizzled/base.h and drizzled/structs.h and drizzled/table_share.h is also a good one to have open.

The FRM file is really a FoRM file from UNIREG (see copies of really old mysql docs around the place or even better, the links off Sheeri‘s blog post). Also, Jan has some thoughts on FRM too and Thava has a scary frmdump php script.

I have to agree completely with Jan:

  • “the internals document is missing all the interesting parts”
    I’ve just read the source. Everything in the internals doc is easily gotten from the source, so when I finally did take a close look it was “I know all this”. I can’t fault the docs team here at all – I’d place this 3rd from the bottom in priorities. In fact, it’s better to fix it than to document it.
  • ” get the hands dirty and get into the code … it got really dirty”
    Oh yeah – and it’s only gotten worse with things added to it.

It contains interesting nuggets like unireg_check (or unireg_type, depending on where you read) that does:

 enum utype  { NONE,DATE,SHIELD,NOEMPTY,CASEUP,PNR,BGNR,PGNR,YES,NO,REL,
                CHECK,EMPTY,UNKNOWN_FIELD,CASEDN,NEXT_NUMBER,INTERVAL_FIELD,
                BIT_FIELD, TIMESTAMP_OLD_FIELD, CAPITALIZE, BLOB_FIELD,
                TIMESTAMP_DN_FIELD, TIMESTAMP_UN_FIELD, TIMESTAMP_DNUN_FIELD};

But really only the timestamp things… which should be default magic, but it’s somewhere tied in (Jay had thoughts last time I spoke to him… hopefully going away soon). A bunch of these aren’t ever used and are just relics from UNIREG. In fact… I went and removed what wasn’t needed and just ended up with:

  enum utype  { NONE,
                NEXT_NUMBER,
                TIMESTAMP_OLD_FIELD,
                TIMESTAMP_DN_FIELD, TIMESTAMP_UN_FIELD, TIMESTAMP_DNUN_FIELD};

Which does seem a bit nicer. The  fact that TIMESTAMP_OLD_FIELD is used as in interim value is, wel, scary. At least with a smaller set of possiblities it will be easier to convert into the proto format.

A hint of a brighter future is in the comment there:

/*
    We use three additional unireg types for TIMESTAMP to overcome limitation
    of current binary format of .frm file. We'd like to be able to support
    NOW() as default and on update value for such fields but unable to hold
    this info anywhere except unireg_check field. This issue will be resolved
    in more clean way with transition to new text based .frm format.
    See also comment for Field_timestamp::Field_timestamp().
  */

Hrrm… a text based FRM? That would be much nicer to read the code for. Unfortunately, it doesn’t really exist. Some FRMs are text in MySQL, but not ones to do with tables. You can look at the FRM for a VIEW in a text editor and see the SQL quite easily (the file format is text).

So I can’t go look at any nice text based format code – it’s all uint2korr() and friends. Yes folks, this is about the only place left in the code with function names in Swedish. What does korr mean? “accurate, correct, correctly”. If you look at korr.h, you’ll see that it’s just for storing in machine independent format: low byte first.

My favourite korr functions:

  • uint3korr
    which reads 4 bytes, so remember to alloc it, initialise it or Valgrind will make you its bitch.
  • uint5korr
    err… 5 bytes of course
  • uint6korr
    6 bytes (getting the pattern now)
  • uint7korr
    which doesn’t actually exist. Nobody loves Seven – George Costanza was wrong.

It also (as Jan showed) does the whole layout on a 80 column terminal for you! This functionality is going, going gone in Drizzle and won’t be coming back.

There’s also an “empty record on start of formfile” (see make_empty_rec in unireg.cc). This bit is going to cause me some pain relatively soon. Not so much for writing something like it out (default values can be easily put in the proto) but by then constructing it on open (with some careful footing around the issue of the egg coming before the chicken).

Incidently, when discussing with Daniel Stone about this (and explaining all the weirdness) it did cause him to exclaim “omg, it’s XKB!” – so that probably helps the X hackers in the room to relate.

The biggest test in moving from FRM to proto is to only rewrite this part of the code – the TABLE_SHARE, field, Create_foo etc have sooo many bits I want to change/fix. Going down the rat-hole into an endless cycle of fixing is always a possibility. Sometimes (like with unireg_type) the cleanup lets me really discover what the code is doing, so that’s being done (but will go away “soon”).

Performance Schema: Show me the code

For such a long worked on feature, with such potential – I find the resistence to publishing a source tree curious (my comments on the topic have been moderated away but others have asked too). I could go and grep through the commits list searching for things (hint: look for mysql-6.0-perf), and then start to re-construct a tree; but I have more important things to do (yes, Brian, like FRM patches :)

Instead of re-inventing the wheel in Drizzle for a performance schema like interface, it’d be great to go with existing work. Evaluating the code as it’s coming along is important.

I also have concerns about the code itself:

  • Mutex instrumentation:
    • how expensive is this in the common case of not instrumenting.
    • Is this yet-another wrapper around pthread_mutex_t?
    • Could this be done in another, more generic way?
    • How can engine devs use this? Do you have to completely be integrated with the MySQL server way of doing things (and give up being able to be a sep piece of software) if you’re going to use this. If there’s a null header, what license is it under?
    • Can we use some of the code in (for example) the ndbd and then pass back mutex data from remote systems to the SQL server in a usable mannar? If we do this via NDB$INFO, could this show up in the performance schema?
    • Is the code clean or littered with ugly #ifdef?
    • Could this be done without having a special mutex type?
  • memory instrumentation
    • is it there at all?
    • how are they doing it?
    • MEM_ROOT based? what about new/delete malloc/free and various buffers and how this all ties to session etc?
      • In drizzle I’ve been seriously looking at talloc to help with instrumentation of memory usage.
  • IO instrumentation
    • mmap?
  • how are the performance schema tables being generated (constructing table shares, running CREATE TABLE manually by the user, CREATE TABLE in a helper thread, similar to I_S or some black magic?)
    • for NDB$INFO, we generate SQL CREATE TABLE statements and run them to generate a FRM file (I architected and wrote the base kernel code, Martin wrote the MySQL code) as other approaches were considered too hairy and likely to produce bugs.
    • For Drizzle, I’m completely removing the FRM files in favour of a discovery based interface that’ll let the engines be in charge of metadata. It’s all in protobufs, so a standard and easy to read data format. So I’m one of the few people on the planet that know about the related data structures. I would like our proto based code to work for performance schema as well.
  • Is the instrumentation always-in, or using d-trace style no-op funkyness?
  • Is it better to hook into d-trace (or similar) on platforms that have it instead of providing custom code?
  • Could performance be gained by using LD_PRELOAD or similar linker foo to only enable some instrumentation but allow others to be selectable on server startup?

So stuck waiting for some code to look at to answer the questions (random commits on the commits list doesn’t really do it like the finished product – i really hate commit lists).

So I can’t currently comment on the performance schema work much at all, nor if it’s useful to Drizzle. Hopefully will soon.

row id in MySQL and Drizzle (and the engines)

Some database engines have a fundamental concept of a row id. The row id is everything you need to know to locate a row. Common uses include secondary indexes (key is what’s indexed, value is rowid which you then use to lookup the row).

One design is the InnoDB method of having secondary indexes have the value in the index be the primary key of the row. Another is to store the rowid instead. Usually (or often… or sometimes…) rowid is much smaller than the pkey of the row. This is how innodb can answer some queries just out of the index. If it used rowid, it may involve more IO to answer the query. All this is irrelevant if you never want just the primary key from a secondary index.

Some engines are designed from the start to have rowid, others it’s added later (e.g. NDB).

Anyway… all beside the point. Did you know you can do this in mysql or drizzle:

drizzle> create table t1 (a int primary key);
Query OK, 0 rows affected (0.02 sec)

drizzle> insert into t1 (a) values (1);
Query OK, 1 row affected (0.01 sec)

drizzle> select _rowid from t1;
+--------+
| _rowid |
+--------+
|      1 |
+--------+
1 row in set (0.00 sec)

Is that the rowid from the engine? No (although at least NDB will let you select the real ROWID through a pseudo column through NDBAPI). Quoting from the MySQL manual:

If a PRIMARY KEY or UNIQUE index consists of only one column that has an integer type, you can also refer to the column as _rowid in SELECT statements.

Unfortunately, this isn’t correct… as this lovely bit of “oh my, what an excellent way to obfuscate my database app!” shows:

drizzle> create table t1 (a int primary key, b varchar(100));
Query OK, 0 rows affected (0.02 sec)

drizzle> insert into t1 values (1,”foo”);
Query OK, 1 row affected (0.00 sec)

drizzle> update t1 set b=”foobar!” where _rowid=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

drizzle> select * from t1;
+—+———+
| a | b |
+—+———+
| 1 | foobar! |
+—+———+
1 row in set (0.00 sec)

So how is this implemented? In two places: in sql_base.cc find_field_in_table() and in table.cc during FRM parsing (this is how I found it). We can even do things Oracle can’t (insert, update and delete):

drizzle> update t1 set a=2 where _rowid=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

drizzle> select * from t1;
+---+---------+
| a | b       |
+---+---------+
| 2 | foobar! |
+---+---------+
1 row in set (0.00 sec)

drizzle> update t1 set _rowid=3 where _rowid=2;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

drizzle> select * from t1;
+---+---------+
| a | b       |
+---+---------+
| 3 | foobar! |
+---+---------+
1 row in set (0.00 sec)

SQLite also has something similar (see the autoinc docs).

I do wonder if anybody uses this functionality. It’s even tested (I was quite shocked at this) in the auto_increment and heap_auto_increment tests.

People on IRC as some measure of a project

#mysql isn’t too fair to include, as it’s really about users, not dev. #mysql-ndb is there because i heart ndb.

Oh, and linux.conf.au is there because it’s *awesome* and you should go.

Totally unscientific due to i’m only taking a sample once and whatever… but it kinda interests me…

magic number super fun happy time

umm…..

int Field_timestamp::store(double nr)
{
  int error= 0;
  if (nr < 0 || nr > 99991231235959.0)
  {
    set_datetime_warning(DRIZZLE_ERROR::WARN_LEVEL_WARN,
                         ER_WARN_DATA_OUT_OF_RANGE,
                         nr, DRIZZLE_TIMESTAMP_DATETIME);
    nr= 0;					// Avoid overflow on buff
    error= 1;
  }
  error|= Field_timestamp::store((int64_t) rint(nr), false);
  return error;
}

(likely the same in mysql as well… haven’t checked though). these date and time things scare me.

Speaker: MySQL Conference & Expo 2009 – O’Reilly Conferences, April 20 – 23, 2009, Santa Clara, CA

Yes, I’m speaking at  the upcoming MySQL Conference & Expo 2009 – on April 20 – 23 (and yes, it’s in Santa Clara again).

I have three sessions:

MySQL Cluster Tutorial: this time with 6.4 feature goodness. Very hands-on, very interactive.

MySQL Cluster on Windows:  (insert witty text about hating operating system freedom here)

Memory Management in MySQL and Drizzle: not magic setting of buffer variables, but memory allocation and management inside the server, a bunch of malloc() discussion and hopefully some interesting numbers.

fixing drizzle on linux sparc

Since I got fed up with Solaris the other day, the T1000 is running Debian. This means that “I’ll care about Drizzle on Linux Sparc”.

OMG were things broken in the most “trivial” ways.

A good quick intro to the issues is Memory alignment on SPARC, or a 300x speedup!

It all comes down to memory alignment.

So I pulled the MySQL 6.0 bzr tree onto the box to try it too… I haven’t seen so many compiler warnings in ages (okay… since I last built MySQL.. drizzle is warning-clean and it makes it hard to remember a time before that). I think it works purely by accident.

So I’m gradually getting all of Drizzle working on Linux Sparc (a few things fixed already).

It’d be great if the T1k had faster disk though (make -j30 is fun… but IO isn’t on a single 160GB 7200rpm disk)… anybody wanting to donate an SSD?

Stewart learns SQL oddities…

What would you expect the following to fail with?

CREATE TABLE t1 (a int, b int);
insert into t1 values (100,100);
CREATE TEMPORARY TABLE t2 (a int, b int, primary key (a));
BEGIN;
INSERT INTO t2 values(100,100);
CREATE TEMPORARY TABLE IF NOT EXISTS t2 (primary key (a)) select * from t1;

If you answered ER_DUP_ENTRY then you are correct.

From the manual:

Note

If you use IF NOT EXISTS in a CREATE TABLE ... SELECT statement, any rows selected by the SELECT part are inserted regardless of whether the table already exists.

Does anybody else find this behaviour “interesting”?

Temporary tables in any engine….

Well… nearly any engine.

I have a plan forming in my head to add some hooks to engines to help with creating temporary tables (the ones created while executing a query, not ones created during ALTER TABLE).

Currently, if you ALTER TABLE and we require a temporary table, it’s still database.table but we generate a table name that’s small, unique and begins with “#sql”.

I’ve changed some of the handler interface to accept two strings (database name, table name) instead of one “path” that may (or may not) end in “.FRM” and may (or may not) begin with “./” and may (or may not) use the “/” separator between database and table name (hint: on win32, it’s sometimes “”).

So ha_delete_table is now: ha_delete_table(db, table_name). Sanity!

(this has the downside of being a incompatible change that doesn’t break the build as there seems to be no way in C++ to say “derived classes cannot implement a function of this name”).

This is annoying for temporary tables however.

They don’t reside in a database… they’re off in la-la land (otherwise known as opt_drizzle_tmpdir).

Now… in MySQL 5.1 the ability for multiple temporary directories was added, and the MySQL server will cycle through them. The clusterfuck part of this was that a mutex was added… so every time the code goes to get the name of a (or for 99.99999999% of cases, the) temporary directory, it has to grab a mutex. With modern systems being able to have *many* simultaneous IO operations (e.g. create files) this is just dumb. It’s gone.

If you want to use multiple spindles for temporary tables it’s called RAID people (or buy an SSD, mkfs.ext2 it and just wipe it on reboot. simple.)

Anyway… back to temporary tables:

they’re in /tmp or something. So an API that’s foo(const char *db, const char *table_name) doesn’t work as well.

Also, for a bunch of engines, it’s good to know that you’re using a temporary table. You probably want to store these somewhere that never needs fsync() or anything like that. If the server goes away, these tables are *gone*. So optimise for that.

Heck, it may even be good to store temporary tables in your distributed engine (if that engine is memory based) as it’s often faster to access remote memory than local disk (although with SSD this is a whole different ball game… in fact, I’m not sure if it’s even still a ball game.. it’s possibly frisbee)

But currently there’s hardcoded a mi_create (MyISAM create) call in the server and in the 6.0-maria tree, an #ifdef around if it’s mi_create or maria_create.

So not any engine yet… but one can dream. Yes, I dare to dream.

(although why I was dreaming of a small board with a PowerPC 603, 8MB RAM and a mini DVI port the other night is quite beyond me)

I don’t read code comments

They are wrong.

Misleading at best.

Reworking parts of Drizzle (which came directly from MySQL) it can get painfully obvious. Things like “afaiu” and “???” appear in more than one place (that is if the comment isn’t just obviously wrong).

A comment merely states what one person thought the code did at some point in the past. It has no relation to what the code actually does now.

Hold me, I’m scared….

From ha_myisam.cc:

/*
TODO: switch from protocol to push_warning here. The main reason we didn’t
it yet is parallel repair. Due to following trace:
mi_check_print_msg/push_warning/sql_alloc/my_pthread_getspecific_ptr.

Also we likely need to lock mutex here (in both cases with protocol and
push_warning).
*/
protocol->prepare_for_resend();
protocol->store(name, length, system_charset_info);
protocol->store(param->op_name, system_charset_info);
protocol->store(msg_type, system_charset_info);
protocol->store(msgbuf, msg_length, system_charset_info);
if (protocol->write())
sql_print_error(“Failed on my_net_write, writing to stderr instead: %s\n”,
msgbuf);
return;

Hopefully this will serve as a good TODO list to go and fix at some point.

What VERSION in INFORMATION_SCHEMA.TABLES means (hint: not what you think)

It’s the FRM file format version number.

It’s not the version of the table as one might expect (i.e. after CREATE it’s 1. Then, if you ALTER, it’s 2. Alter again 3 etc).

In Drizzle, we now return 0.

In future, I plan that Drizzle will allow the engine to say what version it is (where 0 is “dunno”).

This’ll be a good step towards being able to cope with multiple versions of a table in use at once (and making sense of this to the user).

drop table fail (on the road to removing the FRM)

So… in removing the FRM file in Drizzle, I found a bit of a nugget on how drop table works (currently in the MySQL server and now “did” in Drizzle).

If you DROP TABLE t1; this is what happens

  • open the .frm file
  • read first 10bytes (oh, and if you get EIO there, in a SELECT * FROM INFORMATION_SCHEMA.TABLES you’ll get “Error” instead of “Base Table”)
  • if (header[0] != (unsigned char) 254 || header[1] != 1 ||
    (header[2] != FRM_VER && header[2] != FRM_VER+1 &&
    (header[2] < FRM_VER+3 || header[2] > FRM_VER+4)))
    return true;
    Which means that you probably (well, should have) set your enum legacy_db_type to DB_TYPE_UNKNOWN in the caller of bool mysql_frm_type(Session *, char *path, enum legacy_db_type *dbt) otherwise you end up in some form of pain.
  • Else, *dbt= (enum legacy_db_type) (uint) *(header + 3);
    return true;                   // Is probably a .frm table

I do like the “probably”.

Oh, and on a “storage engine api” front, some places seem to expect handler::delete_table(const char* name) to return ENOENT on table not existing. In reality however:

  • int ha_heap::delete_table(const char *name)
    {
    -  int error= heap_delete_table(name);
    -  return error == ENOENT ? 0 : error;
    +  return heap_delete_table(name);
    }
  • InnoDB (note the behaviour of returning DB_TABLE_NOT_FOUND… which isn’t ENOENT)
    err = DB_TABLE_NOT_FOUND;
    ut_print_timestamp(stderr);

    fputs(“  InnoDB: Error: table “, stderr);
    ut_print_name(stderr, trx, TRUE, name);
    fputs(” does not exist in the InnoDB internaln”
    “InnoDB: data dictionary though MySQL is”
    ” trying to drop it.n”
    “InnoDB: Have you copied the .frm file”
    ” of the table to then”
    “InnoDB: MySQL database directory”
    ” from another database?n”
    “InnoDB: You can look for further help fromn”
    “InnoDB: http://dev.mysql.com/doc/refman/5.1/en/”
    “innodb-troubleshooting.htmln”,
    stderr);

  • and MyISAM would generate the error message itself, but that’s fixed with:
    -  if (my_delete_with_symlink(from, MYF(MY_WME)))
    +  if (my_delete_with_symlink(from, 0))
    return(my_errno);

and just to add to the fun, elsewhere in the code, a access(2) call on the frm file name is used to determine if the table exists or not.

The road to removing the FRM has all sorts of these weird-ass things along it. Kinda nice to be able to replace this with something better (and, hopefully – good).

But let me sum up with sql_table.cc:

“This code is wrong and will be removed, please do not copy.”

Scaling MySQL on a 256-way T5440 server using Solaris ZFS and Java 1.7

Scaling MySQL on a 256-way T5440 server using Solaris ZFS and Java 1.7

*cough*

(and then wipe coffee off the computer)

of course the real aim should be to scale with one instance on the machine as scaling with multiple instances on the one machine isn’t scaling at all – it’s scale out, but with more problems (now when one machine goes down, so do 1110202434 database instances).

Technology predictions

In 2 years (ish):

  • the majority of consumer bought machines (which will be laptops) will have SSD and not rotational media
  • At the same time, servers with larger storage requirements will use disk as we once used tape.
  • At least one Linux distributoin will be shipping with btrfs as default
  • OpenSolaris will be looking interesting and not annoying to try out (a lot more “just work” and easy to get going).
  • Unless Sun puts ZFS under a GPL compatible license so it can make it into the Linux kernel, it will become nothing more than a Solaris oddity as other file systems will have caught up (and possibly surpassed).
  • There will be somebody developing a a MySQL compatible release based off Drizzle
  • Somebody will have ported Drizzle back to Microsoft Windows… possibly Microsoft.
  • X will still be used for graphics on Linux, although yet another project will start up to “replace X with something modern”, get a lot of press and then fail.

In 5 years:

  • Apple will single handedly control 1/3rd the mobile phone market
  • The other 2/3rds will be divided between Blackberry (small), Windows Mobile and Android.
  • Linux desktop market share will be much higher than Apple’s

That’s all for now…

libmallocfail

Bazaar branches of libmallocfail

Simple LD_PRELOAD library that will take parameters via environment variables and cause malloc() to occationally fail.

Aim was to use this to test bits of MySQL/Drizzle although since their libtool based stuf, the binary in tree is a libtool shell script, and I haven’t found a way to LD_PRELOAD only for mysqld and not the shell script and the other processes spawned by it.

I have found a bug in libc though :)

Goodbye FRM (or at least the steps to it)

Since before MySQL was MySQL, there has been the .FRM file. Of course, what it really wanted to be was “.form” -  a file that stored how to display a form on your (green) CRT. Jan blogged earlier in the year on this still being there, even in MySQL 5.1 (albeit not in any useful form).

So why do we want it to die?

Well… it’s not exactly very useful anymore.

There are a few things it’s used for….

If database/table.frm exists, the table exists (or, on Windows, you may also get databasetable.frm). This is tested in a few bits in the code by a call to access(2).

Most engines have their own data dictionaries (Innodb, PBXT, NDB, Falcon). Keeping these in sync with the FRMs can be problematic at best. This is especially true with distributed engines such as NDB.

The current solution is that on the SQL node that is creating the table, we create the FRM file, gzip it, and store it in the cluster. Then, other nodes, if they go “err… no local frm” first call ha_create_table_from_engine() which NDB will go and see if the table exists in the cluster. If so, it copies the FRM from the cluster to local disk and then the SQL server continues on its way with the standard way of opening a table (through the FRM). If you do DDL through the NDB-API (and not via SQL) then well… you get to keep both pieces.

As for if you crash during a table rename (with any engine with its own data dictionary.. e.g. InnoDB)… you again get to keep both pieces. (There is a bit of discussion on this over here)

Having FRM files also doesn’t especially lead to having multiple versions of table metadata co-existing in the server.

The fun part of reading a frm is open_binary_frm in table.cc. It reads in the frm into a TABLE_SHARE. If we only had some other way of filling out a TABLE_SHARE… one from the engine itself…

But what about any metadata that the engine data dictionary doesn’t have? For example, many server types may map to 1 engine type. An example of this is the GIS types in MySQL. For most engines, these just map to BLOBs. The engine itself has no knowledge about that, but we should fill out the table definition correctly…. so for this type of thing the engine may need to store some additional metadata. This is pretty easy for transactional engines: put it in a table! (although you then have your own problem about keeping this synchronised with any DDL). For engines that don’t have their own data dictionary, we can just provide a set of routines to store/read a frm type file (based on protobufs no doubt).

There also seems to be some entanglement with LOCK_open. Ahhh LOCK_open, the lock that nobody can possibly understand.

The tricky thing will be not rewriting every little bit from scratch all at once but rather go for the incremental bits….