Drizzle Tarballs for next milestone: Aloha

Wanting a quick build-and-play way to get Drizzle? We’re dropping weekly-ish tarballs for the Aloha milestone. The latest milestone also has preliminary GCC 4.4 support.

You can see regular announcements on:

Pluggable Metadata stores (or… the revenge of table discovery)

Users of the ARCHIVE or NDB storage engines in MySQL may be aware of a MySQL feature known as “table discovery”. For ARCHIVE, you can copy the archive data file around between servers and it magically works (you don’t need to copy the FRM). For MySQL Cluster (NDB), when you CREATE TABLE on one MySQL server, the other MySQL servers can get the FRM for that table from the cluster.

With my work to replace the FRM with a protobuf structure in Drizzle and clean up parts of the API around it, this feature didn’t really survive in any working state.

Instead, I’m now doing things closer to the right way: pluggable metadata stores. The idea is that the whole “table proto on disk” code (in MySQL it’s the FRM, but in Drizzle we’re now using a protobuf structure) is pluggable and could be replaced by an implementation specific to an engine (e.g. the InnoDB or NDB data dictionaries) or a different generic one.
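
To make the shape of it concrete, here’s a minimal sketch of what such a pluggable store could look like. The names (TableMetadataStore and friends) are mine for illustration only, not the actual Drizzle plugin interface:

#include <string>

/* Hypothetical sketch: a metadata store reduced to the operations
   the server needs. The default implementation would read and write
   a serialised table proto as a file per table; an engine like NDB
   could instead answer these straight from its own data dictionary. */
class TableMetadataStore
{
public:
  virtual ~TableMetadataStore() {}

  /* Fetch the serialised table proto for db.table into out. */
  virtual bool get_table_proto(const std::string &db,
                               const std::string &table,
                               std::string &out)= 0;

  /* Store (create or replace) the serialised table proto. */
  virtual bool put_table_proto(const std::string &db,
                               const std::string &table,
                               const std::string &proto)= 0;

  /* Remove the metadata on DROP TABLE. */
  virtual bool drop_table_proto(const std::string &db,
                                const std::string &table)= 0;
};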

Currently, the default plugin is the same way we’ve been doing it forever: file-per-table on disk in a directory that’s the database. The API has a nasty bit now (mmmm… table name encoding), but that’ll be fixed in the future.

The rest of this week will be dedicated to plugging this into all the bits in the server that manipulate the files manually.

With luck, I’ll have modified the ARCHIVE engine by then too so that there’ll just be the archive data file on disk with the table metadata stored in it.

Size of Storage Engines

For whatever reason, let’s look at “Total Physical Source Lines of Code” from a recent mysql-6.0 tree (and PBXT from PBXT source repo):

See the spreadsheet here.

Raw data:

Blackhole        336
CSV             1143
Archive         2960
MyISAM         34019
PBXT           41732
Maria          69019
InnoDB         82557
Falcon         91158
NDB           365272

NDB has a 100,000 line test suite.

PBXT supports MySQL and Drizzle.

Conclusions to draw? Err… none really.

Does linux fallocate() zero-fill?

In an email discussion about pre-allocating binlogs for MySQL (something we’ll likely have to do for Drizzle and replication), Yoshinori brought up the excellent point that in some situations you don’t want to be doing zero-fill, as getting up and running quickly is the most important thing.

So what does Linux do? Does it zero-fill, or behave sensibly and pre-allocate quickly?

Let’s look at the kernel:

Inside the fallocate implementation (fs/open.c):

if (inode->i_op->fallocate)
        ret = inode->i_op->fallocate(inode, mode, offset, len);
else
        ret = -EOPNOTSUPP;

and for ext4:
/*
 * currently supporting (pre)allocate mode for extent-based
 * files _only_
 */
if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
        return -EOPNOTSUPP;

XFS has always done fast pre-allocate, so it’s not a problem there and the only other filesystems to currently support fallocate are btrfs and ocfs2 – which we don’t even have to start worrying too much about yet :)

But this is just kernel behaviour – I *think* libc ends up wrapping an EOPNOTSUPP from the kernel with “let me zero-fill” behaviour (which might be useful to check). Anybody want to check libc for me?

This was all on slightly post 2.6.30-rc3 (from git: 8c9ed899b44c19e81859fbb0e9d659fe2f8630fc)
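
If you want to poke at this yourself without libc getting in the way, here’s a sketch that calls the raw syscall directly (assuming a 64-bit Linux box). As far as I know, glibc’s posix_fallocate() emulates allocation by writing into the file when the kernel returns EOPNOTSUPP, which is exactly the slow zero-fill path we’re trying to avoid for binlogs:

#include <sys/syscall.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>

int main()
{
  int fd= open("binlog.prealloc", O_CREAT|O_WRONLY, 0644);
  if (fd < 0)
  {
    perror("open");
    return 1;
  }

  /* mode 0: plain preallocation. Ask for 1GB from offset 0.
     Calling the raw syscall means no libc fallback can hide
     the kernel saying "not supported". */
  if (syscall(SYS_fallocate, fd, 0, (off_t)0, (off_t)1 << 30) != 0)
  {
    if (errno == EOPNOTSUPP)
      fprintf(stderr, "no fast preallocation on this filesystem\n");
    else
      perror("fallocate");
  }
  close(fd);
  return 0;
}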

c++ stl bitset only useful for known-at-compile-time number of bits

Found in the libstdc++ docs:

Extremely weird solutions. If you have access to the compiler and linker at runtime, you can do something insane, like figuring out just how many bits you need, then writing a temporary source code file. That file contains an instantiation of bitset for the required number of bits, inside some wrapper functions with unchanging signatures. Have your program then call the compiler on that file using Position Independent Code, then open the newly-created object file and load those wrapper functions. You’ll have an instantiation of bitset for the exact N that you need at the time. Don’t forget to delete the temporary files. (Yes, this can be, and has been, done.)

Oh yeah – feel the love.

Brought to you by the stl-is-often-worse-for-you-than-meth dept.
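
For the record, the limitation being worked around there is mundane: the number of bits in std::bitset is a template parameter, so it has to be a compile-time constant. A quick illustration (a sketch; for runtime-sized bitsets you end up with vector<bool> or rolling your own):

#include <bitset>
#include <vector>
#include <cstddef>
#include <cstdio>

int main()
{
  std::bitset<128> fixed;         /* N is a template parameter, so it
                                     must be known at compile time */
  fixed.set(42);

  size_t n= 128;                  /* runtime-sized? you get to use
                                     vector<bool> instead */
  std::vector<bool> dynamic(n);
  dynamic[42]= true;

  printf("%zu %d\n", fixed.count(), (int)dynamic[42]);
  return 0;
}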

Drizzle low-hanging-fruit

We have an ongoing Drizzle milestone called low-hanging-fruit. The idea is that when there’s something that could be done, but we don’t quite have the time to do it immediately, we add a low-hanging-fruit blueprint so that people looking to get started on the codebase and contribute code to Drizzle have a place to go to find things to do.

Some of my personal favourites are:

Writing some plugins is also relatively low-hanging fruit. Some simple plugin types include:

  • Authentication
    Got somewhere that you could authenticate against for connecting to a DB? Write a plugin for it! Current auth plugins are auth_http and auth_pam.

    • Perhaps you want to authenticate against a central DB, checking in memcached first?
    • Perhaps a htaccess style method
  • Functions
    Apply some function to a column. These are pretty simple to write (see the md5 and compress examples; there’s also a ROT13 sketch just after this list). Perhaps interfaces to encryption/decryption? A hashing function?

    • ROT13
    • 3DES
    • AES
      Bonus points if you get any of these to use the T2000 crypto accelerator stuff
    • ID3 tag decoding
    • file type detection (well.. BLOB)
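
As promised above, the guts of a ROT13 function are tiny. A sketch of just the core transformation (the Drizzle plugin boilerplate is left out; crib that from the md5 plugin):

#include <string>
#include <cstddef>

/* The core of a ROT13 function: rotate ASCII letters 13 places and
   leave everything else alone. ROT13 is its own inverse, so one
   function does both directions. */
std::string rot13(const std::string &in)
{
  std::string out(in);
  for (size_t i= 0; i < out.size(); i++)
  {
    char c= out[i];
    if (c >= 'a' && c <= 'z')
      out[i]= (char)('a' + (c - 'a' + 13) % 26);
    else if (c >= 'A' && c <= 'Z')
      out[i]= (char)('A' + (c - 'A' + 13) % 26);
  }
  return out;
}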

So there’s a fair bit you can do to get started. Best of all, you can chat with the Drizzle developers next week at the MySQL Conference and Expo and Drizzle Developer Day.

Drizzle Developer Day reminder

We’re having a Drizzle Developer Day just after the MySQL Conference and Expo next week. You don’t have to be attending the conference to come to the Drizzle Developer Day. Just bring your enthusiasm for free databases, Drizzle and good software. Spaces are limited, so head on over to the signup page and fill in your name if you haven’t already.

If you’re coming from the Hyatt Regency Santa Clara (where the MySQL Conference and Expo is), I will at least be driving from there, so let me know if you want a lift.

Using Dtrace to find out if the hardware or Solaris is slow (but really just working around the problem)

A little while ago, I was the brave soul tasked with making sure Drizzle was working properly and passing all tests on Solaris and OpenSolaris. Brian recently blogged about some of the advantages of also running on Solaris and the SunStudio compilers – more warnings from the compiler is a good thing. Much kudos goes to Monty Taylor for being the brave soul who fixed most of the compiler warnings for the SunStudio compilers (and for us, warnings=errors – so we have to fix them) before I got to making the tests work.

So, I got to the end of it all and got pointed to an OpenSolaris x86 box where the drizzleslap test was timing out. The timeout for tests is some amazingly long amount of time – 15 minutes. All the drizzle-test-run tests are rather short tests.

To make running the tests quick, I usually LD_PRELOAD libeatmydata – a simple way of disabling pesky things like fsync that take a long time (rumors that I nickname it libmacosxsimulation are entirely true). It’s pretty simple to build libeatmydata on Solaris too (I periodically do this and always intend to check in the associated Makefile but never do).
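
For those who haven’t seen it, the core of the trick fits in a handful of lines. A miniature sketch (the real libeatmydata covers more than just fsync() and does it properly):

/* libeatmydata in miniature: LD_PRELOAD this and fsync() becomes a
   no-op, so the kernel writes back whenever it feels like it.
   Great for test runs; do NOT use it anywhere you care about data.
   Build: g++ -shared -fPIC -o nofsync.so nofsync.cc
   Use:   LD_PRELOAD=./nofsync.so ./drizzled ...                    */
extern "C" int fsync(int fd)
{
  (void)fd;
  return 0;    /* lie: claim everything is safely on disk */
}

extern "C" int fdatasync(int fd)
{
  (void)fd;
  return 0;
}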

Unfortunately, on OpenSolaris a bunch of things are built 32bit and others 64bit and just doing “LD_PRELOAD=libeatmydata.so ./dtr” doesn’t work – I’d have to modify the test script to only do the LD_PRELOAD for drizzled – which is annoying.

On my T1000 running Debian, the drizzleslap test takes 42 seconds to complete with libeatmydata, or 393 seconds when it’s really doing fsyncs. So for it to be timing out on this OpenSolaris x86 box – i.e. taking more than 15 minutes, was strange.

So… what was going on? Step 1: is anything actually going on? One way to test this is to see if disk IO is being generated. On Linux, we can use “iostat”. On Solaris, we can use “zpool iostat”. Things were going to disk for the whole time of the test. Time to compare what the difference between the platforms is.

Well.. a typical reason tests have taken forever has been lots of transactions: i.e. lots of fsync(). You are then dependent on fsync() performance.

If we look at “iostat -x” and the avgrq-sz field on Linux, we’ll see that the average request size is around 10-12 sectors (512 byte blocks), i.e. about 5 or 6kb.

If we look at “zpool iostat 1” on OpenSolaris, we see a bit of a different story, but similar enough that you could safely assume lots of small synchronous IOs were going on. After a bit of reading of the ZFS on-disk format documents, I had a slightly better idea of what could be causing the larger average request size I was seeing on ZFS compared to XFS on Linux.

So… perhaps it’s the speed of these syncs? Ordinarily, I’d just write up a quick LD_PRELOAD library that wraps fsync() and times it (perhaps writing to a file so I could do analysis on it later). Since I was working on Solaris… I thought I’d try DTrace. Some google-foo and dtrace hacking later, I tried this:

stewart@drizzle-dev:~/drizzle/sparc$ time pfexec dtrace -n 'syscall::fdsync:entry /execname == "drizzled"/ { self->ts[self->stack++] = timestamp; } syscall::fdsync:return /self->ts[self->stack - 1]/ { this->elapsed = timestamp - self->ts[--self->stack]; @[probefunc] = count(); @a[probefunc] = quantize(this->elapsed); self->ts[self->stack] = 0; }'

dtrace: description 'syscall::fdsync:entry ' matched 2 probes
^C

  fdsync                                                         1600
  fdsync
           value  ------------- Distribution ------------- count
        33554432 |                                         0
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1520
       134217728 |@@                                       79
       268435456 |                                         1
       536870912 |                                         0        

real	4m26.837s
user	0m0.657s
sys	0m0.566s

Which did seem like an awfully long time for an fsync() to take: the quantize buckets are nanoseconds, so most of those 1600 fdsync calls took upwards of 67ms each. Although the filesystem was on a single disk, it was a relatively recent one, and it’s sitting on a Sun controller… so it should do a bit better than that. From reading some of the ZFS on-disk spec, it could be some bug that means we’re waiting for a checkpoint to be written instead of forcing the sync out when we call fsync() – but I sought another solution (as this wasn’t a problem on other Solaris/OpenSolaris systems – so perhaps it’s fixed in newer kernels, or it’s a driver issue).

So I went and added “--commit=100” to a bunch of places in the drizzleslap test to batch things into transactions, the idea being to greatly reduce the number of fsync() calls and bring the execution time of the drizzleslap test on that machine below 15 minutes. A bit of jiggerypokery later (some tests needed to not have the --commit to avoid various locking foo) and I had something that should run.

Now, ~113 seconds on the T1000 on Linux (with a single SATA disk, down from an original 393 seconds) and ~437 seconds on the OpenSolaris box. For giggles, I tried it on a Solaris box that’s running UFS on a 10k RPM SAS drive: ~44 seconds.

In Summary:

T1000, Linux, libeatmydata, XFS: ~42 seconds (before optim)
T1000, Linux, 7200RPM SATA, XFS: ~113 seconds
T5240, Solaris 10, 10k RPM SAS, UFS: ~44 seconds
16 core Xeon, OpenSolaris, 7200RPM, ZFS: ~437 seconds

So, on that hardware setup – something is strange. The 10k SAS drive on UFS on the CoolThreads box is really nice though…. makes me want that kind of disk here.

This page was useful, and I used it as a basis for some of my DTrace scripts: http://fav.or.it/post/1146360/dtrace-and-the-mighty-hercules

Also thanks to several people on #opensolaris on Freenode who helped me out with various Solaris specific commands in tracking this down.

No implicit defaults

See also: MySQL Bug 43151

The MySQL Manual proudly states that you don’t get implicit default values in strict mode.

mysql> set sql_mode='STRICT_ALL_TABLES';
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (a int, b int, c int, d int, e int, f int, g int, h int,i int,
primary key (a,b,c,d,e,f,g,i,h)) engine=MyISAM;
Query OK, 0 rows affected (0.03 sec)

mysql> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `a` int(11) NOT NULL default '0',
  `b` int(11) NOT NULL default '0',
  `c` int(11) NOT NULL default '0',
  `d` int(11) NOT NULL default '0',
  `e` int(11) NOT NULL default '0',
  `f` int(11) NOT NULL default '0',
  `g` int(11) NOT NULL default '0',
  `h` int(11) NOT NULL default '0',
  `i` int(11) NOT NULL default '0',
  PRIMARY KEY  (`a`,`b`,`c`,`d`,`e`,`f`,`g`,`i`,`h`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> insert into t1 (a) values (1);
Query OK, 1 row affected (0.00 sec)

Which means you’re getting a default value you never specified.

In my latest Drizzle patch, we no longer give you what you didn’t ask for (an implicit default).

You can still (of course) specify an explicit default (and I’ve done this in some of our test cases).

Improving the Storage Engine “API”

I increasingly enclose the API part of “Storage Engine API” in quotes as it scores a rather large number on the API Design Rusty levels (coined by Rusty Russell). I give it a 15 (out of 18; lower is better): in this case, “The obvious use is wrong”.

The idea is that your handler gets called to write a row (the amazingly named handler::write_row()). It’s passed a buffer which is the row to be stored. An engine that uses the MySQL row format (let’s say ARCHIVE) will simply pack the row and write it out.

Unless there is a TIMESTAMP field with auto-set on insert, that is. Up until now (and still now in MySQL) the engine had to check whether this was the case and make sure the timestamp field was updated.
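
Quoting from memory, the check every such engine had to carry in its write_row() looked roughly like this fragment (a sketch, not a verbatim excerpt from any engine):

/* somewhere inside an engine's handler::write_row(), before the
   row buffer gets packed and stored -- the engine, not the server,
   had to remember to do this: */
if (table->timestamp_field_type & TIMESTAMP_AUTO_SET_ON_INSERT)
  table->timestamp_field->set_time();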

To remove this particular bonghit is actually a really small patch, which Jay recently got merged:

~drizzle-developers/drizzle/development : revision 873.1.16

Hopefully somebody does this soon for MySQL as well.

linux.conf.au 2009 wrap-up (incl Open Source Databases Mini-conf): Day 0-1

It’s no secret that I love linux.conf.au. My first was linux.conf.au 2003, in Perth, and I’ve been to every one since (there are at least two people who’ve been to every single one, including CALU as it was called in 1999).

I’ve been on the board of Linux Australia for some insane proportion of the years since then (joining in 2003). Linux Australia is the not-for-profit community organisation that puts on linux.conf.au. It’s all volunteers and amazingly enough we have more than one group of people wanting to put on linux.conf.au each year!

This year, we Marched South to Hobart.

Here I detail what I saw, what I wish I saw and whatever else comes to mind.

Sunday – Before the conference

Ran into Bdale while checking in. Short flight down. A million and one people on the plane and on the ground that I knew. It must be linux.conf.au.

Seeing way too many awesome people I know, checking into accommodation (oh my, what a hill), registering for conf, beer and then off to a “ghosts of conferences past” dinner – where a few people who had organised previous linux.conf.au’s were hastily gathered together to chat to part of the 2010 team.

Monday – Open Source Databases Miniconf Day 1

Oh, that’s right – I’m running the OSDB Miniconf :)

First up, Monty Taylor spoke on “NDB/Bindings – Use the MySQL Cluster Direct API from languages you actually like for fun and profit”. Possibly taking the prize for the longest talk title of the conference. The NDB API is not SQL, it’s what the MySQL server (and one day, when Monty and I get around to it, Drizzle) translates SQL into for NDB. That being said, you can (pretty much always) write NDB API code that dramatically outperforms equivalent SQL (for a variety of reasons). Monty maintains the NDB/Bindings project that lets you use languages other than C++ for the NDB API.

At the same time as Monty was speaking, I wish I’d been able to fork() and go and see “Is Parallel Programming Hard, And, If So, Why?” by Paul McKenney, and Michael Still talking about MythNetTV (pulling RSS feeds of video in as MythTV programs).

After morning tea, we were meant to have “InnoDB scaling up and performance” by Bruce Huang, but he was a no-show. Hint: if you don’t want bad things to be said about you by conference organisers, either show up or let them know you’re not able to make it.

Instead, we led a crazy Q&A type session around the room which was a whole lot of fun. Really an “ask the experts” meets running up-and-down stairs with a microphone.

Next up, Arjen Lentz, who runs Open Query, spoke on “OurDelta: Builds for MySQL”. The best way to describe OurDelta is as a “distribution of MySQL”. It’s the MySQL server plus a bunch of patches provided by various people that haven’t yet made it into the main source tree (for any number of reasons).

At the same time (if you’ve never been to linux.conf.au, you’ll find that you often want to be in at least 3 places at once) I would have really liked to see “MythTV Internals by Nigel Pearson” (I co-wrote Practical MythTV with Michael Still, which is having a “second edition” in wiki form over at http://www.mythtvbook.com/) as well as the panel on geek parenting as this may be something I’m one day faced with.

Up next: Russell Coker filled in for Kaigai (same talk, different speaker) to talk on The Security-Enhanced PostgreSQL – “System-wide consistency” in access controls. I found this quite interesting; different approaches to database security are worth looking at. Modern applications (read: web applications) don’t map their users to database users at all. There are usually two users on the database server: the super user and the user that the app uses. It would be nice to have a good solution for those who want it.

Again, if I had the ability to be in two places at once, I would have also seen “How I Learned To Stop Worrying And Love ACPI” by the extremely handsome Matthew Garrett.

Monty Widenius (blog here – and yes, we have two Montys now… which does cause confusion) talked about the Maria storage engine. Maria is based on MyISAM, but adds crash safety and transactions (among other things).

Again, if I was able to be in several places at once I would have also seen Rusty‘s “Large CPUmasks”, Nathan Scott talking about “System level performance management with PCP” and Bdale’s “Collaborating Successfully with large corporations”.

An awesome start to the conference.

Fun with the 387

Filed GCC bug 39228:

#include <stdio.h>
#include <math.h>
int main()
{
        double a= 10.0;
        double b= 1e+308;
        printf("%d %d %dn", isinf(a*b), __builtin_isinf(a*b), __isinf(a*b));
        return 0;
}

mtaylor@drizzle-dev:~$ gcc -o test test.c
mtaylor@drizzle-dev:~$ ./test
0 0 1
mtaylor@drizzle-dev:~$ gcc -o test test.c -std=c99
mtaylor@drizzle-dev:~$ ./test
1 0 1
mtaylor@drizzle-dev:~$ gcc -o test test.c   -mfpmath=sse -march=pentium4
mtaylor@drizzle-dev:~$ ./test
1 1 1
mtaylor@drizzle-dev:~$ g++ -o test test.c
mtaylor@drizzle-dev:~$ ./test
1 0 1

Originally I found the simple isinf() case to be different on x86 than x86-64, ppc32 and sparc (32 and 64).

After more research, I found that x86-64 uses the sse instructions to do it (and using sse is the only way for __builtin_isinf() to produce correct results). For the g++ built version, it calls __isinf() instead of inlining (and as can be seen, the __isinf() version is always correct).

Specifically, it’s because the optimised 387 code does the math in double extended precision inside the FPU: 10.0*1e308 fits in 80 bits but not in 64. Any code that forces the result to be stored and loaded gets the correct result too. e.g.

mtaylor@drizzle-dev:~$ cat test-simple.c

#include <stdio.h>
#include <math.h>
int main()
{
        double a= 10.0;
        double b= 1e+308;
        volatile double c= a*b;
        printf("%d\n", isinf(c));
        return 0;
}

mtaylor@drizzle-dev:~$ gcc -o test-simple test-simple.c
mtaylor@drizzle-dev:~$ ./test-simple
1

With this code you can easily see the load and store:

 8048407:       dc 0d 18 85 04 08       fmull  0x8048518
 804840d:       dd 5d f0                fstpl  -0x10(%ebp)
 8048410:       dd 45 f0                fldl   -0x10(%ebp)
 8048413:       d9 e5                   fxam

While if you remove volatile, the load and store doesn’t happen (at least on -O3, on -O0 it hasn’t been optimised away):

 8048407:       dc 0d 18 85 04 08       fmull  0x8048518
 804840d:       c7 44 24 04 10 85 04    movl   $0x8048510,0x4(%esp)
 8048414:       08
 8048415:       c7 04 24 01 00 00 00    movl   $0x1,(%esp)
 804841c:       d9 e5                   fxam

This is also a regression from 4.2.4, which just called isinf() and didn’t expand the 387 code inline. My guess is the 387 optimisation was added in 4.3.

Recommended fix: store and load in the 387 version so as to operate on the same precision as elsewhere.

Now I just have to make a patch I like that makes Drizzle behave in the face of this (it showed up as a failure in the SQL func_math test) and then submit it to MySQL as well… as this may happen there too if “correctly” built.

floating point is not fun

#include <stdio.h>
#include <math.h>

int main()
{
        double a= 10.0;
        double b= 1e+308;
        printf("%dn",isinf(a * b));
        return 0;
}

Prints 1 on: 64bit Intel, 32bit PowerPC, 32bit SPARC, 64bit SPARC. But prints 0 on 32bit Intel.

Oh, but if you build that with g++ instead of gcc on 32bit Intel, you get 1.

The FRM file format

It’s fortunate that I’m watching Veronica Mars again with a mate; a more-than-you-think amount of detective work is required to understand the relationship (and format) of the TABLE_SHARE, the FRM file and HA_CREATE_INFO. Oh, and you’ll want drizzled/base.h and drizzled/structs.h open; drizzled/table_share.h is also a good one to have handy.

The FRM file is really a FoRM file from UNIREG (see copies of really old mysql docs around the place or even better, the links off Sheeri‘s blog post). Also, Jan has some thoughts on FRM too and Thava has a scary frmdump php script.

I have to agree completely with Jan:

  • “the internals document is missing all the interesting parts”
    I’ve just read the source. Everything in the internals doc is easily gotten from the source, so when I finally did take a close look it was “I know all this”. I can’t fault the docs team here at all – I’d place this 3rd from the bottom in priorities. In fact, it’s better to fix it than to document it.
  • ” get the hands dirty and get into the code … it got really dirty”
    Oh yeah – and it’s only gotten worse with things added to it.

It contains interesting nuggets like unireg_check (or unireg_type, depending on where you read) that does:

 enum utype  { NONE,DATE,SHIELD,NOEMPTY,CASEUP,PNR,BGNR,PGNR,YES,NO,REL,
                CHECK,EMPTY,UNKNOWN_FIELD,CASEDN,NEXT_NUMBER,INTERVAL_FIELD,
                BIT_FIELD, TIMESTAMP_OLD_FIELD, CAPITALIZE, BLOB_FIELD,
                TIMESTAMP_DN_FIELD, TIMESTAMP_UN_FIELD, TIMESTAMP_DNUN_FIELD};

But really only the timestamp things… which should be default magic, but it’s somewhere tied in (Jay had thoughts last time I spoke to him… hopefully going away soon). A bunch of these aren’t ever used and are just relics from UNIREG. In fact… I went and removed what wasn’t needed and just ended up with:

  enum utype  { NONE,
                NEXT_NUMBER,
                TIMESTAMP_OLD_FIELD,
                TIMESTAMP_DN_FIELD, TIMESTAMP_UN_FIELD, TIMESTAMP_DNUN_FIELD};

Which does seem a bit nicer. The fact that TIMESTAMP_OLD_FIELD is used as an interim value is, well, scary. At least with a smaller set of possibilities it will be easier to convert into the proto format.

A hint of a brighter future is in the comment there:

/*
    We use three additional unireg types for TIMESTAMP to overcome limitation
    of current binary format of .frm file. We'd like to be able to support
    NOW() as default and on update value for such fields but unable to hold
    this info anywhere except unireg_check field. This issue will be resolved
    in more clean way with transition to new text based .frm format.
    See also comment for Field_timestamp::Field_timestamp().
  */

Hrrm… a text based FRM? That would be much nicer to read the code for. Unfortunately, it doesn’t really exist. Some FRMs are text in MySQL, but not ones to do with tables. You can look at the FRM for a VIEW in a text editor and see the SQL quite easily (the file format is text).

So I can’t go look at any nice text based format code – it’s all uint2korr() and friends. Yes folks, this is about the only place left in the code with function names in Swedish. What does korr mean? “accurate, correct, correctly”. If you look at korr.h, you’ll see that it’s just for storing in machine independent format: low byte first.

My favourite korr functions:

  • uint3korr
    which reads 4 bytes (a portable version is sketched after this list), so remember to alloc it, initialise it or Valgrind will make you its bitch.
  • uint5korr
    err… 5 bytes of course
  • uint6korr
    6 bytes (getting the pattern now)
  • uint7korr
    which doesn’t actually exist. Nobody loves Seven – George Costanza was wrong.
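
What the korr helpers actually do is trivial. A portable uint3korr, as opposed to the optimised x86 one that reads a whole 4-byte word and masks it (hence the alloc warning above), looks something like this sketch:

#include <stdint.h>

/* Decode a 3-byte integer stored low byte first, one byte at a time.
   The optimised x86 definition instead reads a full 4-byte word and
   masks off the top byte, which is why that fourth byte had better
   be allocated (and initialised, or Valgrind will complain). */
static inline uint32_t uint3korr_portable(const unsigned char *p)
{
  return (uint32_t)p[0]
       | ((uint32_t)p[1] << 8)
       | ((uint32_t)p[2] << 16);
}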

It also (as Jan showed) does the whole layout on an 80 column terminal for you! This functionality is going, going, gone in Drizzle and won’t be coming back.

There’s also an “empty record on start of formfile” (see make_empty_rec in unireg.cc). This bit is going to cause me some pain relatively soon. Not so much in writing something like it out (default values can easily be put in the proto) but in then constructing it on open (with some careful footing around the issue of the egg coming before the chicken).

Incidentally, when discussing this with Daniel Stone (and explaining all the weirdness) it did cause him to exclaim “omg, it’s XKB!” – so that probably helps the X hackers in the room to relate.

The biggest test in moving from FRM to proto is to only rewrite this part of the code – the TABLE_SHARE, field, Create_foo etc have sooo many bits I want to change/fix. Going down the rat-hole into an endless cycle of fixing is always a possibility. Sometimes (like with unireg_type) the cleanup lets me really discover what the code is doing, so that’s being done (but will go away “soon”).