A warning to Solaris users…. (fsync possibly doesn’t)

Read the following:

Linux has its fair share of dumb things with data too (ext3 not defaulting to using write barriers is a good one). This is however particularly nasty… I’d have really hoped there were some good tests in place for this.

This should also be a good warning to anybody implementing advanced storage systems: we database guys really do want to be able to write things reliably and you really need to make sure this works.

So, Stewart’s current list of stupid shit you have to do to ensure a 1MB disk write goes to disk in a portable way:

  • You’re a database, so you’re using O_DIRECT
  • Use disk writes smaller than 32k
  • fsync()
  • Write 32-64MB of sequential data to hopefully force everything out of the drive write cache and onto the platter so it survives power failure (because barriers may not be on). Increase this based on whatever caching system happens to be in place. If you think there may be battery-backed RAID… maybe 1GB or 2GB of data writes
  • If you’re extending the file, don’t bother… that especially seems to be buggy. Create a new file instead.
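
The core of that list (small writes, then fsync(), with every return value checked) can be sketched in C++ on top of the POSIX calls. This is only a sketch and no guarantee of durability: write_and_sync() and the 16k chunk size are made up for illustration, O_DIRECT is left out because it needs aligned buffers and isn't available everywhere, and the "write lots of extra data to flush the drive cache" step is not attempted at all.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical helper: write len bytes to a brand-new file in chunks
// smaller than 32k, then fsync() before closing. Returns true only if
// every step succeeded.
bool write_and_sync(const std::string &path, const char *data, size_t len)
{
  assert(data != nullptr || len == 0);
  const size_t chunk = 16 * 1024;  // assumed chunk size, under 32k

  // O_EXCL: create a new file rather than extending an existing one.
  int fd = open(path.c_str(), O_WRONLY | O_CREAT | O_EXCL, 0644);
  if (fd < 0)
    return false;

  size_t done = 0;
  while (done < len)
  {
    size_t want = (len - done < chunk) ? len - done : chunk;
    ssize_t n = write(fd, data + done, want);
    if (n < 0)
    {
      if (errno == EINTR)
        continue;  // interrupted before writing anything: retry
      close(fd);
      return false;
    }
    done += static_cast<size_t>(n);  // short writes just loop again
  }

  // fsync() can fail too; ignoring its return value hides lost writes.
  bool ok = (fsync(fd) == 0);
  if (close(fd) != 0)
    ok = false;
  return ok;
}
```

Calling write_and_sync("newfile.dat", buf, len) on a path that already exists fails on purpose, which matches the "create a new file instead" advice.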

Of course you could just assume that the OS kind of gets it right…. *laugh*

nocache LD_PRELOAD

Want to do something like “cp big_file copy_of_big_file” or “tar xfz big_tarball.tar.gz” but without thrashing your cache?

Enrico Zini has a nice little LD_PRELOAD called nocache.

$ nocache tar xfz foo.tar.gz

Goes well with libeatmydata. A pair of tools for compensating for your Operating System casually hating you.

I imagine people will love this when taking database backups.

Exporting a set of bzr revisions as a quilt series

There has to be a better way than this… but it does work (at least for revisions 11 through 141):

# Export each revision that does not touch tests/ as a quilt patch.
for rev in $(seq 11 141);
do
prev=$((rev - 1))
# Skip revisions whose diff touches anything under tests
if [ -z "$(bzr diff -r$prev..$rev | diffstat -p0 -l | grep ^tests)" ];
then
# Patch header: the log message with the bzr metadata lines stripped
(bzr log -r$rev --forward --log-format=long |
sed -e 's/^  //;
/^------------------------------------------------------------/d;
/^revno:.*$/d; /^committer:.*/d; /^branch nick:/d;
/^timestamp: /d; /^message:/d';
echo;
echo;
bzr diff -r$prev..$rev --prefix a/storage/innodb_plugin/:b/storage/innodb_plugin/) > patches/$rev.patch ;
echo $rev.patch >> patches/series;
fi;
done

Developing my own film

dedicated bench, originally uploaded by macplusg3.

This is from the first film I’ve ever developed myself. I know a lot of people who’ve done this in school or something, but I never did, so it’s just me, teaching myself (and playing with chemicals).

This was shot one day when I went out riding down in Black Rock (not too far from home). There’s something about benches dedicated to people that just twinges something in my brain… How do you get to the point where you think a great way to remember someone is to have a plaque on a bench? Carrying a camera while bike riding is quite useful sometimes.

Shot on Lucky B&W SHD100 film on an early 1970s Canon rangefinder.

desktop-couch has been nothing but suck

$ du -sh /home/stewart/.cache/desktop-couch/desktop-couchdb.*
746M	/home/stewart/.cache/desktop-couch/desktop-couchdb.log
4.0K	/home/stewart/.cache/desktop-couch/desktop-couchdb.pid
16K	/home/stewart/.cache/desktop-couch/desktop-couchdb.stderr
653M	/home/stewart/.cache/desktop-couch/desktop-couchdb.stdout

$ du -sh /home/stewart/.local/share/desktop-couch/.gwibber_messages_design/2f3267703246f5e02533e59714915b7d.view 
436M	/home/stewart/.local/share/desktop-couch/.gwibber_messages_design/2f3267703246f5e02533e59714915b7d.view

I feel better already. I think the log files irritate me the most.

The Drizzle (and MySQL) Key tuple format

Here’s something that’s not really documented anywhere (unless you count ha_innodb.cc as a source of server documentation). You may have some idea about the MySQL/Drizzle row buffer format. This is passed around the storage engine interface: in for write_row and update_row and out for the various scan and index read methods.

If you want to see the docs for it that exist in the code, check out store_key_val_for_row in ha_innodb.cc.

However, there is another format that is passed to your engine (and that your engine is expected to understand) and for lack of a better name, I’m going to call it the key tuple format. The first place you’ll probably see this is when implementing the index_read function for a Cursor (or handler in MySQL speak).

You get two things: a pointer to the buffer and the length of the buffer. Since a key can be made up of multiple parts, some of which can be NULL and some of which can be of variable length, this buffer is not (usually) a simple value. If you are starting out in your engine development, you can use this buffer blindly as a single value for non-nullable indexes with only 1 column.

The basic format is this:

  • The buffer is in index order: the first column in the index is first in the buffer, the second column is second, and so on.
  • The buffer must be zero-filled. The server kernel will use memcmp to compare two key values.
  • If the column is NULLable, then the first byte for that column is 1 if the value is NULL and 0 if it is not.
  • From ha_innodb.cc (for BLOBs, which I haven’t put in embedded_innodb yet): If the column is of a BLOB type (it must be a column prefix field in this case), then we put the length of the data in the field to the next 2 bytes, in the little-endian format. If the field is SQL NULL, then these 2 bytes are set to 0. Note that the length of data in the field is <= column prefix length.
  • For fixed length fields (such as int), the next N bytes hold that field, where N is the maximum field length.
  • For VARCHAR, there is always a 2 byte (in little endian) length. This is different to the row format, which may have 1 or 2 bytes. In the key tuple format it is ALWAYS two bytes.
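
To make the layout concrete, here is a small sketch that builds a key tuple for an imaginary index on (nullable INT, VARCHAR). The helper names are made up for this post and are not part of any server API. Storing the integer little-endian and zero-padding the VARCHAR out to its maximum length are assumptions of this sketch, so check them against what your engine actually receives.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helpers that lay out a key tuple for an index on
// (nullable INT, VARCHAR(max_len)). They are not part of any server API.

// Nullable fixed-length part: one NULL byte, then 4 value bytes.
// Little-endian integer storage is an assumption of this sketch.
void append_nullable_int32(std::vector<uint8_t> &key, bool is_null, int32_t value)
{
  key.push_back(is_null ? 1 : 0);
  uint32_t v = is_null ? 0 : static_cast<uint32_t>(value);
  for (int i = 0; i < 4; i++)
    key.push_back(static_cast<uint8_t>((v >> (8 * i)) & 0xff));
}

// Non-nullable VARCHAR part: 2-byte little-endian length, then the data,
// zero-padded to max_len (padding assumed) so memcmp sees no stray bytes.
void append_varchar(std::vector<uint8_t> &key, const std::string &s, size_t max_len)
{
  assert(max_len <= 0xffff);
  size_t len = std::min(s.size(), max_len);
  key.push_back(static_cast<uint8_t>(len & 0xff));
  key.push_back(static_cast<uint8_t>((len >> 8) & 0xff));
  key.insert(key.end(), s.begin(), s.begin() + len);
  key.insert(key.end(), max_len - len, static_cast<uint8_t>(0));
}

// Read back the 2-byte length that starts a VARCHAR key part.
size_t varchar_key_length(const uint8_t *part)
{
  return static_cast<size_t>(part[0]) | (static_cast<size_t>(part[1]) << 8);
}
```

With a non-NULL INT of 7 and the string "ab" in a VARCHAR(4), this gives an 11 byte buffer: one NULL byte, four integer bytes, two length bytes and four (padded) data bytes.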

I’ll discuss the use of this for rnd_pos() and position() in a later post…

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Writing A Storage Engine for Drizzle, Part 2: CREATE TABLE

The DDL code paths for Drizzle are increasingly different from MySQL. For example, the embedded_innodb StorageEngine CREATE TABLE code path is completely different than what it would have to be for MySQL. This is because of a number of reasons, the primary one being that Drizzle uses a protobuf message to describe the table format instead of several data structures and a FRM file.

We are pretty close to having the table protobuf message format finalized (there are a few bits left to clean up, but expect them done Real Soon Now (TM)). You can see the definition (which is pretty simple to follow) in drizzled/message/table.proto. Also check out my series of blog posts on the table message (more posts coming, I promise!).

Drizzle allows either your StorageEngine or the Drizzle kernel to take care of storage of table metadata. You tell the Drizzle kernel that your engine will take care of metadata itself by specifying HTON_HAS_DATA_DICTIONARY to the StorageEngine constructor. If you don’t specify HTON_HAS_DATA_DICTIONARY, the Drizzle kernel stores the serialized Table protobuf message in a “table_name.dfe” file in a directory named after the database. If you have specified that you have a data dictionary, you’ll also have to implement some other methods in your StorageEngine. We’ll cover these in a later post.

If you have ever dealt with creating a table in MySQL, you may recognize this method:

virtual int create(const char *name, TABLE *form, HA_CREATE_INFO *info)=0;

This is not how we do things in Drizzle. We now have this function in StorageEngine that you have to implement:

int doCreateTable(Session* session, const char *path,
                  Table& table_obj,
                  drizzled::message::Table& table_message)

The existence of the Table parameter is largely historic and at some point will go away. In the Embedded InnoDB engine, we don’t use the Table parameter at all. Shortly we’ll also get rid of the path parameter, instead having the table schema in the Table message and helper functions to construct path names.

Methods named “doFoo” (such as doCreateTable) mean that there is a method named foo() (such as createTable()) in the base class. It does some base work (such as making sure the table_message is filled out and handling any errors) while the “real” work is done by your StorageEngine in the doCreateTable() method.
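
The shape of that foo()/doFoo() split looks roughly like the sketch below. The classes here are simplified stand-ins written for this post, not the real Drizzle declarations: TableMessage, ExampleEngine, the error code and the "fill out the message" step are all invented for illustration.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for illustration only: the real Drizzle classes
// take a Session, a Table and a protobuf table message.
struct TableMessage
{
  std::string name;
};

class StorageEngine
{
public:
  virtual ~StorageEngine() {}

  // Non-virtual entry point: does the shared work, then delegates.
  int createTable(const std::string &path, TableMessage &table_message)
  {
    if (path.empty())
      return 1;                       // hypothetical error code
    if (table_message.name.empty())
      table_message.name = path;      // "fill out" the message
    return doCreateTable(path, table_message);
  }

protected:
  // The "real" work, supplied by each engine.
  virtual int doCreateTable(const std::string &path,
                            TableMessage &table_message) = 0;
};

class ExampleEngine : public StorageEngine
{
public:
  int tables_created = 0;

protected:
  int doCreateTable(const std::string &, TableMessage &table_message) override
  {
    assert(!table_message.name.empty());  // base class guarantees this
    tables_created++;
    return 0;
  }
};
```

Callers only ever see createTable(), so the shared checks run for every engine and the engine author only has to think about doCreateTable().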

The Embedded InnoDB engine goes through the table message and constructs a data structure for the Embedded InnoDB library to create a table. The ARCHIVE storage engine is much simpler, and it pretty much just creates the header of the ARZ file, mostly ignoring the format of the table. The best bet is to look at the code from one of these engines, depending on what type of engine you’re working on. This code, along with the table message definition should be more than enough.

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Bike Riding in the storm

Out on a pier down St Kilda, the weather looked… well… like it could be a bit annoying on the way back:

but then… just a bit down the way…. it hit:

It was “a bit wet”. Big blocks of ice falling from the sky (that hurt).

Anyway, on the way back we found a storm water drain:

Yes, behind Michael is just all water (and I’m not talking about the Bay).

Still managed to get a 36.5km ride out of it, so not all bad.

Writing A Storage Engine for Drizzle, Part 1: Plugin basics

So, you’ve decided to write a Storage Engine for Drizzle. This is excellent news! The API is continually being improved and if you’ve worked on a Storage Engine for MySQL, you’ll notice quite a few differences in some areas.

The first step is to create a skeleton StorageEngine plugin.

You can see my skeleton embedded_innodb StorageEngine plugin in its merge request.

The important steps are:

1. Create the plugin directory

e.g. mkdir plugin/embedded_innodb

2. Create the plugin.ini file describing the plugin

Create the plugin.ini file in the plugin directory (so it’s plugin/plugin_name/plugin.ini).
An example plugin.ini for embedded_innodb is:

[plugin]
title=InnoDB Storage Engine using the Embedded InnoDB library
description=Work in progress engine using libinnodb instead of including it in tree.
sources=embedded_innodb_engine.cc
headers=embedded_innodb_engine.h

This gives us a title and description, along with telling the build system what sources to compile and what headers to make sure to include in any source distribution.

3. Add plugin dependencies

Your plugin may require extra libraries on the system. For example, the embedded_innodb plugin uses the Embedded InnoDB library (libinnodb).

Other examples include the MD5 function requiring either openssl or gnutls, the gearman related plugins requiring gearman libraries, the UUID() function requiring libuuid and BlitzDB requiring Tokyo Cabinet libraries.

For embedded_innodb, pandora-build has a macro for finding libinnodb on the system. We want to run this configure check, so we create a plugin.ac file in the plugin directory (i.e. plugin/plugin_name/plugin.ac) and add the check to it.

For embedded_innodb, the plugin.ac file just contains this one line:

PANDORA_HAVE_LIBINNODB

We also want to add two things to plugin.ini; one to tell the build system only to build our plugin if libinnodb was found and the other to link our plugin with libinnodb. For embedded_innodb, it’s these two lines:

build_conditional="x${ac_cv_libinnodb}" = "xyes"
ldflags=${LTLIBINNODB}

Not too hard at all! This should look relatively familiar for those who have seen autoconf and automake in the past.

Some plugins (such as the md5 function) have a bit more custom auto-foo in plugin.ini and plugin.ac (as one of two libraries can be used). You can do pretty much anything with the plugin system, but you’re a lot more likely to keep it simple like we have here.

4. Add skeleton source code for your StorageEngine

While this will change a little bit over time (and is a little long to just paste into here), you can see what I did for embedded_innodb in the skeleton-embedded-innodb-engine tree.

5. Build!

You will need to re-run ./config/autorun.sh so the build system picks up your new plugin. When you run ./configure --help afterwards, you should see options for building with/without your new plugin.

6. Add a test

You will probably want to add a test to see that your plugin loads successfully. When your plugin is built, the test suite automatically picks up any tests you have in the plugin/plugin_name/tests directory. This is in the same format as general MySQL and Drizzle tests: tests go in a t/ directory, expected results in a r/ directory.

Since we are loading a plugin, we will also need some server options to make sure that plugin is loaded. These are stored in the rather inappropriately named test-master.opt file (that’s the test name with “-master.opt” appended to the end instead of “.test”). For the embedded_innodb plugin_load test, we have a plugin/embedded_innodb/tests/t/plugin_load-master.opt file with the following content:

--plugin_add=embedded_innodb

You can have pretty much anything in the plugin_load.test file… if you’re fancy, you’ll have a SELECT query on data_dictionary.plugins to check that the plugin really is there. Be sure to also add a r/plugin_load.result file. (My preferred method is to just create an empty result file, run the test suite and examine the rejected output before renaming the .reject file to .result.)

Once you’ve added your test, you can run it either by just typing “make test” (which will run the whole test suite), or you can go into the main tests/ directory and run ./test-run.pl --suite=plugin_name (which will just run the tests for your plugin).

7. Check the code in, feel good about self

and you’re done. Well… the start of a Storage Engine plugin is done :)

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Playing with multiple exposure

So, I discovered that my D200 had a built-in “multiple exposure” option. While you can do exactly the same thing in GIMP (or Photoshop I guess) a whole lot easier (for one, you get to see what’s going on), we had been discussing Holga earlier in the night… so I felt it kind of appropriate to not really see what I was doing.

Leah playing guitar hero, me sitting across the room only slightly distracting her with a camera.

Guitar Hero

Maybe I will end up getting a Holga one of these days… being restricted can be fun.

anti-anti-feature: Windows license stickers

Anti-Anti-Feature: An antifeature that doesn’t actually do what it’s meant to (something you didn’t want in the first place)

My laptop came with a Windows Vista license. An anti-feature in itself – I didn’t want it, have never used it (I run Ubuntu and love freedom).

However, if you try and read the license key off this sticker, it’s increasingly difficult to do so. It’s being worn away. Why? Because it’s on the bottom of the laptop and I’m using it on my lap (so friction rubs it away).

Luckily I don’t run Windows Vista, so I won’t need to re-install it any time soon.

on presenting

Dilbert.com

This is totally not confined to at-work presentations.

The number of sessions I have sat through that could have taken 5 minutes instead of 20, 30, 40 or even 60 is amazing. Remember: I have not flown half way around the globe to see you read. I have come to hear a story, to see how conclusions were formed and to interact.

Often, the tools are deficient. PowerPoint encourages bad habits (you can use PowerPoint for excellent slide decks too, but ignore the temptations of boring templates, bad effects and dot lists). The dot point list is more often than not your enemy. I (and anybody else in the audience who has learnt to read) can read your dot points faster than you can say them. While I’m reading, I’m not listening to you. If you announced a cure for all forms of cancer just after having put up a slide filled with dot points… 90% of people would miss it.

Now, dot points are an excellent way to remind you what the heck you’re meant to be talking about (and in what order). Use presenter notes! They are really useful.

If your laptop/presentation software doesn’t support a “presenter” mode that lets you (but not the whole room) view presenter notes, simply write them down, print them out, or anything like that. One simple practice run through will let you do this seamlessly.

The last couple of presentations I did were completely assembled using 280slides.com. An excellent web app for doing presentations. It will import and export ODF (and other formats) so you’re not tied to a (unfortunately) non open source web app. That being said, it ran fine in my browser and unlike OpenOffice.org, did not make me want to stab people repeatedly every time I used it.

So, Stewart’s quick tips:

  • Tell a story. How did you get to your conclusions?
  • Don’t just read. Use visuals to accompany the talk. Visuals aren’t the talk.
  • Practice. Just once or twice through will make things a lot smoother.

Equipment:

  • Make sure your equipment works beforehand. Nobody wants to see you fiddle around with your Windows/OSX laptop only to find out you didn’t bring the dongle or can’t operate the Displays control panel. (Interestingly enough, I see Linux “just work” more than Windows or OSX these days).
  • If there is a microphone, use it. I don’t want to struggle to hear you.
  • If you are constantly using a laser pointer you either have too much on your slides or the slide does not highlight the important information. (laser pointers are useful when people ask questions though)

One blog I love on the subject is Presentation Zen. I’ll also recommend the book, but you can get so much just from the web site.

Some excellent recent presentations:

  • Simplicity Through Optimization, Paul McKenney
    Paul is able to explain RCU clearly and concisely through visuals. You are left with no doubt that this does really work. The visuals are not everything, they assist in the telling of the RCU story.
  • Teach every child about food, Jamie Oliver
    I watched this online. Note how not everything was smooth the whole way. Also note how this was still effective. Passion is an awesome tool. Check out the simple graph showing leading causes of death: simple and effective.
  • Bill Gates on energy: Innovating to Zero!
    Historically, Bill Gates has not been the most engaging speaker. We can all forget the horrible PowerPoint slides with four hundred dot points about some release of something that nobody cared about. This is different. Clear, concise, engaging and simple visuals to make the point.

First roll through the Nikon F80

A little while ago I bit the bullet and bought a Nikon film body – a F80. May as well have a film body that’s a bit automatic and takes the same lens mount as my digital.

So, I got it and thought “hrrm… I better run a roll of film through it to make sure it works”. Off to the fridge I went to find the cheapest, shittiest roll of film possible… I found “Walgreens” brand film. Manufactured by one of many, bought for cheap, and run through the F80.

Some shots turned out pretty good. I have the full set (most of the roll) up on flickr. A few choice ones are:

Which due to some nice accident of lighting, turned out pretty good. IIRC this was pretty late at night and I was editing photos as Michael came over (bringing much needed beer).

Slides and beer, do you need anything else? I just like this because it’s a snapshot of what I was working on (well, kinda, I was mostly just manipulating digital images).

Leah and I went bushwalking… so had to snap a shot of her. I do like the Nikon 50mm as a portrait lens. The film… well… it was cheap, but not too bad actually.

A shallow depth of field can be a lot of fun. Although not entirely sure how I feel about the bokeh….

Which has some odd colours. Nice, but odd.

I like my “new” body. It’ll be fun.

Anti-anti-features: region coding

DVD anti-features are rather well documented. The purpose of “region coding” was to make sure that everybody who ever visited a foreign country and picked up some DVDs while there would get home to find out that they wouldn’t work.

Luckily, those of us who pay good money for DVDs have free software solutions that let us use our paid-for product and don’t force us to download “pirated” copies just so we can view what we paid for.

The region coding in DVDs was designed with the idea that DVD players would always be expensive. You could “change” which region your DVD player was in a set number of times before you could no longer change it.

DVD players can now be bought for $30 (or less). This is what you could pay for a DVD movie. So with economies of scale driving prices down, even if CSS wasn’t completely broken, you can brute force the region coding by just buying 6 DVD players ($180) – less than many of us paid for our first, second or third DVD player.

The same thing will happen with BluRay. You can now get BluRay players for a couple of hundred dollars. One for each of the regions (A, B and C) will cost you less than original BluRay players cost.

So the antifeature of limiting who can watch a DVD/BluRay release is easily broken as player costs come down.

Anti-anti-features: copyright notices

Mako has often talked very well about anti-features: the “features” in software that nobody wants, and that often cost you money to have left out, even though not including the feature is the easier task. Examples include the non-skip parts of DVDs and BluRay Discs (see here for more).

I’d like to coin a new term… anti-anti-features. These are antifeatures (i.e. features you didn’t want in the first place) that don’t actually function properly themselves.

The other day, I sat down with a friend to watch a movie. We had hired out a BluRay of a recently released movie, popped it in the player and attempted to hit “Pause”. Why pause? Well… movies often can auto-play and we wanted to fetch a beer, snack and otherwise prepare for the great movie watching experience.

It turns out you cannot pause the copyright notice. So if you’re trying to be good and understand your obligations under the license under which you have received this disc, you cannot actually finish reading them!

Try it – put in a DVD or BluRay and try to read the copyright notice. I bet you that for a large number of discs you cannot do so in the time allowed.

This just goes to show how utterly useless these “no skip” zones are. You will see hundreds of exactly the same notice (one for each disc you view) many, many times (each time you view it) – one would think that after the first, second, third or even 10th time you’d understand it.

Amazingly, DVD playback software that lets you skip the “no skip” zones (e.g. every DVD player on Linux) also allows you to pause on the copyright notice and read it.

NDB$INFO with SQL hits beta

Bernhard blogged over at http://ocklin.blogspot.com/2010/02/mysql-cluster-711-is-there.html that MySQL Cluster 7.1.1 Beta has been released. The big feature (from my point of view) is the SQL interface on top of NDB$INFO. This means there is now full infrastructure from the NDB data nodes right out to SQL in the MySQL Server for adding monitoring to any bit of the internals of the data nodes.

You have already lost

When the following code introduces a valgrind warning… you are in a world of pain and loss:

=== modified file 'drizzled/field/blob.h'
--- drizzled/field/blob.h	2009-12-21 08:16:13 +0000
+++ drizzled/field/blob.h	2010-01-18 01:36:48 +0000
@@ -32,6 +32,7 @@
  */
 class Field_blob :public Field_str {
 protected:
+  uint32_t assassass;
   uint32_t packlength;
   String value;				// For temporaries
 public: