nocache LD_PRELOAD

Want to do something like “cp big_file copy_of_big_file” or “tar xfz big_tarball.tar.gz” but without thrashing your cache?

Enrico Zini has a nice little LD_PRELOAD called nocache.

$ nocache tar xfz foo.tar.gz

Goes well with libeatmydata. A pair of tools for compensating for your Operating System casually hating you.

I imagine people will love this when taking database backups.

Using the row buffer in Drizzle (and MySQL)

Here’s another bit of the API you may need to use in your storage engine (it also seems to be a rather unknown. I believe the only place where this has really been documented is ha_ndbcluster.cc, so here goes….

Drizzle (through inheritance from MySQL) has its own (in memory) row format (it could be said that it has several, but we’ll ignore that for the moment for sanity). This is used inside the server for a number of things. When writing a Storage Engine all you really need to know is that you’re expected to write these into your engine and return them from your engine.

The row buffer format itself is kind-of documented (in that it’s mentioned in the MySQL Internals documentation) but everywhere that’s ever pointed to makes the (big) assumption that you’re going to be implementing an engine that just uses a more compact variant of the in-memory row format. The notable exception is the CSV engine, which only ever cares about textual representations of data (calling val_str() on a Field is pretty simple).

The basic layout is a NULL bitmap plus the data for each non-null column:

Except that the NULL bitmap is byte aligned. So in the above diagram, with four nullable columns, it would actually be padded out to 1 byte:

Each column is stored in a type-specific way.

Each Table (an instance of an open table which a Cursor is used to iterate over parts of) has two row buffers in it: record[0] and record[1]. For the most part, the Cursor implementation for your Storage Engine only ever has to deal with record[0]. However, sometimes you may be asked to read a row into record[1], so your engine must deal with that too.

A Row (no, there’s no object for that… you just get a pointer to somewhere in memory) is made up of Fields (as in Field objects). It’s really made up of lots of things, but if you’re dealing with the row format, a row is made up of fields. The Field objects let you get the value out of a row in a number of ways. For an integer column, you can call Field::val_int() to get the value as an integer, or you can call val_str() to get it as a string (this is what the CSV engine does, just calls val_str() on each Field).

The Field objects are not part of a row in any way. They instead have a pointer to record[0] stored in them. This doesn’t help you if you need to access record[1] (because that can be passed into your Cursor methods). Although the buffer passed into various Cursor methods is usually record[0] it is not always record[0]. How do you use the Field objects to access fields in the row buffer then? The answer is the Field::move_field_offset(ptrdiff_t) method. Here is how you can use it in your code:

ptrdiff_t row_offset= buf - table->record[0];
(**field).move_field_offset(row_offset);
(do things with field)
(**field).move_field_offset(-row_offset);

Yes, this API completely sucks and is very easy to misuse and abuse – especially in error handling cases. We’re currently discussing some alternatives for Drizzle.

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Exporting a set of bzr revisions as a quilt series

There has to be a better way than this… but it does work (at least for revisions 11 through 141):

for rev in `seq 11 141`;
do
if [ -z "`bzr diff -r\`expr $rev - 1\`..$rev|diffstat -p0 -l|grep ^tests`" ];
then
(bzr log -r$rev --forward --log-format=long
| sed -e 's/^  //;
/^------------------------------------------------------------/d;
/^revno:.*$/d; /^committer:.*/d; /^branch nick:/d;
/^timestamp: /d; /^message:/d';
echo;
echo;
bzr diff -r`expr $rev - 1`..$rev --prefix a/storage/innodb_plugin/:b/storage/innodb_plugin/) > patches/$rev.patch ;
echo $rev.patch >> patches/series;
fi;
done

Developing my own film

dedicated bench, originally uploaded by macplusg3.

This is from the first film I’ve ever developed myself. I know a lot of people who’ve done this in school or something, but I never did.. so it’s just me, teaching myself (and playing with chemicals).

This was shot one day when I went out riding down in Black Rock (not too far from home). There’s something about benches dedicated to people that just twinges something in my brain… How do you get to the point where you think a great way to remember someone is to have a plaque on a bench? Carrying a camera while bike riding is quite useful sometimes.

Shot on Lucky B&W SHD100 film on at early 1970s Canon rangefinder.

desktop-couch has been nothing but suck

$ du -sh /home/stewart/.cache/desktop-couch/desktop-couchdb.*
746M	/home/stewart/.cache/desktop-couch/desktop-couchdb.log
4.0K	/home/stewart/.cache/desktop-couch/desktop-couchdb.pid
16K	/home/stewart/.cache/desktop-couch/desktop-couchdb.stderr
653M	/home/stewart/.cache/desktop-couch/desktop-couchdb.stdout

$ du -sh /home/stewart/.local/share/desktop-couch/.gwibber_messages_design/2f3267703246f5e02533e59714915b7d.view 
436M	/home/stewart/.local/share/desktop-couch/.gwibber_messages_design/2f3267703246f5e02533e59714915b7d.view

I feel better already. I think the log files irritate me the most.

HailDB, Hudson, compiler warnings and cppcheck

I’ve integrated HailDB into our Hudson setup (haildb-trunk on Hudson). I’ve also made sure that Hudson is tracking the compiler warnings. We’ve enabled more compiler warnings than InnoDB has traditionally been compiled with – this means we’ve started off with over 4,300 compiler warnings! Most of those are not going to be anything remotely harmful – however, we often find that it’s 1 in 1000 that is a real bug. I’ve managed to get it down to about 1,700 at the moment (removing a lot of harmless ones).

I’ve also enabled a cppcheck run on it. Cppcheck is a static analysis tool for C/C++. We’ve also enabled it for Drizzle (see drizzle-build-cppcheck on Hudson). When we enabled it for Drizzle, we immediately found three real bugs! There is also a coding style checker which we’ve also enabled on both projects. So far, cppcheck has not found any real bugs in HailDB, just some style warnings.

So, I encourage you to try cppcheck if you’re writing C/C++.

The rotating blades database benchmark

(and before you ask, yes “rotating blades” comes from “become a fan”)

I’m forming the ideas here first and then we can go and implement it. Feedback is much appreciated.

Two tables.

Table one looks like this:

CREATE TABLE fan_of (
user_id BIGINT,
item_id BIGINT,
PRIMARY KEY (user_id, item_id),
INDEX (item_id)
);

That is, two columns, both 64bit integers. The primary key covers both columns (a user cannot be a fan of something more than once) and can be used to look up all things the user is a fan of. There is also an index over item_id so that you can find out which users are a fan of an item.

The second table looks like this:

CREATE TABLE fan_count (
item_id BIGINT PRIMARY KEY,
fans BIGINT
);

Both tables start empty.

You will have 1000, 2000,4000 and 8000 concurrent clients attempting to run the queries. These concurrent clients must behave as if they could be coming from a web server. The spirit of the benchmark is to have 8000 threads (or processes) talk to the database server independent of each other.

The following set of queries will be run a total of 23,000,000 (twenty three million) times. The my_user_id below is an incrementing ID per connection allocated by partitioning 23,000,000 evenly between all the concurrent clients (e.g. for 1000 connections each connection gets 23,000 sequential ids)

You must run the following queries.

  • How many fans are there of item 12345678 (e.g. SELECT fans FROM fan_count WHERE item_id=12345678)
  • Is my_user_id already a fan of item 12345678 (e.g. SELECT user_id FROM fan_of WHERE user_id=my_user_id AND item_id=12345678)
  • The next two queries MUST be in the same transaction:
    • my_user_id becomes a fan of item 12345678 (e.g. INSERT INTO fans (user_id,item_id) values (my_user_id, 12345678))
    • increment count of fans (e.g. UPDATE fan_count SET fans=fans+1 WHERE item_id=12345678)

For the first query you are allowed to use a caching layer (such as memcached) but the expiry time must be 5 seconds or less.

You do not have to use SQL. You must however obey the transaction boundary above. The insert and the update must be part of the same transaction.

Results should include: min, avg, max response time for each query as well as the total time to execute the benchmark.

Data must be durable to a machine being switched off and must still be available with that machine switched off. If committing to local disk, you must also replicate to another machine. If running asynchronous replication, the clock does not stop until all changes have been applied on the slave. If doing asynchronous replication, you must also record the replication delay throughout the entire test.

In the event of timeout or deadlock in doing the insert and update part, you must go back to the first query (how many fans) and retry. Having to retry does not count towards the 23,000,000 runs.

At the end of the benchmark, the query SELECT fans FROM fan_count WHERE item_id=12345678 should return 23,000,000.

Yes, this is a very evil benchmark. It seems to be a bit indicative about the kind of peak load that can be experienced by a bunch of Web 2.0 sites that have a “like” or “become a fan” style buttons. I fully expect the following:

  • Pretty much all systems will nosedive in performance after 1000 concurrent clients
  • Transaction rollbacks due to deadlock detection or lock wait timeouts will be a lot.
  • Many existing systems and setups not complete it in reasonable time.
  • A solution using Scale Stack to be an early winner (backed by MySQL or Drizzle)
  • Somebody influenced by Domas turning InnoDB deadlock detection off very quickly.
  • Somebody to call this benchmark “stupid” (that person will have a system that fails dismally at this benchmark)
  • Somebody who actually has any knowledge of modern large scale web apps to suggest improvements
  • Nobody even attempting to benchmark the Oracle database
  • Somebody submitting results with MySQL to not wait until the replication stream has finished applying.
  • Some NoSQL systems to suck considerably more than their SQL counterparts.

Storage Engine API: write_row, CREATE SELECT and DDL

(this probably applies exactly the same for MySQL and Drizzle… but I’m just speaking about current Drizzle here)

In my current merge request for the embedded-innodb-create-select-transaction-arrgh branch (also see this specific revision), you’ll notice an odd hoop that we have to jump through to make CREATE SELECT statements work with an engine such as InnoDB.

Basically, this is what happens:

  • start transaction
  • start executing SELECT QUERY (well, prepare executing it and fetch a row)
  • create table
  • attempt to insert into table

But… we have to do the DDL statement (i.e. the CREATE TABLE) in its own transaction. This means that the outer transaction (running the SELECT) shouldn’t be able to see it. Except it does. We can create a cursor on this table. However, when we try and do something with it (e.g. ib_cursor_first()) we then get the error message DB_MISSING_HISTORY from InnoDB. With a data dictionary that was REPEATABLE READ, we shouldn’t have this problem. However, we don’t have that.

So? What do we do? If we’re in ::write_row and we get an error and we’re running a SQLCOM_CREATE_TABLE sql_command (yes, we get to poke into current_session->lex->sql_command to find this out) we just magically restart the transaction so that we can (properly) see the created table and write rows to it.

This is not a sane part of the interface; it won’t be an issue for many engines but it is needed here.

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Interesting Videos from the MySQL Conference and Expo

There’s a good number of videos appearing online from the MySQL Conference and Expo that was on last week.

Here’s a short list of interesting things to look at if you weren’t able to make the sessions. Obviously, this is from my view as a Drizzle developer. There were other interesting things, but this list is more focused towards where my Drizzle brain is stimulated.

Announcing HailDB

I just announced our continuation of the Embedded InnoDB project under the name of HailDB. Check out the announcement over at http://www.haildb.com/.

HailDB is a relational database that is embeddable within applications. You embed HailDB by linking to a shared library and calling a clean and simple API. HailDB is a continuation of the Embedded InnoDB project. It is not itself a database server, but is a library implementing the storage layer. With the addition of the HailDB plugin to Drizzle you get a full SQL interface.

Read more at http://www.haildb.com

Embedded InnoDB is in the tree!

Well… the start of it :)

I’ve taken the approach of taking tiny incremental steps (and getting review for each step) in implementing a Storage Engine based on the Embedded InnoDB library. What hit lp:drizzle (the trunk branch, for the 2010-04-07 milestone tarball) is only a handful of these small steps, so this engine is not remotely ready for end users.

There should be more of my Embedded InnoDB work hitting the tree in the upcoming days/weeks, enough to get it to a satte that one could describe as functional :)

AlsoSQL

So there’s a bit of a swelling around the idea of NoSQL. That is, databases that don’t have an SQL interface in front of them – with the promise of better performance. With a well designed backend, this is no doubt the case.

A flexible query language is rather useful though. I think we’ll see the rise of AlsoSQL. That is systems that present a fast and simple protocol along with a SQL interface.

This hybrid system has seen use for many years. MySQL Cluster is one such example. SQL through MySQL Server, NoSQL through NDB API.

With Drizzle, I feel we’ll be in a pretty good position to offer non-sql based protocols and access methods to existing storage engines.

The Drizzle (and MySQL) Key tuple format

Here’s something that’s not really documented anywhere (unless you count ha_innodb.cc as a source of server documentation). You may have some idea about the MySQL/Drizzle row buffer format. This is passed around the storage engine interface: in for write_row and update_row and out for the various scan and index read methods.

If you want to see the docs for it that exist in the code, check out store_key_val_for_row in ha_innodb.cc.

However, there is another format that is passed to your engine (and that your engine is expected to understand) and for lack of a better name, I’m going to call it the key tuple format. The first place you’ll probably see this is when implementing the index_read function for a Cursor (or handler in MySQL speak).

You get two things: a pointer to the buffer and the length of the buffer. Since a key can be made up of multiple parts, some of which can be NULL and some of which can be of variable length, this buffer is not (usually) a simple value. If you are starting out in your engine development, you can use this buffer blindly as a single value for non-nullable indexes with only 1 column.

The basic format is this:

  • The buffer is in-order of the index. First column in the index is first in the buffer, second second etc.
  • The buffer must be zero-filled. The server kernel will use memcmp to compare two key values.
  • If the column is NULLable, then the first byte is set to 1 if the column is null. Else, 0 means not-null.
  • From ha_innodb.cc (for BLOBs, which I haven’t put in embedded_innodb yet): If the column is of a BLOB type (it must be a column prefix field in this case), then we put the length of the data in the field to the next 2 bytes, in the little-endian format. If the field is SQL NULL, then these 2 bytes are set to 0. Note that the length of data in the field is <= column prefix length.
  • For fixed length fields (such as int), the next max field length bytes are for that field.
  • For VARCHAR, there is always a 2 byte (in little endian) length. This is different to the row format, which may have 1 or 2 bytes. In the key tuple format it is ALWAYS two bytes.

I’ll discuss the use of this for rnd_pos() and position() in a later post…

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).

Storing the table message in Embedded InnoDB

One of the exciting things[1] about working on a storage engine in Drizzle is that you get to manage your own metadata. When the database engine you’re writing the storage engine interface for has a pretty complete data dictionary (e.g. Embedded InnoDB) you could just directly use it. At some point I plan to do this for the embedded_innodb engine for Drizzle so that you could just point Drizzle at an existing Embedded InnoDB database and run SQL queries on it.

The Drizzle table message does have some things in it that aren’t in the InnoDB data dictionary though (e.g. table and column comments). We want to preserve these (and also things like there may be several data types in Drizzle that map to the same data type in InnoDB). Since the Embedded InnoDB API allows us to do things within the DDL transaction (such as insert a row into a table), we store the serialized table message in a table as part of the DDL transaction. This means we can have fully crash safe DDL! There is no way the table definition can get out of sync with what is in InnoDB; we are manipulating them both in the same transaction!

The table structure we’re using is pretty simple. There is two columns: table_name VARCHAR(IB_MAX_TABLE_NAME_LEN) and message BLOB.

The operations we need are:

  • store the table message in doCreateTable (INSERT)
  • rename the table message in doRenameTable (UPDATE the table_name column)
  • delete the table message in doDropTable (DELETE)
  • list tables in a database (SELECT with prefix)
  • get table message (SELECT using key lookup)

All of which are pretty easy to implement using the Embedded InnoDB API.

[1] Maybe I need to get out more….

Finding Ada

Ada Lovelace Day is an international day of blogging to celebrate the achievements of women in technology and science.

- http://findingada.com/

This is something I had wanted to do last year… and I’m finding I have the same problem this year. My idea was to write about someone who has had an influence on me. The problem is picking one person to write about. Throughout my life there have been many women in technology who have influenced me. I started going through people in my head… and got to a very long list rather quickly.

So, instead, I shall write about the future.

To the future Ada, who will think this whole exercise of picking a woman in technology to write about as absurd as if we, today, picked a woman who votes to write about.

on TableIdentifier (and the death of path as a parameter to StorageEngines)

As anybody who has ever implemented a Storage Engine for MySQL will know, a bunch of the DDL calls got passed a parameter named “path”. This was a filesystem path. Depending on what platform you were running, it may contain / or \ (and no, it’s not consistent on each platform). Add to that the difference if you were creating temporary tables (table name of #sql_somethingsomething) and the difference if you were one of the two (built in) engines that were able to be used for creating internal temporary tables (temp tables that are created during query execution that do not belong in a schema). Well… you had a bit of a mess.

My earlier attempts involved splitting everything up into two strings: schema name and table name. This ended badly. The final architecture we decided on was to have an object passed around that would deal with various transformations (from what the user entered to what we can store on file systems, or to what temporary table maps to what unique name). This is TableIdentifier.

Brian has been introducing it around the code for a while now, and we just got it to now most of the places where table names are passed to Storage Engines. This means that if you’re writing a Storage Engine that doesn’t just blindly store things in files, you can sensibly use the getSchemaName() and getTableName() methods to call your API.

One last bit of evil….

You can store things for later!
drizzle> select libtcc("#include <string.h>\n#include <stdlib.h>\nint foo(char* s) { char *a= malloc(1000); return snprintf(s, 100, \"%p\", a); }") as RESULT;
+-----------+
| RESULT    |
+-----------+
| 0x199c610 |
+-----------+
1 row in set (0 sec)
drizzle> select libtcc("#include <string.h>\n#include <stdlib.h>\nint foo(char* s) { char *a= 0x199c610; strcpy(a, \"Hello World!\"); strcpy(s,\"done\"); return strlen(s); }") as result;
+--------+
| result |
+--------+
| done   |
+--------+
1 row in set (0.01 sec)
drizzle> select libtcc("#include <string.h>\n#include <stdlib.h>\nint foo(char* s) { char *a= 0x199c610; strcpy(s, a); return strlen(s); }") as result;
+--------------+
| result       |
+--------------+
| Hello World! |
+--------------+
1 row in set (0.01 sec)
And then… i can disconnect, reconnect, or whatever (as for any of the above really) before cleaning up my memory:
drizzle> select libtcc("#include <string.h>\n#include <stdlib.h>\nint foo(char* s) { char *a= 0x19a9bc0; free(a); strcpy(s,\"done\"); return strlen(s); }") as result;
+--------+
| result |
+--------+
| done   |
+--------+
1 row in set (0 sec)