getting rid of duplicate emails, elegantly

I don’t like duplicate emails in the way that everybody is thinking. This is different.

Due to a bug in offlineimap I hit a little while ago, it managed to make copies (sometimes even two copies) of each email in certain folders. Now, this isn’t so bad, as:

a) email didn’t get lost

b) it’s just using extra disk, and disk is cheap.

but it is annoying when searching.

It’s also annoying because it’s decided to do this on folders such as INBOX/MySQL/bugs, which contains an email for each change to a bug report ever since I joined the company. That adds up to a lot of wasted inodes and disk blocks.

So, I’ve revived this project that I have in the back of my head of efficiently storing email in a database and being able to sync between instances of it.

This gives us some nice advantages: you can use replication to keep a backup of your email, or put it in MySQL Cluster and have high-availability email.

We can also do some neat tricks with tables holding all the info you need to display lists of emails, and probably get a performance boost over having to open each mail as we currently do – current email solutions just don’t scale to a million emails in a folder.

Partitioning will also be useful for making searches quicker (odds are that what we’re searching for is recent, and all sorts of other foo).
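
Roughly what I have in mind for the metadata side is something like this (just a sketch – all the table and column names here are made up, and it assumes the partitioning support coming in 5.1):

CREATE TABLE mail_headers (
  msg_id     BIGINT UNSIGNED NOT NULL,
  folder     VARCHAR(255) NOT NULL,
  message_id VARCHAR(255) NOT NULL,  -- the Message-ID header, handy for spotting duplicates
  sent       DATETIME NOT NULL,
  subject    TEXT,
  from_addr  VARCHAR(255),
  PRIMARY KEY (msg_id, sent)
) PARTITION BY RANGE (YEAR(sent)) (
  PARTITION p2004 VALUES LESS THAN (2005),
  PARTITION p2005 VALUES LESS THAN (2006),
  PARTITION pcurrent VALUES LESS THAN MAXVALUE
);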

Anyway… it’s interesting to see the bunch of errors that get thrown up by the Mail::Box Perl module on some of my Maildirs. Hrrm… I may have to resort to my own more error-tolerant code. I’m determined to write scripts that cannot possibly lose anything.

crash halfway through upgrading Ubuntu breezy to dapper

See BUG#37430 for some details. Also see BUG#37435 for why it gets really painful later on.
Basically, if your machine crashes around the time of the dist-upgrade, you’re totally screwed: mkfs and re-install.

I’d hate to have not made /home a different partition from /.

I currently don’t have much confidence in an easy upgrade for my laptop, considering what I’ve just gone through for my desktop. I’m now downloading the Flight 5 install CD so I can re-install from scratch. Urgh. Not happy, Jan.

My recommendation for upgrading is to take a full backup of everything beforehand and, if anything even remotely goes wrong, restore from the backup.

Naturally, I didn’t do this for the desktop. But hey, the laptop is more critical – I’ll be doing it for that, as I always do.

MySQL Users Conference 2006 – April 24-27, 2006 – Santa Clara, CA

MySQL Users Conference 2006 – April 24-27, 2006 – Santa Clara, CA

I’m presenting “MySQL Cluster: New Features and Enhancements”. It’s going to be a riot. Lots of sexy new stuff to go on about. I’ll also be running a Cluster BoF and am really looking forward to chatting to people about Cluster and what we can do to make it better for our users.

So come along and talk to me while I’m there.

I’ll be sure to use the word “thongs” in the Australian sense as much as possible.

So come along, or I’ll be forced to ask “Where the bloody hell are ya?”

MemberDB – A Membership Database 0.4

MemberDB – A Membership Database

I released 0.4 today – a rather long-awaited release. No doubt there’ll be bugs and the need for a 0.4.1 or something, but this is relatively bug-free and has a bunch of new cool stuff to chew on.

It’s also the first real release to support MySQL 5.0 (previous MySQL releases won’t work, as MemberDB heavily uses views).

Photo out of the hotel room window

DSCN7380-small.JPG

At the recent MySQL DevConf in Sorrento, Italy, I decided to take a photo out the window of my hotel room. Here it is. I’ve stayed at worse places, just :)

It also looked really similar to this from the breakfast and lunch room.

Pity we were then stuck in the basement working for most of the day. But it rocked. Got through lots of stuff, which is good.

adnarim_abroad making me homesick

adnarim_abroad: I know it’s controversial to say, but i’

I now want to be bumming around Melbourne city during the Games. My house is a two-minute walk from a train station that’s less than 30 minutes from the middle of town, but I never spend enough time there.

It’ll be good to get home.

Although I think the tired and hungover thing isn’t helping.

really unstable laptop

I’m currently getting hard crashes about five times a day.

I thought it was the sound driver, as I got a crash during a dist-upgrade (again) while on the console and saw the backtrace. Basically it looked like something bad happened when the sound was muted.

So, running without sound muted – just turned down.

Well, today it just crashed again. Since I’m running X, no backtrace. ARRRGHHH.

It also crashed when waking up, with ACPI stuff in the backtrace.

Not a happy camper at the moment. I have work to do; I shouldn’t be futzing around trying to find out what the fuck is wrong with my laptop (probably software) when I should be running a stable system.

I’ve already had to re-add all my liferea RSS feeds, as liferea (at least the version shipping with Ubuntu) obviously isn’t doing the right thing as regards writing the feeds file to disk.

So, I’m trying to prepare presentations for our DevConf on an incredibly buggy and almost unusable OpenOffice.org on an unstable laptop.

I think I’m going to have wine again with lunch.

In Sorrento (and awake!)

(almost) enough said. Good to see people again. Now just a talk to prepare for tomorrow.

Okay, not prepare it totally – but a bit of it.

The one I’m giving today – on Cluster Replication – is pretty much done. I’d like to run through it beforehand, but I’m not sure how that plan is going to go.

Correcting some incorrect information about Cluster

While googling this morning, I came across: Arguments in favour of PostgreSQL – MoodleDocs. Unfortunately, there are some factual errors there about both MySQL and MySQL Cluster.

For a start, it seems as though the comparison is PostgreSQL versus MyISAM. This isn’t exactly a fair comparison for some loads – especially anything where you want crash safety or transactions.

With InnoDB, however, COMMIT means that the transaction is on disk and durable.
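
(That’s assuming the default innodb_flush_log_at_trx_commit=1, which flushes the log to disk on each commit – if you’ve turned it down to 0 or 2 for speed, you trade away some of that durability.) A quick way to check what you’re running with:

SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';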

So, onto Cluster. The replication between nodes in a cluster is not asynchronous. When a transaction is COMMITTED, it’s in the main memory of more than one data node – i.e. the D for Durability. We are durable across node failure. A failure of the entire cluster can lose you some transactions, but only those since the last Global Checkpoint. A whole-cluster failure should be a very rare event (read: never).

The comparison to PostgreSQL async writes is incorrect, as in that case the transaction is not durable on COMMIT (unlike with MySQL Cluster).

A million tables

Arjen’s MySQL Community Journal – A million tables

$ time ~/mysql_create_table_torture
50234

real 6m11.927s
user 0m2.347s
sys 0m1.578s
(I hit ctrl-c at 50,000 as I did want to get back to real work.)

No sign of slowdown. I assume it would take about 60 minutes on XFS. It seems to be metadata-limited here… the disk is going constantly, not the CPU.

Of course the real benefit with XFS will be sane lookup times.

Delete wasn’t bad – under 2 minutes.

It would also be better on a less abused FS than my laptop’s :)
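
I won’t post the torture program itself, but if you want to play along at home, a quick-and-dirty SQL-only version would be something along these lines (just a sketch, not the actual program – the procedure and table names are made up):

DELIMITER //
CREATE PROCEDURE create_table_torture(IN num_tables INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < num_tables DO
    -- build and run a CREATE TABLE statement for each table
    SET @ddl = CONCAT('CREATE TABLE torture_', i, ' (a INT PRIMARY KEY)');
    PREPARE stmt FROM @ddl;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
    SET i = i + 1;
  END WHILE;
END//
DELIMITER ;

CALL create_table_torture(1000000);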

OpenOffice.org2 frustrates me like paper cuts

Possibly it’s Ubuntu’s fault too, for shipping the not-latest, un-bugfixed release. But either way, it’s really annoying seeing the “Document Recovery” screen more often than the edit text widget.

Copy and paste of slides in Impress is not just flaky – it’s downright crushed to pieces. Occasionally it does something useful – like paste and not crash.

Update: yes, I am just using it to put together presentations for our upcoming DevConf – as well as the Users Conference. Why these things are so hard to do is beyond me. A simple app that doesn’t crash (*cough* MagicPoint *cough*) is looking rather superior at the moment.

Update part 2: yes, the title changed. Arguably I like this one better, although adding “right under the fingernail” is tempting.

So who instead?

Andrew asks So who instead?

The problem is (as I said in my last post) that I don’t know. I don’t think there are any good contenders. At best there’s “okay”.

Hence the sadness in the Australian political arena.

I think there are some good people around – but being good seems to take second place to backstabbing and arguing like 3-year-olds. Question Time is a great thing to watch on TV – it makes you feel mature, even if you just pulled your sister’s hair to make her cry (especially if you don’t have a sister).

Adding an INFORMATION_SCHEMA table

I (relatively) recently added the FILES table to INFORMATION_SCHEMA (check out the patch). This was so that we could provide an SQL interface for checking how much free space there was in NDB disk data files. You can see how to use this in my post entitled INFORMATION_SCHEMA.FILES (querying disk usage from SQL). We’ll have stuff hitting the manual at some point soon.
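
For instance, checking free space per data file looks something like this (a trimmed-down version of the query further down; it assumes you already have disk data tables set up):

SELECT FILE_NAME, FREE_EXTENTS * EXTENT_SIZE AS free_bytes
FROM INFORMATION_SCHEMA.FILES
WHERE ENGINE = 'ndbcluster' AND FILE_TYPE = 'DATAFILE';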

So how is an INFORMATION_SCHEMA table implemented in MySQL? Well, the first thing to realise is that INFORMATION_SCHEMA (herein referred to as I_S) tables aren’t “real” tables – they are not created by the user (in fact, the user shouldn’t – or can’t – do DDL on the I_S database). Instead, the I_S database contains tables generated on the fly by the MySQL Server. Internally, we create MEMORY tables, convert them to MyISAM if they get too big, and then run queries on them.

So, into the code. Bring up sql_show.cc and have a look. Specifically, scroll down to ST_SCHEMA_TABLE schema_tables[] and see that it’s the same as what SHOW TABLES gets you on the I_S database.
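
That is, the entries in schema_tables[] are what you end up seeing from:

SHOW TABLES FROM INFORMATION_SCHEMA;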

For adding the FILES table, you can see from my patch that we just add an item to this structure (the hunk is below):

@@ -4655,6 +4726,8 @@
    fill_schema_column_privileges, 0, 0, -1, -1, 0},
   {"ENGINES", engines_fields_info, create_schema_table,
    fill_schema_engines, make_old_format, 0, -1, -1, 0},
+  {"FILES", files_fields_info, create_schema_table,
+   fill_schema_files, 0, 0, -1, -1, 0},
   {"KEY_COLUMN_USAGE", key_column_usage_fields_info, create_schema_table,
    get_all_tables, 0, get_schema_key_column_usage_record, 4, 5, 0},
   {"OPEN_TABLES", open_tables_fields_info, create_schema_table,

As the comment there indicates (oh my, a comment that’s correct!), you also need to add an item to the enum in sql/table.h for the FILES table.

We’ll need to create the files_fields_info structure. This, as you guessed, describes the fields in the table. Above the schema_tables[] definition in sql_show.cc you can see the definitions for the existing tables. Our (rather long) definition for FILES is:

ST_FIELD_INFO files_fields_info[]=
{
{"FILE_ID", 4, MYSQL_TYPE_LONG, 0, 0, 0},
{"FILE_NAME", NAME_LEN, MYSQL_TYPE_STRING, 0, 0, 0},
{"FILE_TYPE", 20, MYSQL_TYPE_STRING, 0, 0, 0},
{"TABLESPACE_NAME", NAME_LEN, MYSQL_TYPE_STRING, 0, 0, 0},
{"TABLE_CATALOG", NAME_LEN, MYSQL_TYPE_STRING, 0, 0, 0},
{"TABLE_SCHEMA", NAME_LEN, MYSQL_TYPE_STRING, 0, 0, 0},
{"TABLE_NAME", NAME_LEN, MYSQL_TYPE_STRING, 0, 0, 0},
{"LOGFILE_GROUP_NAME", NAME_LEN, MYSQL_TYPE_STRING, 0, 0, 0},
{"LOGFILE_GROUP_NUMBER", 4, MYSQL_TYPE_LONG, 0, 0, 0},
{"ENGINE", NAME_LEN, MYSQL_TYPE_STRING, 0, 0, 0},
{"FULLTEXT_KEYS", NAME_LEN, MYSQL_TYPE_STRING, 0, 0, 0},
{"DELETED_ROWS", 4, MYSQL_TYPE_LONG, 0, 0, 0},
{"UPDATE_COUNT", 4, MYSQL_TYPE_LONG, 0, 0, 0},
{"FREE_EXTENTS", 4, MYSQL_TYPE_LONG, 0, 0, 0},
{"TOTAL_EXTENTS", 4, MYSQL_TYPE_LONG, 0, 0, 0},
{"EXTENT_SIZE", 4, MYSQL_TYPE_LONG, 0, 0, 0},
{"INITIAL_SIZE", 8, MYSQL_TYPE_LONGLONG, 0, 0, 0},
{"MAXIMUM_SIZE", 8, MYSQL_TYPE_LONGLONG, 0, 0, 0},
{"AUTOEXTEND_SIZE", 8, MYSQL_TYPE_LONGLONG, 0, 0, 0},
{"CREATION_TIME", 0, MYSQL_TYPE_TIMESTAMP, 0, 0, 0},
{"LAST_UPDATE_TIME", 0, MYSQL_TYPE_TIMESTAMP, 0, 0, 0},
{"LAST_ACCESS_TIME", 0, MYSQL_TYPE_TIMESTAMP, 0, 0, 0},
{"RECOVER_TIME", 4, MYSQL_TYPE_LONG, 0, 0, 0},
{"TRANSACTION_COUNTER", 4, MYSQL_TYPE_LONG, 0, 0, 0},
{"VERSION", 21 , MYSQL_TYPE_LONG, 0, 1, "Version"},
{"ROW_FORMAT", 10, MYSQL_TYPE_STRING, 0, 1, "Row_format"},
{"TABLE_ROWS", 21 , MYSQL_TYPE_LONG, 0, 1, "Rows"},
{"AVG_ROW_LENGTH", 21 , MYSQL_TYPE_LONG, 0, 1, "Avg_row_length"},
{"DATA_LENGTH", 21 , MYSQL_TYPE_LONG, 0, 1, "Data_length"},
{"MAX_DATA_LENGTH", 21 , MYSQL_TYPE_LONG, 0, 1, "Max_data_length"},
{"INDEX_LENGTH", 21 , MYSQL_TYPE_LONG, 0, 1, "Index_length"},
{"DATA_FREE", 21 , MYSQL_TYPE_LONG, 0, 1, "Data_free"},
{"CREATE_TIME", 0, MYSQL_TYPE_TIMESTAMP, 0, 1, "Create_time"},
{"UPDATE_TIME", 0, MYSQL_TYPE_TIMESTAMP, 0, 1, "Update_time"},
{"CHECK_TIME", 0, MYSQL_TYPE_TIMESTAMP, 0, 1, "Check_time"},
{"CHECKSUM", 21 , MYSQL_TYPE_LONG, 0, 1, "Checksum"},
{"STATUS", 20, MYSQL_TYPE_STRING, 0, 0, 0},
{"EXTRA", 255, MYSQL_TYPE_STRING, 0, 0, 0},
{0, 0, MYSQL_TYPE_STRING, 0, 0, 0}
};

So now we’ve added a table to I_S and defined the table. The universal create_schema_table will take care of creating the table for us – we just need to write the function to fill the table now!

If we want to test where we’re up to now, an empty function will do. For example, you could add something like this to sql_show.cc:

int fill_schema_files(THD *thd, TABLE_LIST *tables, COND *cond)
{
  return 0;
}

Now you can build, run and test, but the real function is a lot more interesting. Since each storage engine can have its own files, we create a handlerton (the data structure for a table handler – see handler.h) call for the handler to fill out its bit of the FILES table.

struct run_hton_fill_schema_files_args
{
  TABLE_LIST *tables;
  COND *cond;
};

static my_bool run_hton_fill_schema_files(THD *thd, st_plugin_int *plugin,
                                          void *arg)
{
  struct run_hton_fill_schema_files_args *args=
    (run_hton_fill_schema_files_args *) arg;
  handlerton *hton= (handlerton *) plugin->plugin->info;
  if (hton->fill_files_table)
    hton->fill_files_table(thd, args->tables, args->cond);
  return false;
}

int fill_schema_files(THD *thd, TABLE_LIST *tables, COND *cond)
{
  int i;
  TABLE *table= tables->table;
  DBUG_ENTER("fill_schema_files");

  struct run_hton_fill_schema_files_args args;
  args.tables= tables;
  args.cond= cond;

  plugin_foreach(thd, run_hton_fill_schema_files,
                 MYSQL_STORAGE_ENGINE_PLUGIN, &args);

  DBUG_RETURN(0);
}

So, the fill_schema_files() function is what the I_S code calls. Since in MySQL 5.1 we have a much improved plugin infrastructure, we use the plugin_foreach function which iterates over the loaded storage engines in a safe way. For each storage engine, we end up calling run_hton_fill_schema_files with the args structure as a parameter. As you can see, if the handlerton has a fill_files_table function, we call it.

Now that we’re here, we need to implement fill_files_table for some table handlers. For storage engines where we’re not going to implement it, we need to make sure it’s set to NULL (there’s a bunch of this going on in the aforementioned patch).

Since for the NDB table handler (ha_ndbcluster.cc) we have a lovely sane init function (ndbcluster_init), it’s much easier to support new handlerton calls:

h.fill_files_table= ndbcluster_fill_files_table;

That’s all that’s needed (apart from the implementation of course :)

So, what does the implementation look like? It’s easy to see how it works from the source, so I won’t go into details here. Basically, we fill out a row using data retrieved via the NDBAPI and call schema_table_store_record(). How you construct the function for your storage engine may be quite different – enjoy!

One useful query that jeb is using in some of the tests is:

SELECT FILE_NAME, (TOTAL_EXTENTS * EXTENT_SIZE) AS 'Total Extent Size',
       (FREE_EXTENTS * EXTENT_SIZE) AS 'Total Free In Bytes',
       (((FREE_EXTENTS * EXTENT_SIZE) * 100) / (TOTAL_EXTENTS * EXTENT_SIZE))
         AS '% Free Space'
FROM INFORMATION_SCHEMA.FILES
WHERE ENGINE = 'ndbcluster' AND FILE_TYPE = 'DATAFILE';

You may want to CREATE VIEW blah AS … for this, as it could be a useful query to run regularly.
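
Something like the following would do the trick (the view name and column aliases here are just made up for the example):

CREATE VIEW ndb_disk_free AS
  SELECT FILE_NAME,
         TOTAL_EXTENTS * EXTENT_SIZE AS total_bytes,
         FREE_EXTENTS * EXTENT_SIZE AS free_bytes,
         ((FREE_EXTENTS * EXTENT_SIZE) * 100) / (TOTAL_EXTENTS * EXTENT_SIZE) AS pct_free
  FROM INFORMATION_SCHEMA.FILES
  WHERE ENGINE = 'ndbcluster' AND FILE_TYPE = 'DATAFILE';

Then a plain SELECT * FROM ndb_disk_free; gives you the disk usage whenever you want it.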