Data Storage miniconf for linux.conf.au 2011

Today I’ll be running the Data Storage Miniconf at linux.conf.au 2011. See the Tuesday Schedule on the LCA Website for the up-to-date schedule for today (the one in the badges is probably out of date).

We’ve got some great talks today, so be sure to catch them. There’s also plenty of opportunity and time for discussion.

Monday linux.conf.au 2011 plan

It’s currently my plan to really try and make it to the following sessions:

The middle of the day will probably become “Stewart goes and panics over talks” kinda time.

Should be an awesome day.

Data Storage miniconf Lightning Talk CFP

Going to linux.conf.au ?

Use storage, have tales?

Admin a storage system, have stories?

Hack on a storage system, have software to promote?

We want your Lightning Talk!

Databases, file systems, cloud storage, network storage, my-insane-mythtv-storage all welcome!

Send me email if you’d like to present (stewart at flamingspork dot com).

Tuesday, from 4:15pm at linux.conf.au

No implicit commit (on the road to transactional DDL)

A long time ago, in a time that can only serve to make some feel old and others older, MySQL didn’t support transactions. Each statement was executed as it went; there was no ROLLBACK (or COMMIT, or crash recovery, etc.). Then there were transactions. Other RDBMSs implement auto_commit functionality too, but for MySQL users, we think of it as the magic compatibility mode that (mostly) makes applications written for MyISAM magically work on InnoDB (okay, and making “you should use transactions” a really easy consulting gig :)

I’m currently working on finishing up a patch that removes the implicit COMMIT from DDL operations in Drizzle. Instead, you get an error message saying that transactional DDL is not currently supported. I see a future where we have one of two situations (possibly depending on the storage engine): either DDL is supported within normal transactions, or there are DDL-only transactions (which cannot be mixed with DML). The latter (DDL-only transactions) I see as the option for InnoDB/HailDB.

Is your Storage Engine buggy or the database server?

If your storage engine returns an error from rnd_init (or doStartTableScan, as it’s named in Drizzle) and does not save this error and return it in any subsequent calls to rnd_next, your engine is buggy. Namely, it is buggy in that a) an error may not be reported back to the user, and b) everything may explode horribly when rnd_next is called after rnd_init has returned an error.

Unless it is running on MariaDB 5.2 or (soon, when the patch hits the tree) Drizzle.
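The engine-side workaround for a) and b) is tiny: save the error from starting the scan and keep handing it back. A rough sketch of the pattern (ExampleCursor and the *_internal() helpers are made up, not Drizzle’s actual cursor code):

```cpp
// Sketch only: the class and the *_internal() helpers are hypothetical,
// but the save-the-error pattern is the one described above.
class ExampleCursor
{
  int scan_error;   // error saved from starting the table scan

  // placeholders for whatever the engine really does
  int open_scan_internal() { return 0; }
  int read_next_row_internal(unsigned char *) { return 0; }

public:
  ExampleCursor() : scan_error(0) {}

  int doStartTableScan(bool)
  {
    scan_error= open_scan_internal();
    return scan_error;                  // report the failure here...
  }

  int rnd_next(unsigned char *buf)
  {
    if (scan_error != 0)
      return scan_error;                // ...and keep reporting it here,
                                        // instead of exploding horribly
    return read_next_row_internal(buf);
  }
};
```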

Monty (Widenius, not Taylor) wrote a patch for MariaDB, based on my bug report, that addresses that problem. It uses the compiler feature that warns when the result of a function isn’t checked to make sure that every place that calls rnd_init checks for an error from the engine.
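As far as I can tell from the description, the compiler feature in question is GCC’s warn_unused_result function attribute. A minimal illustration (rnd_init_example is a made-up stand-in, not the real handler method):

```cpp
// Ignoring the return value of a function marked warn_unused_result is a
// warning, and with -Werror it becomes a build error.
int rnd_init_example(bool scan) __attribute__((warn_unused_result));

int rnd_init_example(bool scan)
{
  (void) scan;
  return 0;
}

void caller()
{
  rnd_init_example(true);               // warning: ignoring return value

  int error= rnd_init_example(true);    // fine: the result is checked
  if (error != 0)
  {
    /* report the error instead of blindly calling rnd_next() */
  }
}
```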

Today I (finally) pulled that into Drizzle as well.

So… if your engine does the logical thing and goes “oh look, this method returns an error… I’ll return my error”, it will exhibit bugs in MySQL, but not in MariaDB 5.2 or Drizzle (once the patch hits).

Which is buggy, the server or the engine?

The MySQL bug number is 54166, filed in June 2010.

Making B&W Prints

Hong Kong street

I’m getting better at making prints, and starting to understand how all the bits fit together properly. I’m finding myself disappointed that I’ve shot colour sometimes :)

The light-sealing of the darkroom (also known as laundry (also known as brewery)) is not exactly pretty… but it does work:

MySQL 5.5 is GA and 5.5.8 missing from launchpad…

While it’s great that MySQL 5.5 is GA with the 5.5.8 release (you can download it here), I’m rather disappointed that the bzr repositories on Launchpad aren’t being kept up to date. At the time of writing, it looked like this:

Yep – nothing for five weeks in the 5.5 repo – nothing since the 5.5.7 release :(

It’s not that there have been no changes, either – the changelog has a decent number of fixes.

Persistent index statistics for InnoDB

In browsing the BZR tree for lp:mysql-server, I noticed some rather exciting code had been merged into the Innobase code.

You may be aware that InnoDB will do some index dives when opening a table to get some statistics about the indexes that can help the optimiser make good query plans.

The problem is that this means many disk seeks: on server restart, you have to spend a whole bunch of time seeking around the disk reading index pages.

Not any more.

There is now code merged in to store the calculated statistics in a table inside InnoDB so that these index dives don’t have to happen on startup.
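I haven’t dug through the algorithm properly yet, but the overall shape of the change is easy to sketch (the function names below are hypothetical stand-ins, not what the Innobase code actually calls them):

```cpp
// Hypothetical sketch of the idea only; load_saved_stats(),
// sample_index_by_diving() and save_stats() are stand-ins for the real code.
struct IndexStats
{
  unsigned long long distinct_keys;
  unsigned long long leaf_pages;
};

bool load_saved_stats(const char *index_name, IndexStats *out);    // read the stats table
IndexStats sample_index_by_diving(const char *index_name);         // the expensive seeks
void save_stats(const char *index_name, const IndexStats &stats);  // write the stats table

IndexStats stats_for_open_table(const char *index_name)
{
  IndexStats stats;

  /* If statistics were persisted earlier, use them: no index dives just
     because the server was restarted. */
  if (load_saved_stats(index_name, &stats))
    return stats;

  /* Otherwise fall back to the old behaviour: dive into the index, then
     store the result so the next open (or restart) is cheap. */
  stats= sample_index_by_diving(index_name);
  save_stats(index_name, stats);
  return stats;
}
```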

Originally, this looked like it was going to make it into InnoDB+. The good news is that it’s now in a public source tree. I look forward to when it hits a stable release.

(hopefully somebody can beat me to it and write a nice description of the algorithms involved… the code is pretty easy to follow, so it shouldn’t be hard)

Replication log inside InnoDB

The MySQL replication system has always had the replication log (“binlog”) as a separate set of files on disk. Originally, this really didn’t matter as, well, MyISAM wasn’t transactional or crash safe so the binlog didn’t need to be either. If you crashed on a busy write workload, your replication was just going to be hosed anyway.

So then came a time where everybody used InnoDB. Transactional, crash-safe and all the goodies. Then, a bit later, came storing the master replication log position in the InnoDB log, and XA between InnoDB and the binlog. So a rather long time after MySQL first had replication, you could pull the power cord on the master with a decent amount of certainty that things would be okay when you turned it on again.

I am, of course, totally ignoring the slave state and if it’s safe to do that on slaves.

Using XA to keep the binlog and InnoDB consistent does have a cost. That cost is fsync()s: two-phase commit means you have to do a lot more of them.

As you may be aware, at a (much) earlier point in Drizzle we completely ripped out the replication code. Why? A lot of it was very much still geared to support statement based replication – something we certainly didn’t want to support. We also did not really want to keep the legacy binlog format. We wanted it to be very, very pluggable.

So the initial implementation is a transaction log file. Basically, we write out the replication messages to a file. A slave reads this and applies the operations. Pretty simple and foolproof to implement.

But it’s pluggable.

What if we stored the transaction log inside InnoDB? Not only that, what if we wrote it as part of the transaction that was doing the changes? That way, no XA is needed – everything is consistent with a COMMIT. This would greatly reduce the number of fsync()s needed to be consistent.
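In rough pseudo-API terms, the difference looks something like this (every type and function below is a hypothetical stand-in, just to show where the replication record ends up and what each approach has to make durable):

```cpp
// Every type and function here is a hypothetical stand-in, not the real
// Drizzle or InnoDB API; the point is what each approach must make durable.
struct Transaction;
struct ReplicationMessage;

void xa_prepare(Transaction *trx);        // fsync() of the InnoDB log
void append_to_log_file(const ReplicationMessage &msg);
void fsync_log_file();                    // fsync() of the transaction log file
void xa_commit(Transaction *trx);         // another fsync() of the InnoDB log
void insert_into_log_table(Transaction *trx, const ReplicationMessage &msg);
void commit(Transaction *trx);

// File-based transaction log: two separately durable things, so two-phase
// commit (XA) and its extra fsync()s are needed to keep them consistent.
void commit_with_file_log(Transaction *trx, const ReplicationMessage &msg)
{
  xa_prepare(trx);
  append_to_log_file(msg);
  fsync_log_file();
  xa_commit(trx);
}

// Transaction log stored inside InnoDB: the replication message is written
// as part of the same transaction, so a single COMMIT covers both.
void commit_with_innodb_log(Transaction *trx, const ReplicationMessage &msg)
{
  insert_into_log_table(trx, msg);
  commit(trx);
}
```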

Now… the first thing people will say is “arrggh! You’re writing the data *four* times now”. First the transaction data goes into the log, then the replication message into the log, and then both of these are written back to the data file. It turns out that this is still much cheaper than doing the additional fsync()s.

In one of our tests, the file-based transaction log managed ~300tps and the transaction log in InnoDB ~1200tps.

I think that’s an acceptable trade-off.

We’ve just merged the first bit of this work into Drizzle.

Props go to Joe Daly, Brian and me for making it work.

The camera never lies

Of course it does! We have The GIMP and Photoshop! Well…

Back in the day, when everybody shot film, things were a bit more difficult. For a lot of operations it was pretty easy: select the right film and the right exposure. For more control you could vary how you developed it, and beyond that you could do a million things in the darkroom when printing. However, if you wanted to do something like combine 2 images or take out part of an image or smooth a skin tone, you were in for a lot more fun.

Retouching was done by changing the negative. Want to remove that pimple from a portrait? Go get some paint and paint over it. This was tricky: a 35mm negative is very small, which made the work fiddly.

This is why publications such as Playboy shot on larger format film. From what I’ve read, either 120 (“medium format” to you and me – bigger than 35mm, but still not huge) or 4×5 (inches – much bigger) or even 8×10. While we can all wish that we too could get hold of some 8×10 Kodachrome to play with (and presumably a lab to process it for us) – those days are long gone.

With a negative of 8×10 inches, you have a lot more to play with and it’s much easier. For one thing, a contact print is as big as most enlargements people do from 35mm!

With humans essentially painting on negatives, it became relatively easy to spot when things had been manipulated (meaning there were experts who did it). However, with the increased sophistication of digital tools, creating quite realistic manipulations (even to the expert eye) wasn’t that hard.

Recently, Canon (among others) has tried to bring technology to digital cameras that would enable you to check that an image has not been manipulated after it came out of the camera.

This technology is, of course, flawed.

From the guy who enabled blind people to read eBooks comes the breaking of this system (Boing Boing and Network World).

“Pics or it didn’t happen” simply isn’t true.

A more complete look at Storage Engine API

Okay… so I’ve blogged many times before about the Storage Engine API in Drizzle. This API is somewhat inherited from MySQL, and we have very much attempted to make it a much cleaner interface. Our goals in making changes include: making it much easier to write and maintain a storage engine, making the upper-layer code obviously correct and clear in what it’s doing, and being able to more easily introduce optimisations.

I’ve recently added a Storage Engine that is only used in testing: storage_engine_api_tester. I’ve blogged before on it producing call graphs (really state transition graphs) for both Storage Engine and Cursor.

I’ve been expanding the test. My test engine is now a wrapper around a real engine instead of just a fake one. This lets us run real queries (and test cases) while testing what’s going on. At some point in the near future I plan to make it so that it will be able to log what calls go on to the engine and produce a graph just of those.
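The wrapping itself is nothing clever: each method notes that it was called (so the state transitions can be checked and graphed) and then hands off to the real engine’s cursor. Roughly like this (made-up class names, not the actual storage_engine_api_tester source):

```cpp
#include <cstdio>

// Made-up sketch of the wrapping idea, not the real storage_engine_api_tester code.
class InnerCursor
{
public:
  virtual ~InnerCursor() {}
  virtual int doStartTableScan(bool scan)= 0;
  virtual int rnd_next(unsigned char *buf)= 0;
};

class LoggingCursor
{
  InnerCursor *real;   // the real engine's cursor that we wrap

public:
  explicit LoggingCursor(InnerCursor *real_cursor) : real(real_cursor) {}

  int doStartTableScan(bool scan)
  {
    fprintf(stderr, "CURSOR: doStartTableScan()\n");  // record the transition
    return real->doStartTableScan(scan);              // then do the real work
  }

  int rnd_next(unsigned char *buf)
  {
    fprintf(stderr, "CURSOR: rnd_next()\n");
    return real->rnd_next(buf);
  }
};
```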

I added a lot more to the Storage Engine part of the wrapper. Below you can see the current graph:

I’ve coded what I consider to be bugs as red and what I consider suspect as blue.

Also for the Cursor (colours mean the same):

As you can see, there are currently some wacky possibilities. I’m investigating exactly what’s going on here – whether I’m somehow missing some calls that I should be wrapping (I don’t think so) or whether we really are doing some dumb-ass things in the upper layer.

Also, please do not be under any impression that any of this means that we’re going to have a stable API. We’re not. To stabilise on this would just be insane – way too much of it still doesn’t make much sense.

Making my own B&W Prints

I managed to light-seal the laundry (not pretty… but it worked) and started playing with one of the enlargers I bought recently. I had a bit of an inkling, from some reading I did ages ago, about what I had to do to make prints.

I didn’t really have any developer meant for prints… so I just grabbed some Rodinal and dived right in. Basically, I started with the lens wide open and around 0.5 to 1 second of exposure.

Because I was just experimenting, I skipped a stop bath (did a rinse though) and then straight into some fixer.

Here are the results of my experimentation (photos of the drying prints, taken with my phone):

bench (print)

Leah

Contrast these with the scans of the negatives:

dedicated bench

by the water

Limiting functions to 32k stack in Drizzle (and scoped_ptr)

I wonder if this comes under “Code Style” or not…

Anyway, Monty and I finished getting Drizzle ready for adding “-Wframe-larger-than=32768” as a standard compiler flag. This means that no function within the Drizzle source tree can use more than 32kb of stack – it’s a compiler warning, and with -Werror, that means it’s a build error.

GCC is not perfect at detecting stack usage, but it’s pretty good.
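For example, a function like this one (a deliberately silly illustration) now fails the build:

```cpp
#include <cstring>

// With -Wframe-larger-than=32768 (and -Werror) this is a build error:
// the buffer alone is twice the 32kb stack budget for a single frame.
void blows_the_stack_budget()
{
  char buffer[64 * 1024];               // 64kb straight onto the stack
  memset(buffer, 'x', sizeof(buffer));
}
```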

Why have we done this?

Well, there is a little bit of recursion in the server… and we can craft queries to blow a small stack (not so good). On Mac OS X, the default thread stack size is only 512kb. That’s only 16 frames of headroom if 32kb stack frames are even remotely common.

I also found some interesting places that throw a lot of things onto the stack – rather far down a call chain – leading to the possibility of blowing up in really strange ways.

We’d love to make it 16kb… but that’s a fair bit more work, so something for the future.

We’ve used the Boost scoped_ptr to address a bunch of these situations, as it provides pretty much the minimal code change for the same effect (except that the memory is dynamically allocated instead of being part of the stack frame).
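The typical change ends up looking something like this (a simplified illustration rather than a real Drizzle diff; for array buffers the matching boost::scoped_array does the same job):

```cpp
#include <boost/scoped_ptr.hpp>

// A made-up example of a structure that is far too big for the stack frame.
struct SortBuffer
{
  unsigned char bytes[40 * 1024];   // 40kb
};

// Before: the whole buffer lives in the stack frame.
void sort_rows_before()
{
  SortBuffer buffer;
  buffer.bytes[0]= 0;   // ... use buffer.bytes ...
}

// After: same scope and lifetime, but heap allocated; boost::scoped_ptr
// frees it automatically on return, so the code change is minimal.
void sort_rows_after()
{
  boost::scoped_ptr<SortBuffer> buffer(new SortBuffer);
  buffer->bytes[0]= 0;  // ... use buffer->bytes ...
}
```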

Drizzle gets InnoDB 1.0.9

My branch that updates the innobase plugin in Drizzle to be based on innodb_plugin 1.0.9 has been merged. For the next milestone, we’ll probably have 1.0.11 as well.

How’s the progress on getting 1.1 and 1.2 in? Pretty good, actually. We’ll have it for either this milestone or the next one.

And merging newer InnoDB into HailDB? It’s going well too – expect more news “soon”.

Cursor states

Following on from my post yesterday on the various states of a Storage Engine, I said I’d have a go with the Cursor object too. A Cursor is used by the Drizzle kernel to get and set data in a table. There can be more than one cursor open at once, and more than one per thread. If your engine cannot cope with this, it is its responsibility to figure it out and return the appropriate errors.

Let’s look at a really simple operation, inserting a couple of rows and then reading them back via a full table scan.

Now, this graph is slightly incomplete, as there is no doEndTableScan() call. But you can see in which order things are meant to happen. In this case, “store_lock()” means that store_lock() has been called, so when coming back from doInsertRecord() we do not call store_lock() again; rather, we’re just in a state where it has already been executed.

For the MySQL handler, think ::write_row() for doInsertRecord() and ::rnd_init() for doStartTableScan().

This diagram was again auto-generated from my test engine.
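Written out as calls rather than a diagram, that insert-two-rows-then-scan sequence looks roughly like this (ExampleCursor is hypothetical, error handling and most arguments are trimmed, but the method names are the Cursor ones discussed above):

```cpp
// Sketch of the call order only; ExampleCursor is not a real engine.
struct ExampleCursor
{
  void store_lock();
  int  doInsertRecord(unsigned char *row);
  int  doStartTableScan(bool scan);
  int  rnd_next(unsigned char *buf);
  int  doEndTableScan();
};

void insert_then_scan(ExampleCursor &cursor,
                      unsigned char *row1, unsigned char *row2,
                      unsigned char *read_buffer)
{
  cursor.store_lock();             // called once; still "in effect" for the second insert
  cursor.doInsertRecord(row1);
  cursor.doInsertRecord(row2);

  cursor.doStartTableScan(true);   // ::rnd_init() in MySQL handler terms
  while (cursor.rnd_next(read_buffer) == 0)
  {
    /* each row comes back here */
  }
  cursor.doEndTableScan();         // the call missing from the auto-generated graph
}
```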