Which is bigger: MySQL or PostgreSQL?

From my previous posts, we have some numbers (excluding NDB) for the size of MySQL, so what about PostgreSQL? Here I used the PostgreSQL git trunk, classing things in the contrib/ directory as plugins. I counted the lines of code in the src/backend/storage directory as storage engine LoC, but did not count them as non-kernel code.

Version            Total LoC   Plugin LoC   Storage Engines LoC   Remaining (kernel)
MySQL 5.5.30         858,441        2,706               171,009   684,726 (79% kernel)
MySQL 5.6.10       1,049,344       29,122               236,067   784,155 (74% kernel)
MariaDB 5.5        1,142,118       11,781               304,015   826,322 (72% kernel)
Drizzle trunk        334,810       31,150               130,727   172,933 (51% kernel)
PostgreSQL trunk     648,691       61,934                17,802   586,757 (90% kernel)

What we can see is that the PostgreSQL kernel is actually smaller than that of any recent MySQL version (the MySQL 5.1 kernel was slightly smaller still). This is rather interesting, as it is generally thought that PostgreSQL does more than MySQL. What’s more telling is total code size: PostgreSQL is a bit over half the size of MySQL 5.6 or MariaDB 5.5. Only Drizzle ends up being smaller, which makes sense as it “does less”.

MySQL modularity, are we there yet?

MySQL is now nearly three times the size it was at MySQL 3.23, and this growth has not come in the shape of plugins.

Have we improved modularity over time? I decided to take LoC count for plugins and storage engines (in the case of Drizzle, memory, myisam and innobase are storage engines and everything else comes under plugin). I’ve excluded NDB from these numbers as it is rather massive and is pretty much still a separate thing.

Version         Total LoC   Plugin LoC   Storage Engines LoC   Remaining (kernel)
MySQL 3.23.58     371,987       0 (0%)               176,276   195,711 (52% kernel)
MySQL 5.1.68      721,331          228               237,124   483,979 (67% kernel)
MySQL 5.5.30      858,441        2,706               171,009   684,726 (79% kernel)
MySQL 5.6.10    1,049,344       29,122               236,067   784,155 (74% kernel)
MariaDB 5.5     1,142,118       11,781               304,015   826,322 (72% kernel)
Drizzle trunk     334,810       31,150               130,727   172,933 (51% kernel)

I’ve used the non-plugin and non-storage engine code size to be the database “kernel” – i.e. the core of the database server.

What I find really interesting here is that yes, the amount of code that is to some degree modular has increased. But the amount of code that is a MySQL plugin is still very small compared to the size of the server.

Drizzle’s kernel is 20-25% of the size of a modern MySQL or MariaDB kernel, and for many applications Drizzle does largely or exactly the same thing.

unireg.h is finally gone

I got rid of unireg.cc way back in 2009 when I rewrote all the FRM related code inside Drizzle to instead use a nice protobuf based structure. If you’re wondering what was there, I’ll just quote this part of pack_screens() from unireg.cc in MySQL 5.6:

start_row=4; end_row=22; cols=80; fields_on_screen=end_row+1-start_row;

We have gradually pulled things out of unireg.h over the years too. But, let’s go back to ask the question “What is UNIREG?”. To answer that, I’m going to quote from something that was current back when MySQL 3.22 was the latest and greatest:

In 1979, he developed an in-house database tool called UNIREG for managing databases. Since 1979, UNIREG has been rewritten in several different languages and extended to handle big databases.

No doubt the definition of big has changed for most people since then. If we take what the MySQL 3.23 manual said:

Unireg is our tty interface builder, but it uses a low level connection to our ISAM (which is used by MySQL) and because of this it is very quick. It has existed since 1979 (on Unix in C since ~1986).

This is where the FRM file comes from, and likely where unireg.cc and unireg.h in MySQL originated. Since we didn’t want the limitations of FRM files in Drizzle, and since all the code that mentioned UNIREG was generally not modern, we’ve been gradually removing bits and refactoring ever since.

So, what was in unireg.h to begin with? I peeked into the MySQL 5.6 unireg.h to remind myself:

  • defines for libc5
  • defines for the MySQL thing that isn’t gettext
  • the PLUGINDIR define
  • some macros for errors
  • a bunch of SPECIAL_ prefixed defines for a bitfield which is a truly odd thing.
  • store_record(), restore_record(), cmp_record(), empty_record() macros
  • defines for flags to openfrm(), openprt() and openfrd() (the latter two not being present in MySQL)
  • defines used in relation to INFORMATION_SCHEMA tables and triggers
  • the BIN_LOG_HEADER_SIZE define
  • the DEFAULT_KEY_CACHE_NAME define

Soo… really a whole bunch of stuff that should never have been put there in the first place, as most of it clearly came in after UNIREG became MySQL.

But I’m getting sidetracked, the main point was that I looked at what was remaining in unireg.h inside Drizzle and it was just one thing: Code to abort the server in certain odd situations (that we’d completely rewritten so had no relation to unireg). So, I renamed it.

Renaming a file is a pretty minor change, but that hides all the work leading up to that point so that now I can go and explain Drizzle internals while making one less reference to UNIREG.

Now to go and remove the last tiny bit of unireg_check…

Fun with Coverity found bugs (Episode 1)

Taking inspiration from the great series of blog posts “Fun with Bugs” (and not http://funwithbugs.com/, which is about both caring for and eating bugs), and since I recently went and ran Coverity against Drizzle, I thought I’d do a small series of posts on bugs it has found (and that I’ve fixed).

An idea that has been pervasive in the Drizzle project (and one that I rather subscribe to) is that there are two types of correct: correct and obviously correct. Being obviously correct is much, much better than merely being correct.

The first category of problems that Coverity found was kind of interesting: a warning that data_file_name and index_file_name in class ha_myisam weren’t initialized in the ha_myisam constructor, nor in any function that it calls. It turns out this was basically because the code wasn’t exactly optimal and these variables were used rather oddly. In fact, in writing this blog post I went back and found that there’s a bunch of extra dead code, and that these variables should just be removed along with the code that “used” them.

The historical use for data_file_name and index_file_name was that (in MySQL) you could specify different paths for the MyISAM data and index files, so that the FRM ended up in the server datadir, the data file ended up some other place and the index file was off behind the sofa. Since MyISAM is used only for temporary tables in Drizzle, this is entirely unneeded.

Another place where a similar bug was found by Coverity was in the SQLExecutor class of the json_server plugin, where the _err variable wasn’t initialized in the constructor. After some careful auditing, I think this one was actually a false positive, as it was set to something before being used, but it was pretty simple to prevent future bugs by initializing it.
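The fix for this kind of warning is a one-liner in the constructor. A minimal sketch (the class and member names come from the warning; the stripped-down class body and the member’s type are purely illustrative, not the real json_server code):

#include <cstdint>

class SQLExecutor
{
  std::uint32_t _err;   /* illustrative type; the real member may differ */

public:
  SQLExecutor() : _err(0)   /* initialize in the initializer list */
  {}
};

Initializing in the initializer list is what turns “correct” into “obviously correct”: nobody has to audit every code path to know that _err always starts out with a defined value.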

Two instances of the same warning: one found a bunch of code to delete (rather useful) and the other is rather minor, but may help someone in the future.

Coming up next: total embarrassment bugs.

Coverity scan for Drizzle

Coverity is a static analysis tool which, although proprietary itself, offers a free scanning service for free and open source software (which is great, by the way; I totally owe the people who make that happen a frosty beverage).

Prompted by someone volunteering to get MariaDB into the Coverity Scan, I realized that I hadn’t actually followed through with this for Drizzle. So, I went and submitted Drizzle. As a quick overview, this is the number of problems of each severity both projects got back:

Severity   MariaDB   Drizzle
High           178        96
Medium       1,020       495
Low             47        52

I don’t know what the numbers for MySQL may be, but it’d be great to see them out in the open too.

Impact of MySQL slow query log

So, what impact does enabling the slow query log have on MySQL?

I decided to run some numbers. I’m using my laptop since, as we all know, the most commonly deployed database servers have multiple cores, SSDs and many GB of RAM. For the curious: Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz.

The benchmark is going to be:
mysqlslap -u root test -S var/tmp/mysqld.1.sock -q 'select 1;' --number-of-queries=1000000 --concurrency=64 --create-schema=test

Which is pretty much “run a whole bunch of nothing, excluding all the overhead of storage engines, optimizer… and focus on logging”.

My first run was going to be with the slow query log on. I’ll start the server with mysql-test-run.pl as it’s just easy:
eatmydata ./mysql-test-run.pl --start-and-exit --mysqld=--slow-query-log --mysqld=--long-query-time=0

The results? It took 18 seconds.

How long without the slow query log (starting with mysql-test-run.pl again, but this time without any of the extra mysqld options)? 13 seconds.

How does this compare to a Drizzle baseline? On a freshly built Drizzle trunk, using the same mysqlslap binary I used above and connecting via UNIX socket: 8 seconds.
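Doing the arithmetic: 1,000,000 queries in 18 seconds is roughly 55,000 queries/second with the slow query log on, while 13 seconds is roughly 77,000 queries/second without it; logging every query cost over a quarter of our throughput on this trivial workload. Drizzle’s 8 seconds works out to roughly 125,000 queries/second.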

Finding out What’s Next at BarCampMel 2012 with Drizzle, SQL, JavaScript and a web browser

Just for the pure insane fun of it, I accepted the challenge of “what can you do with the text format of the schedule?” for BarCampMel. I’m a database guy, so I wanted to load it into a database (which would be Drizzle), and I wanted it to be easy to keep it up to date (this is an unconference after all).

So… the text file itself isn’t in any standard format, so I’d have to parse it. I’m lazy and didn’t want to leave the comfort of the database. Luckily, inside Drizzle we have a js plugin that lets you execute arbitrary JavaScript: parsing solved. I needed to fetch the program, and luckily we have the http_functions plugin, which uses libcurl to let us perform HTTP GET requests. I also wanted the schedule in a table so I could query it when not online, which meant loading the data. Luckily, Drizzle has built-in EXECUTE functionality, so I could just use the JavaScript to parse the response from the HTTP GET request and construct SQL to load the data into a table to then query.

So, grab your Drizzle server with “plugin-add=js” and “plugin-add=http_functions” in the config file or as options to drizzled (prefixed with --) and….

This simple one liner pulls the current schedule and puts it into a table called ‘schedule’:

SELECT EXECUTE(JS("function sql_quote(s) {return s ? '\"'+ s.replace('\"', '\\\"') + '\"' : 'NULL'} function DrizzleDateString(d) { function pad(n) { return n<10 ? '0'+n : n } return d.getFullYear()+'-'+pad(d.getMonth()+1)+'-'+pad(d.getDate())+' '+pad(d.getHours())+':'+pad(d.getMinutes())+':'+pad(d.getSeconds()) } var sql = 'COMMIT;CREATE TABLE IF NOT EXISTS schedule (start_time datetime, stage varchar(1000), mr2 varchar(1000), mr1 varchar(1000), duration int); begin; delete from schedule;' ; var time= new Date;var input= arguments[0].split(\"\\n\"); var entry = new Array(); var stage, mr2, mr1; for(var i=0; i < input.length; i++) { var p= input[i].match('^(.*?) (.*)$'); if(p) {if(p[1]=='Time') { time=new Date(Date.parse(p[2]));} if(p[1]=='Duration') { sql+='INSERT INTO schedule (start_time,stage,mr2,mr1,duration) VALUES (\"' + DrizzleDateString(time) + '\", ' + sql_quote(stage) + ', ' + sql_quote(mr2) + ',' + sql_quote(mr1) + ',' + p[2] + '); '; time= new Date(time.getTime()+p[2]*60*1000); stage= mr2= mr1= ''; } if(p[1]=='stage') {stage=p[2]} if (p[1]=='mr2') {mr2=p[2]} if (p[1]=='mr1') {mr1=p[2]} }}; sql+='COMMIT;'; sql", (select http_get('https://dl.dropbox.com/s/01yh7ji7pswjwwk/live-schedule.txt?dl=1'))));

Which you can then find out “what’s on now and coming up” with this query:

select * from schedule where start_time > DATE_ADD(now(), INTERVAL 9 HOUR) ORDER BY start_time limit 2\G

But it’s totally not fun having to jump to the command line all the time, and you may want it in JSON format for consuming with some web thing… so you can load the json_server plugin, browse to the port it’s running on (default 8086), type the SQL in there and get a JSON response, or just look at the pretty table there.

Hacking the Jenkins BZR plugin

For Drizzle and for all of the projects we work on at Percona we use the Bazaar revision control system (largely because it’s what we were using at MySQL and it’s what MySQL still uses). We also use Jenkins.

We have a lot of jobs in our Jenkins. A lot. We build upstream MySQL 5.1, 5.5 and 5.6, Percona Server 5.1, Percona Server 5.5, XtraBackup 1.6, 2.0 and 2.1. For each of these we also have the normal trunk builds as well as parameterised ones that allow a developer to test out a tree before they ask for it to be merged. We also have each of these products across seven operating systems and for each of those both x86 32bit and 64bit. If we weren’t already in the hundreds of jobs, we certainly are once you multiply out between release and debug and XtraBackup being across so many MySQL and Percona Server versions.

I honestly would not be surprised if we had the most jobs of any user of the Bazaar plugin to Jenkins, and we’re probably amongst the top few of all Jenkins installations.

So, in August last year we discovered a file descriptor leak in the Bazaar plugin. Basically, garbage collection doesn’t get kicked off when you run out of file descriptors. This prevented us from even starting back up Jenkins until I found and fixed the bug. Good times.

We later hit a bug that was triggered in the parallel loading of jobs during startup. We could get stuck in an infinite loop during Jenkins starting that would just eat CPU and get nowhere. Luckily Jenkins provides a workaround: specify “-Djenkins.model.Jenkins.parallelLoad=false” as an argument and it just does it single threaded. For us, this solves that problem.

We were also hitting another problem. If you kill bzr at just the wrong time, you can leave the repository in a not entirely happy state. An initial branch can be killed at a point where it’ll think it’s a repository rather than a checkout, and there’s a bunch of other weirdness (including file system corruption if you happen to use bad VM software).

The way we were solving this was to occasionally “clean workspace” on the jobs that needed it (annoying with matrix builds). We’d switched to just doing “clean tree” for a bunch of builds. The problem with doing a clean tree was that “bzr branch” to check out the source code could take a very long time – especially for Percona Server, which is a branch of MySQL and hence has hundreds of megabytes of history.

We couldn’t use bzr shared repositories as we kept hitting concurrency bugs when more than one Jenkins job was trying to do a bzr operation at the same time (common when matrix builds kick off builds for release and debug, for example).

So… I fixed that in the Jenkins Bazaar plugin too (the fix should be in an upcoming release) and we’ve been running it on our Jenkins instance for the past ~2 months.

Basically, if we fail to check out the Bazaar tree, we wipe it clean and try again (Jenkins has a “retry count” for source checkouts). This is a really awesome form of self-healing. Even if the Bazaar team fixed all the bugs, we’d still have to get that new version of bzr onto all our build machines – including ancient systems such as CentOS 5. Not as much fun as bashing your head into a vice.

After all of that, I seem to now be the maintainer of the Bazaar plugin for Jenkins, as Monty pointed out that I was using it a lot more than he was, and I kept finding and fixing bugs.

Soooo… say hello to the new Jenkins Bazaar plugin maintainer, me.

Yes, I maintain Java code now. Be afraid. Be very afraid.

There is a story….

I have a friend who is fond of telling a story from way back in November 2008 at the OpenSQL Camp in Charlottesville, Virginia. This was relatively shortly after we had announced to the public that we’d started something called Drizzle (we did that at OSCON) and even closer to the date I started working on Drizzle full time (which was November 1st). Compared to what it is now, the Drizzle code base was in its infancy. One of the things we hadn’t yet sorted out was the rewrite of the replication code.

So, I had my laptop plugged into a projector, and somebody suggested opening up some random source file… so I did. It was a bit of the replication code that we’d inherited from MySQL. Immediately, we spotted a bug. In fact, between myself and Brian, I think we worked out that none of the error handling in this code path had ever even remotely worked.

Fast forward a bunch of years, and recently I had opened up part of the replication code in MySQL 5.5 and (again) instantly spotted a bug. Well… the code is correct in two out of three situations…

It is rather impressive that the MySQL Replication team has managed to add the features they have in MySQL 5.6.

I’m also really happy with what we managed to do inside Drizzle for replication. Ripping out all the MySQL legacy code was a big step to take, and for a while it seemed like possibly the wrong one – but ultimately, it was very much the right thing to do. I love going and looking at the Drizzle replication code. I simply love it.

Drizzle Day 2012

Henrik has already posted it over on the Drizzle Blog, but I thought I’d give a shout out here too.

We’re holding a Drizzle Day right after the Percona Live MySQL Conference and Expo in April. So, since you’re all like me and don’t book your travel this far in advance, it’ll be easy to stay for the extra day and come and learn awesome things about Drizzle.

I’m also pretty glad that my employer, Percona, is sponsoring the event.

Speaking at Percona Live London 2011 (on Drizzle!)

Both Henrik and myself will be at Percona Live London 2011 in late October speaking on the wonderful Drizzle database server.

Other speakers at the conference will be talking about a wide range of topics surrounding the MySQL ecosystem including performance monitoring, backup, search, scaling and data recovery.

P.S. I do have a discount code – ask me in the comments for it!

Joining Percona

As you may have read on the MySQL Performance Blog post – I’ve recently joined Percona. This is a fairly exciting next step. I’ll be in New York for Percona Live next week, where I’ll be giving a session titled “Drizzle 7, GA and Supported: Current & Future Features”.

I’ll write more soon, there’s a lot keeping me busy already!

Drizzle JSON interface merged

https://code.launchpad.net/~stewart/drizzle/json-interface/+merge/59859

Currently a very early version, of course, but it’s there in trunk if you want to play with it. Just have libcurl and libevent installed and you can submit queries via HTTP and JSON. Of course, the next steps are getting a true non-SQL interface going and seeing how people go with it.

HTTP JSON AlsoSQL interface to Drizzle

So… I had another one of those “hrrm… this shouldn’t be hard to hack a proof-of-concept” moments. Web apps are increasingly speaking JSON all around the place. Why can’t we speak JSON to/from the database? Why? Seriously, why not?

One reason why MongoDB has found users is that JSON is very familiar to people. It has gained popularity in spite of having pure disregard for the integrity and safety of your data.

So I started with a really simple idea: an HTTP server in the database server. Thanks to the simple code for doing that with libevent, I got it going fairly quickly. Finding a rather nice C++ library to create and parse JSON was the next challenge; I found JSONcpp, a public domain library with a nice API, and proceeded to bring it into the tree (it’s not much code). I then created a simple way to find out the version of the Drizzle server you were speaking to:

$ curl http://localhost:8765/0.1/version
{
   "version" : "2011.04.15.2285"
}
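To give you an idea of how little code this takes, here is a standalone sketch of serving an endpoint like that using libevent’s evhttp API and JSONcpp. This is my own minimal example with a hard-coded version string and port, not the actual plugin code:

#include <string>
#include <event2/event.h>
#include <event2/http.h>
#include <event2/buffer.h>
#include <json/json.h>

static void version_cb(struct evhttp_request *req, void *)
{
  /* build the JSON document with JSONcpp */
  Json::Value root;
  root["version"]= "2011.04.15.2285";

  Json::StyledWriter writer;
  std::string json= writer.write(root);

  /* and hand it back over HTTP */
  struct evbuffer *buf= evbuffer_new();
  evbuffer_add(buf, json.data(), json.length());
  evhttp_send_reply(req, HTTP_OK, "OK", buf);
  evbuffer_free(buf);
}

int main(void)
{
  struct event_base *base= event_base_new();
  struct evhttp *http= evhttp_new(base);

  evhttp_bind_socket(http, "0.0.0.0", 8765);
  evhttp_set_cb(http, "/0.1/version", version_cb, NULL);

  return event_base_dispatch(base);
}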

But that wasn’t nearly enough… I also wanted to be able to issue arbitrary queries. Thanks to the supporting code we have in the Drizzle server for EXECUTE() (also used by the replication slave), this was also pretty easy. I created a way to execute the content of an HTTP POST request as if you had passed it to EXECUTE() – all nicely wrapped in a transaction.

I created a simple table using the drizzle client, connecting over a normal TCP socket speaking the MySQL protocol and inserted a row in it:

$ ../client/drizzle --port 9306 test
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 4
Connection protocol: mysql
Server version: 2011.04.15.2285 Source distribution (json-interface)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `a` INT NOT NULL AUTO_INCREMENT,
  `b` VARCHAR(100) COLLATE utf8_general_ci DEFAULT NULL,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB COLLATE = utf8_general_ci
1 row in set (0.001209 sec)

drizzle> insert into t1 (b) values ("from mysql protocol");
Query OK, 1 row affected (0.00207 sec)

Now to select rows from it via HTTP and get a JSON object back with the result set:

$ curl http://localhost:8765/0.1/sql --data 'select * from t1;'
{
   "query" : "select * from t1;",
   "result_set" : [
      [ "1", "from mysql protocol" ],
      [ "", "" ]
   ],
   "sqlstate" : "00000"
}

I can also insert more rows using the HTTP interface and then select them from the MySQL protocol interface:

$ curl http://localhost:8765/0.1/sql --data 'insert into t1 values (NULL, \"from HTTP!\");'
{
   "query" : "insert into t1 values (NULL, \\\"from HTTP!\\\");",
   "sqlstate" : "00000"
}

drizzle> select * from t1;
+---+---------------------+
| a | b                   |
+---+---------------------+
| 1 | from mysql protocol | 
| 2 | from HTTP!          | 
+---+---------------------+
2 rows in set (0.000907 sec)

So what does this get us? With the addition of proper authentication, you could start doing some really quite neat and nifty things. I imagine we could add interfaces to avoid SQL and directly do key lookups, table scans and index range scans, giving really quite sweet performance. We could start building web tools to manage and manipulate the database speaking the native language of the web.

But… there’s more!

Since we have a web server and a way to execute queries via HTTP along with getting the result set as JSON, why can’t we have a simple Web UI for monitoring the database server and running queries built into the database server?

Yes we can.

If you wanted a WHERE condition or anything else: easy. Change the query, hit execute.

No TCP connection or parsing of the MySQL protocol or anything. Just HTTP requests straight to the database server from the browser, with a bit of client-side JavaScript producing the HTML for the table.

Proof of concept code is up on launchpad in lp:~stewart/drizzle/json-interface

Speaking on Tuesday: HailDB and Dropping ACID: Eating Data in a Web 2.0 Cloud World

I’m giving two talks tomorrow (Tuesday) at the MySQL Conference and Expo:

HailDB: A NoSQL API direct to InnoDB, 2:00pm, Ballroom D

Dropping ACID: Eating Data In A Web 2.0 Cloud World, 3:05pm, Ballroom G

The HailDB talk is all about a C API to embed an InnoDB based relational database engine into your application. Awesome stuff (also nice and technical).

The second talk, “Dropping ACID: Eating Data in a Web 2.0 Cloud World” is not only a joke that only database people get, but a humorous and serious look at data integrity and reliability as promised by the current hype. This was quite well received at linux.conf.au in January. So, if you weren’t in Australia in January this year, then certainly come along and see how you go heckling an Australian.

Drizzle Developer Day

As Lee mentioned on the Drizzle blog, we have an upcoming Drizzle Developer Day just after the MySQL Conference and Expo. Sign up here to make sure we have enough space and can help with planning.

The last couple of developer days have been great: helping people get started with Drizzle, discussing improvements that could be made (both big and small), covering operations concerns and, new for this one, a GA release.

Having a GA release out is really exciting. I’m hoping that at the developer day we get people coming along who are looking at developing with Drizzle, not just hacking on internals.

Moar users nom nom nom!

xtrabackup for Drizzle merge request

Follow it over on launchpad.

After having fixed an incredibly odd compiler warning (which, with the -Werror we build with, is an error) on OSX (die die die), xtrabackup for Drizzle is ready to be merged. This will bring it into our next milestone: fremont. Over the next few weeks you should see some good tests merged in for backup and restore too.

While not final final, I’m thinking that the installed binary name will be drizzlebackup.innobase: a simple naming scheme for the various backup tools that are Drizzle specific. This casually pre-empts a drizzlebackup tool that can co-ordinate all of these (like the innobackupex script does).

Drizzle online backup with xtrabackup

For backups, historically in the MySQL world you’ve had mysqldump (a SQL dump, meaning that on restore you have to rebuild indexes), InnoDB Hot Backup (proprietary, but it takes a copy of the InnoDB data files, so restore is much quicker), LVM snapshots (various scripts exist; they have a larger IO impact and require LVM) and, more recently, xtrabackup. Xtrabackup essentially does the same thing as InnoDB Hot Backup, except that it’s free and open source software.

Many people have been using xtrabackup successfully for quite a while now.

In Drizzle 7, our default storage engine is InnoDB. There have been a few changes, but it is totally InnoDB. This leaves us with the question of backup solutions. We have drizzledump (the Drizzle equivalent of mysqldump, although with fewer gotchas), you could always use LVM snapshots, and the probability of Oracle releasing InnoDB Hot Backup for Drizzle is rather minimal.

So enter xtrabackup as a possible solution… I had thought of porting xtrabackup across for a while. Last weekend, while waiting for one of my iterations of catalog support to compile, I decided to give it a go. I wanted to see how far I could get with it in that same weekend.

I was successful – there’s a tree up at lp:~stewart/drizzle/xtrabackup that produces an xtrabackup binary built for Drizzle (it’s not quite ready for merging yet; there are some obvious bugs around command line option parsing… but a backup and restore did work).

I wanted the following:

  • build to be integrated with Drizzle, using the same innobase build that we use to build the server
  • build with strict compiler warnings and -Werror (which we do for Drizzle)
  • build with a C++ compiler (as we do with innobase in Drizzle)
  • not re-add parts of mysys into the Drizzle build just for xtrabackup

I’ve already submitted merge requests to upstream xtrabackup containing the compiler fixes and added compiler warnings (they’ve also by now been merged into xtrabackup). Already my work has improved the quality of xtrabackup for everyone. Some of the warnings were fixed slightly differently in xtrabackup than in my Drizzle tree, but I plan to merge.

One issue was that the command line parsing library that xtrabackup uses – my_getopt, which is part of mysys (the portability library inside MySQL) – is long since gone from Drizzle. We currently use boost::program_options. Thanks to the heroic efforts of Andrew Hutchings, xtrabackup in Drizzle is also using boost::program_options. This was a brilliant “hey, can you have a look at this conversion” followed by handing him a tree that did not even remotely compile, followed by a “I have to take the kids somewhere, here’s a tree – it may compile”. Amazingly enough, it pretty much did compile once I fixed the other issues.
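For reference, this is roughly what option handling looks like in boost::program_options land. A generic sketch with a single made-up option, not the actual converted xtrabackup code:

#include <boost/program_options.hpp>
#include <iostream>
#include <string>

namespace po= boost::program_options;

int main(int argc, char **argv)
{
  std::string target_dir;

  po::options_description desc("backup options");
  desc.add_options()
    ("help", "print this help message")
    ("target-dir",
     po::value<std::string>(&target_dir)->default_value("./"),
     "destination directory for the backup");

  po::variables_map vm;
  po::store(po::parse_command_line(argc, argv, desc), vm);
  po::notify(vm);

  if (vm.count("help"))
  {
    std::cout << desc << std::endl;
    return 1;
  }

  std::cout << "backing up to " << target_dir << std::endl;
  return 0;
}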

An unresolved issue is how to deal with this going forward – my guess is that upstream xtrabackup doesn’t want to require Boost.

One solution could be just to factor the command line options out into a separate file that we can ignore for Drizzle and replace with our own. The other option could be to use a different command line option parsing library (perhaps from CCAN, as it’s then maintained by somebody else and doesn’t require heaps and heaps of other stuff).

Another issue I had to tackle is the patch to innobase that’s required to build xtrabackup.

I took a very minimal approach for the Drizzle patch. We are currently based on innobase 1.1.4 from MySQL 5.5, so I mostly looked at the xtradb55 patch. I think it would be great if these were, instead of one giant patch, a series of patches to apply (a la quilt) to a) make them easier to maintain and b) make it easier for me to work out the exact reasoning of each bit (also, generating the patches with -p would help a fair bit too).

So how did I do it?

Step 0
was removing support for old innobase – we totally don’t need it for Drizzle.

Step 1
was creating a srv_read_only option for Drizzle’s innobase. This was fairly easy. The one thing I did have to change was adding a check in os_file_lock() so that we don’t attempt to take a write lock on the ibdata files when in read only mode (otherwise backups can’t be taken while drizzled is running). I’m a little surprised that this wasn’t hit in 5.5 at all.
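The check itself is tiny. From memory, it’s roughly this shape (not the exact patch):

#include <fcntl.h>
#include <cerrno>
#include <cstdio>

extern bool srv_read_only;	/* the new option (innobase itself uses ibool) */

static int os_file_lock(int fd, const char *name)
{
  if (srv_read_only)
    return(0);	/* never take the write lock when running read only */

  struct flock lk;
  lk.l_type= F_WRLCK;
  lk.l_whence= SEEK_SET;
  lk.l_start= 0;
  lk.l_len= 0;

  if (fcntl(fd, F_SETLK, &lk) == -1)
  {
    fprintf(stderr, "InnoDB: Unable to lock %s, error: %d\n", name, errno);
    return(-1);
  }
  return(0);
}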

Step 2
was implementing srv_fake_write. I’m pretty sure I’ve gotten this right in the Drizzle implementation, but the patch wasn’t as easy to read as I’d really like. I probably need to do a bit more of a code audit to confirm that this is actually correct (I may try to come up with an LD_PRELOAD library that will scream loudly if writes are made to files matching a pattern).
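That LD_PRELOAD library isn’t hard to sketch, by the way: intercept write(), resolve the fd via /proc (so Linux-only), and scream if the path matches. All names and the pattern here are mine, purely for illustration:

// Build: g++ -shared -fPIC write_watch.cc -o write_watch.so -ldl
// Use:   LD_PRELOAD=./write_watch.so <binary under test>
#define _GNU_SOURCE 1   /* for RTLD_NEXT (g++ usually defines this anyway) */
#include <dlfcn.h>
#include <unistd.h>
#include <climits>
#include <cstdio>
#include <cstring>

extern "C" ssize_t write(int fd, const void *buf, size_t count)
{
  typedef ssize_t (*write_fn)(int, const void *, size_t);
  static write_fn real_write= (write_fn) dlsym(RTLD_NEXT, "write");

  char link[64];
  char path[PATH_MAX];
  snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
  ssize_t len= readlink(link, path, sizeof(path) - 1);
  if (len > 0)
  {
    path[len]= '\0';
    if (strstr(path, "ibdata") || strstr(path, ".ibd"))
      fprintf(stderr, "write() to %s when we promised not to!\n", path);
  }
  return real_write(fd, buf, count);
}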

Step 3
was implementing srv_apply_log_only. I’m pretty sure I have this right; again, more testing will be required. Why? Because I’m that paranoid about getting things very, very right.

Step 4
was to go through all the functions that xtrabackup needed to not be static. Instead of having prototypes for them in xtrabackup.cc, I added a xtrabackup_api.h header to innobase and included it where needed (including in xtrabackup). I’d recommend this way going forward for xtrabackup too, as it could be a lot less problematic to maintain (and it makes the xtrabackup source a bit easier to read).
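In sketch form, the header is nothing fancy: one place declaring the formerly-static functions. The function below is an invented placeholder to show the shape, not necessarily something xtrabackup actually needs:

/* xtrabackup_api.h: innobase functions that xtrabackup calls,
   declared in one header instead of prototyped inside xtrabackup.cc */
#ifndef XTRABACKUP_API_H
#define XTRABACKUP_API_H

#include "univ.i"	/* innobase's universal header, for ulint etc. */

ulint	xb_find_max_checkpoint(ib_uint64_t* max_lsn);	/* placeholder */

#endif /* XTRABACKUP_API_H */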

Step 5
was fixing up a few skeleton functions that were needed to make our innobase happy. It may not be a bad idea to split the skeleton functions out into a separate source file so they’re a bit easier to track (with some #ifdefs around those not needed for certain releases).

I’m hoping to work with the upstream xtrabackup devs on the various points I’ve made above.

Another thought of mine is to port xtrabackup into HailDB, where we can use much neater API functions to create good tests for xtrabackup.

Thanks go out to all who’ve worked on xtrabackup. It honestly wasn’t too hard getting it ported across to Drizzle – and with a bit of collaboration I think we can make it easy to keep up to date.

What’s the future for Xtrabackup in Drizzle? It’ll likely end up being a binary named drizzlebackup-innobase or similar (this means that there is a clear difference between xtrabackup for MySQL and what we have in Drizzle – which is more accurately defined as based on xtrabackup). We’ll also probably want a nice wrapper or integration with a backup tool to deal with everything Drizzle related. We shall also introduce a lot of testing; backups are important.

Xtrabackup is topical, so check out the latest OurSQL podcast and the Percona Xtrabackup website for more info!