I’ve been working on BUG#17928, which is all about “testBackup fails in error handling testcases” which appeared after we merged in some work to the 5.1 tree (which is okay in 5.0) that changes some things in the way that online backups are done in NDB to better support recovery in the event of various types of failures and various times in the process.
Anyway, not all systems are affected by this bug… I’m at least reproducing some of the failures on my laptop and have spent the past while in the depths of the BACKUP and NDBFS blocks trying to work out what’s going on and why we’re hitting this assert.
NDBFS is an interesting block as it’s the file system interaction for NDB – so we’re doing things that could take an arbitrary amount of time. We don’t like waiting for those sorts of things in cluster, so we go on and do other work.
For backup, we buffer up writes and send FSAPPENDREQ (File System Append Request) to NDBFS only when we have a reasonable amount of data to send. For example, for very small rows being put into the log file, no use making a write() call for each row – batch them into big chunks!
Back to the bug; I’ve got a theory[1], half a patch and some work to do tomorrow.
[1] excluding gratuitous Buffy quote
Pingback: Ramblings » Blog Archive » A Followup on: a bug on failure failure