I picked this up in a market in Stockholm.
It’s sitting on top of a diagram of a bunch of classes in ndb_mgmd and in front of The Art of Computer Programming (Knuth, Volumes 1-3). On top of that is Transactional Information Systems.
Every so often you come across people desiring intense XML and RDBMS interaction. Recently, here: Technical Notes and Articles of Interest » MySQL 5.1 is gaining some momentum.
In MySQL land, this usually means “An XML Storage Engine would be great”. hrmm… a storage engine is a way to store a table. A table is a relational thingy. Surely what’s wanted is a good way to query an XML document and search for a specific XML document.
So, is what’s really wanted good XML indexers and the ability to extract parts (or the whole) of a document? Maybe there’s some extension to fulltext (along with some funky functions) that could bring an immense amount of power with dealing with XML?
What is there now? What do people do now? What do they want to do?
All interesting stuff.
ticketmaster.com.au – The White Stripes
Damn, damn, damn, damn damn. Only January 28th – and I’m in NZ.
Note to future organisers: make sure dates don’t overlap BDO or any really cool band tour dates.
Of course, the real disaster would be if Tool were touring at the same time as a work thing. How will people take it if I leave a company event for however long is needed to see Tool live – as many times as possible? I am dearly hoping that travel co-ordinates itself so I can see them in different cities and countries. Heck, even another planet, if we can do that by the time the new album is ready :)
Some people don’t seem to get the Tool thing. It’s just good music. But that’s the thing – it is good music. Also, great music to hack with. I reckon each album gets played at least once per week – still.
that is all.
err… okay, a bit more. Breakfast this morning consisted of doing the dishes (good first step), some toast with jam and vegemite and a mango. Yum. All the time listening to the recording of the company-wide conf call from the other day (2am was just a little bit late that night).
These conf calls are really good actually – being able to throw questions directly at the top (and have them answered) is a great thing. Also getting to know what is going on from a higher perspective is really valuable.
Microsoft’s file system patent upheld: ZDNet Australia: News: Software
Saying any part of the FAT file system is “novel and non-obvious” is rather like saying being stabbed in the eye with a fork is “novel and a good way to spend a Sunday afternoon”.
Seriously – what the?
I’m really glad I work for a company that opposes software patents.
Thanks to Pia for the links.
The process for starting up a cluster is pretty interesting. Where, of course, “interesting” is translated to “complex”. There are a lot of things you have to watch out for (namely you want one cluster, not two or ten or anything). You also want to actually start a cluster, not just wait forever for everybody to show up.
Except in some situations. For example, initial start. With an initial start, you really want to have all the nodes present (you don’t want to run the risk of starting up two separate clusters!).
Bug 15695 is a bug to do with Initial Start. If you have three nodes (a management node and two data nodes), break the network connection just between the two data nodes, and then reconnect it at the wrong time (where “the wrong time” means you trigger the bug), the cluster will never start. A workaround is to restart one of the data nodes, and everything comes up.
Note that this is just during initial start so it’s not a HA bug or anything. Just really annoying.
This seems to get hit when people have firewalls stopping the nodes talking to each other and then fix the firewall (without shutting down the cluster).
As is documented in the bug, you can replicate this with some iptables foo.
One of the main blocks involved in starting the cluster (and managing it once it’s up) is the Quorum manager – QMGR. You’ll find the code in ndb/src/kernel/blocks/qmgr/. You’ll also find some in the older CMVMI (Cluster Manager Virtual Machine Interface).
A useful thing to do is to define DEBUG_QMGR_START in your build. This gives you some debugging output printed to the ndb_X_out.log file.
The first bit of code in QmgrMain.cpp is the heartbeat code. execCM_HEARTBEAT simply resets the number of outstanding heartbeats for the node that sent the heartbeat. Really simple signal there.
During SR (System Restart) there is a timeout period for which we try to wait for nodes to start. This means we’ll be starting the cluster with the most number of nodes present (it’s much easier doing a SR with as many nodes as possible than doing NR – Node Recovery – on lots of nodes). NR requires copying of data over the wire. SR probably doesn’t. Jonas is working on optimised node recovery which is going to be really needed for disk data. This will only copy the changed/updated data over the wire instead of all data that that node needs. Pretty neat stuff.
We implement the timeout by sending a delayed signal to ourself. Every 3 seconds we check how the election of a president is going. If we reach our limit (30 seconds) we try to start the cluster anyway – not allowing other nodes to join.
The current problem is that each node in this two-node not-quite-yet cluster thinks it has won the election and so switches its state to ZRUNNING (see Qmgr.h), hence stopping the search for other nodes. When the link between the two nodes is brought back up – hugs and puppies do not ensue.
I should have a patch soon too.
For a more complete explanation on the stages of startup, have a look at the text files in ndb/src/kernel/blocks. Start.txt is a good one to read.
For the non-Lisp hackers – this sets some C mode options depending on the name of the path to the source file.
;; run this for mysql source
(defun mysql-c-mode-common-hook ()
  (setq indent-tabs-mode nil))

;; linux kernel style
(defun linux-c-mode-common-hook ()
  (c-set-style "linux"))

;; default style
(defun my-c-mode-common-hook ()
  (turn-on-font-lock)
  (setq comment-column 48))

;; predicates to check
(defvar my-style-selective-mode-hook nil)
(add-hook 'my-style-selective-mode-hook
          '((string-match "MySQL" (buffer-file-name)) . mysql-c-mode-common-hook))
(add-hook 'my-style-selective-mode-hook
          '((string-match "linux" (buffer-file-name)) . linux-c-mode-common-hook))
;; default hook
(add-hook 'my-style-selective-mode-hook
          '(t . my-c-mode-common-hook) t)

;; find which hook to run depending on predicate
(defun my-style-selective-mode-hook-function ()
  "Run each PREDICATE in `my-style-selective-mode-hook' to see if the
HOOK in the pair should be executed.  If the PREDICATE evaluates to
non-nil, HOOK is executed and the rest of the hooks are ignored."
  (let ((h my-style-selective-mode-hook))
    (while (not (eval (caar h)))
      (setq h (cdr h)))
    (funcall (cdar h))))

;; Add the selective hook to the c-mode-common-hook
(add-hook 'c-mode-common-hook 'my-style-selective-mode-hook-function)
For XFS, in normal operation, an extent is only allocated when data has to be written to disk. This is called delayed allocation. If we are extending a file by 50MB – that space is deducted from the total free space on the filesystem, but no decision on where to place that data is made until we start writing it out – due to memory pressure or the kernel automatically starts writing the dirty pages out (the sync once every 5 seconds on linux).
When an extent needs to be allocated, XFS looks it up in one of two b+trees it has of free space. There is one sorted by starting block number (so you can search for “an extent near here”) and one by size (so you can search for “an extent of x size”).
Ideally, you want as large an extent as possible, as close to the tail end of the file as possible (i.e. just making the current extent bigger).
The worst-case scenario is having to allocate extents to multiple files at once with all of them being written out synchronously (O_SYNC or memory pressure) as this will cause lots of small extents to be created.
Here I’m going to talk about how file systems store what part of the disk a part of the file occupies. If your database files are very fragmented, performance will suffer. How much depends on a number of things however.
XFS can store some extents directly in the inode (see xfs_dinode.h). If I’m reading things correctly, this can be 2 extents per fork (data fork and attribute fork). If more than this number of extents are needed, a btree is used instead.
HFS/HFS+ can store up to 8 extents directly in the catalog file entry (see Apple TechNote 1150 – which was updated in March 2004 with information on the journal format). If the file has more than 8 extents, a lookup then needs to be done into the extents overflow file. Interestingly enough, in MacOS X 10.4 and above (I think it was 10.4… may have been 10.3 as well) if a file is less than 20MB and has more than 8 extents, on an open, the OS will automatically try to defragment that file. Arguably you should just fix your allocation strategy, but hey – maybe this does actually help.
File systems such as ext2, ext3 and reiserfs just store a list of block numbers. In the case of ext2 and ext3, the further into a file you are, the more steps are required to find the disk block number associated with that block in the file.
So what does an extent actually look like? Well, for XFS, the following excerpt from xfs_bmap_btree.h is interesting:
#define ISUNWRITTEN(x)  ((x)->br_state == XFS_EXT_UNWRITTEN)

typedef struct xfs_bmbt_irec
{
        xfs_fileoff_t   br_startoff;    /* starting file offset */
        xfs_fsblock_t   br_startblock;  /* starting block number */
        xfs_filblks_t   br_blockcount;  /* number of blocks */
        xfs_exntst_t    br_state;       /* extent state */
} xfs_bmbt_irec_t;
It’s also rather self-explanatory. Holes (for sparse files) in XFS don’t have extents, and an extent doesn’t have to have been written to disk. This allows you to preallocate space in chunks without having written anything to it. Reading from an unwritten extent gets you zeros (otherwise it would be a security hole!).
memberdb/log.MYD:
 EXT: FILE-OFFSET     BLOCK-RANGE        AG AG-OFFSET          TOTAL
   0: [0..943]:       5898248..5899191    3 (36536..37479)       944
   1: [944..1023]:    6071640..6071719    3 (209928..210007)      80
   2: [1024..1127]:   6093664..6093767    3 (231952..232055)     104
   3: [1128..1279]:   6074800..6074951    3 (213088..213239)     152
   4: [1280..1407]:   6074672..6074799    3 (212960..213087)     128
   5: [1408..1423]:   6074264..6074279    3 (212552..212567)      16
memberdb/log.MYI:
 EXT: FILE-OFFSET     BLOCK-RANGE        AG AG-OFFSET          TOTAL
   0: [0..7]:         10165832..10165839  5 (396312..396319)       8
The interesting thing about this is that the log table grows very slowly. This table stores a bunch of debugging output for my memberdb application. It should possibly be a partitioned ARCHIVE table (and probably will be in the future).
The thing about a file growing slowly over time is that it’s more likely to have more than 1 extent (I’ll examine why in the near future).
My InnoDB data and log files only have one extent each… I think I’ve run xfs_fsr on my file system, though.
(A little while ago I was writing a really long entry on everything possible. I realised that this would be a long read and that fewer people would look at it, so I’ve split it up.)
This sprung out of doing work on the NDB disk data tree. Anything where efficient use of the filesystem is concerned tickles my fancy, so I went to have a look at what was going on.
Filesystems store what part of the disk belongs to what file in one of two ways. The first is to keep a list of every disk block (typically 4kb) that’s being used by the file. A 400kb file will have 100 block numbers. The second way is to store a range (extent). That is, a 400kb file could use 100 blocks starting at disk block number 1000.
XFS has a tool called xfs_bmap. It gives you a list of the extents allocated to a file.
So, let’s have a look at what it tells us about some recordings on my MythTV box.
myth@orpheus:~$ ls -lah myth-recordings/10_20050912183000_20050912190000.nuv
-rw-r--r-- 1 myth myth 452M 2005-09-12 19:00 myth-recordings/10_20050912183000_20050912190000.nuv
myth@orpheus:~$ xfs_bmap -v myth-recordings/10_20050912183000_20050912190000.nuv
myth-recordings/10_20050912183000_20050912190000.nuv:
 EXT: FILE-OFFSET       BLOCK-RANGE          AG AG-OFFSET            TOTAL
   0: [0..639]:         228712176..228712815  7 (21106232..21106871)   640
   1: [640..1663]:      83674040..83675063    2 (24358056..24359079)  1024
   2: [1664..923519]:   83675368..84597223    2 (24359384..25281239) 921856
   3: [923520..924031]: 84631272..84631783    2 (25315288..25315799)   512
Just to make things fun, this is all in 512-byte blocks. But anyway, the really interesting thing is the number of extents. Ideally, every file would have one extent as this means that we avoid disk seeks – *the* most expensive disk operation.
XFS also provides the xfs_fsr tool (File System Repacker) that can defragment files (even on a mounted file system). On IRIX this used to run out of cron – fun when a bunch of machines hit a CXFS volume all at the same time.
arjen_lentz: MySQL thread cache
It should be noted however that creating and destroying threads on some platforms is a very very cheap operation. Linux with NPTL (esp on x86) is one such platform.
(even without NPTL on x86 it’s still pretty cheap).
On PPC with LinuxThreads it’s quite expensive.
On PPC with MacOS X it’s also very expensive.
I think i’ve blogged about this previously.
But users on MacOS X, Windows and Linux without NPTL should certainly consider using the thread cache. Otherwise, if you’re on x86 with NPTL you probably don’t have to bother – or will at most notice a very small benefit.
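If you do want the thread cache, it’s a one-line my.cnf setting (the value 8 here is just an example, not a recommendation):

```ini
[mysqld]
# Keep up to 8 finished threads around for reuse instead of
# destroying them - helps where thread creation is expensive.
thread_cache_size = 8
```

If SHOW GLOBAL STATUS LIKE 'Threads_created' keeps climbing under a steady connection load, the cache is too small.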
struct __packed s { … }
This attribute tells GCC that a type or variable should be packed into memory, using the minimum amount of space possible, potentially disregarding alignment requirements. If specified on a struct or union, all variables therein are so packed. If specified on just a specific variable, only that variable is packed. As an example, in a structure with a char followed by an int, the integer would most likely be aligned to a memory address not immediately following the char (say, three bytes later). The compiler does this by inserting three bytes of unused padding between the two variables. A packed structure lacks this padding, potentially consuming less memory but failing to meet architecture alignment requirements.
It should also be noted that non-aligned data access on some architectures (e.g. PPC) can totally cripple performance. We’re talking orders of magnitude here. IBM has a good article on their developer site.
http://www-128.ibm.com/developerworks/power/library/pa-dalign/
Apple has some good tools for this too – and docs, if I remember correctly.
Blog | rml talks about a bunch of useful GCC extensions.
We generally don’t use this within MySQL code, due (no doubt) to portability issues. Maybe we should look closer at it these days. I wonder if we’d get any noticeable improvement in NDB by adding it to our ndbrequire/ndbassert and CRASH_INSERTION tests. In some areas of code we do have a number of asserts.
The place to play is ndb/src/kernel/vm/pc.hpp
As actually triggering an ndbassert or ndbrequire is something that should never happen, unlikely() is a good thing to put there.
Interestingly enough, this is probably also the place we want to play with for DTrace – either that or EXECUTE_DIRECT (and another place that Mikael mentioned on IRC last night).
My next task is to get the qemu network interface working so I can get the source across to my VM of Solaris 10 and then start playing.
Of course, the other option is to actually install it somewhere (or shell out for VMware). It would be a lot faster that way, though.
PortaWiki is going pretty well. We’ve got a couple of contributors at the moment and are getting good little bits on the various oddities of various platforms. I encourage you to check it out and add things that you know.
It’d be great to have a MySQL section there too. In versions previous to 5.0, for example, you may get different results from some math operations on different platforms, as we used the platform’s floating point. In 5.0 we have precision math, so this isn’t a problem – but it probably caused somebody to raise an eyebrow in the past. Volunteers?
it’s annoying. grr.
but, on the other hand, I am speaking about MySQL 5.0 at OSDC.
This is even cooler as 5.0 has gone GA. So it’s not “upcoming features” it’s the “here and now”.
I’ll now have to release MemberDB 0.4 (the MySQL release). Converting the Linux Australia installation over at some point soon too. The 0.4 tree fixes enough bugs that it’s worth it (one of which Pia found the other day).
We at MySQL AB have unleashed MySQL 5.0 upon the world. It’s now declared GA (stable) and recommended for use everywhere you can possibly fit it (yes, this means brain implants and other things we dare not mention).
On DevZone there’s also a photo titled “MySQL 5.0 Development Team” taken at the DevConf earlier this year in Prague (you can see the pretty buildings in the background). I have a bunch of nice photos from there. I plan to put the scenic ones up somewhere at some point.
There’s even a poll for your favourite new feature. There isn’t an option for “version number divisible by 5”, but hey :)
I am going to get up and dance.
This does mean I will look stupid, but it’s dancing towards ice cream. Everybody deserves ice cream. Especially those with MySQL 5.0. In fact, if you don’t have it – no ice cream for you. You know you want ice cream…
At AUUG2005 last week, Arjen, myself and others were discussing the idea of trying to assemble some sort of common resources that multiple projects can use to contribute and find out about portability issues they stumble across.
The idea being that we can all then learn from each other and write better, more portable software.
So, I’ve set something up.
I present, the incredibly bare (okay, not quite completely bare) PortaWiki.
Please add whatever stuff you find, you know or anything. No idea how this is going to work – I plan to let it evolve.
(Arjen tells me that Peter Gutmann should receive credit as he thinks he came up with the idea. Kudos to him).
I’m currently watching a Solaris 10 install under QEMU on my laptop. It seems to be taking a while, but getting there.
(I got a Solaris 10 DVD in my AUUG swag)
Basically, I want to play with DTrace and see how easy it is to do things with it. Solaris seems to be the requirement. I don’t want to have a partition for it nor run it as a primary OS. So, qemu it is.
I can also then use the funky disk image foo with qemu so that i don’t waste a lot of space (mmm… sparse disk images).
For a 7GB qemu-img created filesystem, used entirely as /, it seems that there’s 128MB of overhead for having the file system. The installer is chugging away writing things, and this seems to be constant.
So, all in all I should end up using a bit less than 3GB of real disk space for a full Solaris 10 install in a qemu image.
I can now give presentations from my laptop – yay.
It requires running the ATI binary drivers instead of the open source ones.
Then VGA out works without being squiggly (that’s on my Asus V6V laptop with a Radeon X600 running Ubuntu Breezy) – there, that should be enough Google juice.
However, as if being binary-only wasn’t crappy enough – suspend doesn’t work. So it’s open source drivers for all other times! I don’t use GL, so that doesn’t worry me. Of course, it may start to worry me what with all the neat cairo stuff and other acceleration coming… but not yet.
This should come in handy for the Melbourne MySQL Users Group meeting tomorrow night!