disk space allocation (part 4: allocating an extent)

For XFS, in normal operation, an extent is only allocated when data has to be written to disk. This is called delayed allocation. If we extend a file by 50MB, that space is deducted from the total free space on the filesystem, but no decision on where to place the data is made until we start writing it out, either because of memory pressure or because the kernel starts writing the dirty pages out on its own (the sync once every 5 seconds on Linux).

When an extent needs to be allocated, XFS searches one of the two b+trees it keeps of free space. One is sorted by starting block number (so you can search for “an extent near here”) and the other by size (so you can search for “an extent of x size”).

Ideally, you want as large an extent as possible, as close to the tail end of the file as possible (i.e. just making the current extent bigger).
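To make the two kinds of lookup concrete, here is a rough sketch that uses sorted Python lists in place of the b+trees. The free extents, the function names and the "closest start block" rule are all made up for illustration; this is not how the XFS allocator actually chooses.

```python
import bisect

# Free extents as (start_block, length). Hypothetical data for illustration.
FREE = [(100, 8), (300, 64), (1000, 16), (5000, 512)]

# Two views of the same free space, standing in for the two b+trees.
by_start = sorted(FREE)                                        # keyed on start block
by_size = sorted((length, start) for start, length in FREE)   # keyed on length


def nearest_free(target_block):
    """Return the free extent whose start block is closest to target_block."""
    i = bisect.bisect_left(by_start, (target_block, 0))
    candidates = by_start[max(i - 1, 0):i + 1]   # the neighbours on either side
    return min(candidates, key=lambda e: abs(e[0] - target_block))


def smallest_fit(blocks_needed):
    """Return the smallest free extent of at least blocks_needed blocks, or None."""
    i = bisect.bisect_left(by_size, (blocks_needed, 0))
    if i == len(by_size):
        return None
    length, start = by_size[i]
    return (start, length)
```

With this data, nearest_free(320) picks the extent starting at block 300, and smallest_fit(10) picks the 16 block extent at block 1000.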

The worst-case scenario is having to allocate extents to multiple files at once with all of them being written out synchronously (O_SYNC or memory pressure) as this will cause lots of small extents to be created.
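A toy model shows why. The allocator below just hands out the next free blocks in request order, which is far simpler than anything XFS does, and the file names and request sizes are invented. It is only meant to show how interleaved small requests turn into many small extents.

```python
def allocate(requests):
    """Hand out blocks in request order; return the (start, length) extents per file."""
    next_free = 0
    extents = {}
    for name, blocks in requests:
        file_extents = extents.setdefault(name, [])
        if file_extents and file_extents[-1][0] + file_extents[-1][1] == next_free:
            start, length = file_extents[-1]
            file_extents[-1] = (start, length + blocks)   # contiguous: grow the last extent
        else:
            file_extents.append((next_free, blocks))      # otherwise start a new extent
        next_free += blocks
    return extents


# Synchronous writes: two files take turns asking for one block at a time.
sync = allocate([("a", 1), ("b", 1)] * 4)

# Delayed allocation: each file asks once, when its dirty pages are flushed.
delayed = allocate([("a", 4), ("b", 4)])
```

In the synchronous case each file ends up with four one-block extents; with delayed allocation each file gets a single four-block extent.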

disk space allocation (part 3: storing extents on disk)

Here I’m going to talk about how file systems store what part of the disk a part of the file occupies. If your database files are very fragmented, performance will suffer. How much depends on a number of things, however.

XFS can store some extents directly in the inode (see xfs_dinode.h). If I’m reading things correctly, this can be 2 extents per fork (data fork and attribute fork). If more than this number of extents are needed, a btree is used instead.

HFS/HFS+ can store up to 8 extents directly in the catalog file entry (see Apple TechNote 1150 – which was updated in March 2004 with information on the journal format). If the file has more than 8 extents, a lookup then needs to be done into the extents overflow file. Interestingly enough, in MacOS X 10.4 and above (I think it was 10.4… may have been 10.3 as well) if a file is less than 20MB and has more than 8 extents, on an open, the OS will automatically try to defragment that file. Arguably you should just fix your allocation strategy, but hey – maybe this does actually help.

File systems such as ext2, ext3 and reiserfs just store a list of block numbers. In the case of ext2 and ext3, the further into a file you are, the more steps are required to find the disk block number associated with that block in the file.
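As a sketch of what “more steps” means, the function below counts how many indirect blocks have to be read before you reach the data block. It assumes the classic ext2 layout (12 direct pointers in the inode, then single, double and triple indirect blocks, with 4-byte block numbers) and a 4KB block size; the function name and defaults are mine.

```python
def indirection_levels(file_block, block_size=4096, direct=12, ptr_size=4):
    """Number of indirect blocks to read before reaching file_block's data."""
    per_block = block_size // ptr_size   # block pointers held by one indirect block
    if file_block < direct:
        return 0                         # pointer lives in the inode itself
    remaining = file_block - direct
    for level in (1, 2, 3):              # single, double, triple indirect
        span = per_block ** level        # file blocks reachable at this level
        if remaining < span:
            return level
        remaining -= span
    raise ValueError("file block beyond the triple-indirect range")
```

Under these assumptions the first 48KB of a file needs no extra reads, the next 4MB needs one, the next 4GB needs two, and everything after that needs three.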

So what does an extent actually look like? Well, for XFS, the following excerpt from xfs_bmap_btree.h is interesting:

#define ISUNWRITTEN(x) ((x)->br_state == XFS_EXT_UNWRITTEN)

typedef struct xfs_bmbt_irec
{
	xfs_fileoff_t	br_startoff;	/* starting file offset */
	xfs_fsblock_t	br_startblock;	/* starting block number */
	xfs_filblks_t	br_blockcount;	/* number of blocks */
	xfs_exntst_t	br_state;	/* extent state */
} xfs_bmbt_irec_t;

It’s also rather self-explanatory. Holes (for sparse files) in XFS don’t have extents, and an extent doesn’t have to have been written to disk. This allows you to preallocate space in chunks without having written anything to it. Reading from an unwritten extent gets you zeros (otherwise it would be a security hole!).
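Here is a small sketch of how a lookup against a list of such records could behave. The Extent tuple only mirrors the fields of xfs_bmbt_irec, the linear scan stands in for the real in-inode list or btree, and the names are my own.

```python
from collections import namedtuple

# Mirrors the fields of xfs_bmbt_irec; the Python names are illustrative only.
Extent = namedtuple("Extent", "startoff startblock blockcount unwritten")


def lookup(extents, file_block):
    """Map a file block to (disk_block, state); state is 'data', 'unwritten' or 'hole'."""
    for ext in extents:
        if ext.startoff <= file_block < ext.startoff + ext.blockcount:
            disk_block = ext.startblock + (file_block - ext.startoff)
            return disk_block, ("unwritten" if ext.unwritten else "data")
    return None, "hole"   # no extent covers this offset: a sparse region


# A written extent, a hole at file blocks 10..19, then a preallocated extent.
example = [
    Extent(startoff=0, startblock=1000, blockcount=10, unwritten=False),
    Extent(startoff=20, startblock=5000, blockcount=5, unwritten=True),
]
```

A reader would hand back zeros for both the 'hole' and 'unwritten' results and only go to disk for 'data'.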

disk space allocation (part 2: examining your database files)

memberdb/log.MYD:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..943]:        5898248..5899191  3 (36536..37479)     944
   1: [944..1023]:     6071640..6071719  3 (209928..210007)    80
   2: [1024..1127]:    6093664..6093767  3 (231952..232055)   104
   3: [1128..1279]:    6074800..6074951  3 (213088..213239)   152
   4: [1280..1407]:    6074672..6074799  3 (212960..213087)   128
   5: [1408..1423]:    6074264..6074279  3 (212552..212567)    16
memberdb/log.MYI:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET        TOTAL
   0: [0..7]:          10165832..10165839  5 (396312..396319)     8

The interesting thing about this is that the log table grows very slowly. This table stores a bunch of debugging output for my memberdb application. It should possibly be a partitioned ARCHIVE table (and probably will be in the future).

The thing about a file growing slowly over time is that it’s more likely to have more than 1 extent (I’ll examine why in the near future).

My InnoDB data and log files only have 1 extent. I think I’ve run xfs_fsr on my file system, though.

disk space allocation (part 1: seeing what’s happened)

(A little while ago I was writing a really long entry on everything possible. I realised that this would be a long read for people and that fewer people would look at it, so I’ve split it up.)

This sprung out of doing work on the NDB disk data tree. Anything where efficient use of the filesystem is concerned tickles my fancy, so I went to have a look at what was going on.

Filesystems store what part of the disk belongs to what file in one of two ways. The first is to keep a list of every disk block (typically 4KB) that’s being used by the file. A 400KB file will have 100 block numbers. The second way is to store a range (extent). That is, a 400KB file could use 100 blocks starting at disk block number 1000.
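The two representations carry the same information when the blocks are contiguous. A minimal sketch of going from a block list to extents (the function name is mine):

```python
def blocks_to_extents(block_numbers):
    """Collapse an ordered list of disk block numbers into (start, length) extents."""
    extents = []
    for block in block_numbers:
        if extents and block == extents[-1][0] + extents[-1][1]:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)   # contiguous: grow the last extent
        else:
            extents.append((block, 1))          # gap: start a new extent
    return extents


# The 400KB example: 100 block numbers starting at block 1000 become one extent.
one_extent = blocks_to_extents(list(range(1000, 1100)))
```

Here one_extent is [(1000, 100)], so 100 numbers shrink to a single pair. A fragmented file gets one pair per contiguous run instead.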

XFS has a tool called xfs_bmap. It gives you a list of the extents allocated to a file.

So, let’s have a look at what it tells us about some recordings on my MythTV box.

myth@orpheus:~$ ls -lah myth-recordings/10_20050912183000_20050912190000.nuv
 -rw-r--r--  1 myth myth 452M 2005-09-12 19:00 myth-recordings/10_20050912183000_20050912190000.nuv
myth@orpheus:~$ xfs_bmap -v myth-recordings/10_20050912183000_20050912190000.nuv
myth-recordings/10_20050912183000_20050912190000.nuv:
 EXT: FILE-OFFSET       BLOCK-RANGE          AG AG-OFFSET             TOTAL
   0: [0..639]:         228712176..228712815  7 (21106232..21106871)    640
   1: [640..1663]:      83674040..83675063    2 (24358056..24359079)   1024
   2: [1664..923519]:   83675368..84597223    2 (24359384..25281239) 921856
   3: [923520..924031]: 84631272..84631783    2 (25315288..25315799)    512

Just to make things fun, this is all in 512-byte blocks. But anyway, the really interesting thing is the number of extents. Ideally, every file would have one extent, as this means that we avoid disk seeks – *the* most expensive disk operation.
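If you want to do the arithmetic on that output, something like the following sketch works. The regular expression is written against the rows shown above only; it skips anything that doesn’t match (such as the header), and I haven’t tried it on other xfs_bmap output.

```python
import re

BMAP_OUTPUT = """\
 EXT: FILE-OFFSET       BLOCK-RANGE          AG AG-OFFSET             TOTAL
   0: [0..639]:         228712176..228712815  7 (21106232..21106871)    640
   1: [640..1663]:      83674040..83675063    2 (24358056..24359079)   1024
   2: [1664..923519]:   83675368..84597223    2 (24359384..25281239) 921856
   3: [923520..924031]: 84631272..84631783    2 (25315288..25315799)    512
"""

ROW = re.compile(
    r"^\s*(\d+): \[(\d+)\.\.(\d+)\]:"    # extent number and file offset range
    r"\s+(\d+)\.\.(\d+)"                  # disk block range
    r"\s+(\d+)\s+\((\d+)\.\.(\d+)\)"      # AG number and AG offset range
    r"\s+(\d+)\s*$"                       # total blocks
)


def parse_bmap(text):
    """Return one (ext, file_start, file_end, total_blocks) tuple per extent row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # header and file-name lines do not match and are skipped
            rows.append((int(m.group(1)), int(m.group(2)),
                         int(m.group(3)), int(m.group(9))))
    return rows


extents = parse_bmap(BMAP_OUTPUT)
total_bytes = sum(total for _, _, _, total in extents) * 512   # 512-byte blocks
```

The four extents add up to 924032 blocks, or 473104384 bytes (a little over 451MB), which is in line with the 452M that ls reports.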

XFS also provides the xfs_fsr tool (File System Repacker) that can defragment files (even on a mounted file system). On IRIX this used to run out of cron – fun when a bunch of machines hit a CXFS volume all at the same time.

arjen_lentz: MySQL thread cache

It should be noted, however, that creating and destroying threads on some platforms is a very, very cheap operation. Linux with NPTL (especially on x86) is one such platform.

(Even without NPTL on x86 it’s still pretty cheap.)

On PPC with LinuxThreads it’s quite expensive.

On PPC with MacOS X it’s also very expensive.

I think I’ve blogged about this previously.

But users on MacOS X, Windows and Linux without NPTL should certainly consider using the thread cache. Otherwise, if you’re on x86 with NPTL you probably don’t have to bother – or at least you’ll notice only a very small benefit.

__packed

Blog | rml

struct __packed s { … }

This attribute tells GCC that a type or variable should be packed into memory, using the minimum amount of space possible, potentially disregarding alignment requirements. If specified on a struct or union, all variables therein are so packed. If specified on just a specific variable, only that type is packed. As an example, a structure with a char followed by an int would most likely find the integer aligned to a memory address not immediately following the char (say, three bytes later). The compiler does this by inserting three bytes of unused packing between the two variables. A packed structure lacks this packing, potentially consuming less memory but failing to meet architecture alignment requirements.
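You can see the same padding without a C compiler. This is not the GCC attribute itself, just a quick sketch using Python’s struct module, where the “@” prefix lays fields out with the platform’s native alignment and “=” lays them out with no padding at all.

```python
import struct

# "c" is a char and "i" is an int, matching the char-then-int example above.
aligned_size = struct.calcsize("@ci")   # native alignment: padding may follow the char
packed_size = struct.calcsize("=ci")    # standard sizes, no padding: 1 + 4 bytes
padding = aligned_size - packed_size    # bytes the native layout spends on alignment
```

packed_size is always 5. On the common platforms where an int is 4 bytes and 4-byte aligned, aligned_size comes out as 8, i.e. the three bytes of padding described above.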

It should also be noted that unaligned data access on some architectures (e.g. ppc) can totally cripple performance. We’re talking orders of magnitude here. IBM has a good article on their developer site.

http://www-128.ibm.com/developerworks/power/library/pa-dalign/

Apple has some good tools for this too – and docs if I remember correctly.

rml on GCC extensions (and making them portable)

Blog | rml talks about a bunch of useful GCC extensions.

We generally don’t use these within MySQL code, due (no doubt) to portability issues. Maybe we should look closer at them these days. I wonder if we’d get any noticeable improvement in NDB by adding unlikely() to our ndbrequire/ndbassert and CRASH_INSERTION tests. In some areas of code we do have a number of asserts.

The place to play is ndb/src/kernel/vm/pc.hpp

As actually triggering an ndbassert or ndbrequire is something that should never happen, unlikely() is a good thing to put there.

Interestingly enough, this is probably the place we also want to play with for DTrace. Either that or with EXECUTE_DIRECT (and another place that Mikael mentioned on IRC last night).

My next task is to get the qemu network interface working so I can get the source across to my VM of Solaris 10 and then start playing.

Of course, the other option is to actually install it somewhere (or shell out for vmware). It would be a lot faster then though.

PortaWiki going well (Wiki for portability issues)

PortaWiki is going pretty well. We’ve got a couple of contributors at the moment and are getting good little bits on the various oddities of various platforms. I encourage you to check it out and add things that you know.

It’d be great to have a MySQL section there too. In versions previous to 5.0 for example, you may get different results from some math operations on different platforms as we used the floating point stuff. In 5.0 we have precision math so this isn’t a problem – but it probably caused somebody to raise an eyebrow in the past. Volunteers?

must be time to use the OSDC conference registration/paper submission site

it’s annoying. grr.

but, on the other hand, I am speaking about MySQL 5.0 at OSDC.

This is even cooler as 5.0 has gone GA. So it’s not “upcoming features” it’s the “here and now”.

I’ll now have to release MemberDB 0.4 (the MySQL release). Converting the Linux Australia installation over at some point soon too. The 0.4 tree fixes enough bugs that it’s worth it (one of which Pia found the other day).

MySQL 5.0 is GA

We at MySQL AB have unleashed MySQL 5.0 upon the world. It’s now declared GA (stable) and recommended for use everywhere you can possibly fit it (yes, this means brain implants and other things we dare not mention).

On DevZone there’s also a photo titled “MySQL 5.0 Development Team” taken at the DevConf earlier this year in Prague (you can see the pretty buildings in the background). I have a bunch of nice photos from there. I plan to put the scenic ones up somewhere at some point.

There’s even a poll for your favourite new feature. There isn’t an option for “version number divisible by 5”, but hey :)

I am going to get up and dance.

This does mean I will look stupid, but it’s dancing towards ice cream. Everybody deserves ice cream. Especially those with MySQL 5.0. In fact, if you don’t have it – no ice cream for you. You know you want ice cream…

PortaWiki – collaboration on portability issues

At AUUG2005 last week, Arjen, myself and others were discussing the idea of trying to assemble some sort of common resource that multiple projects can contribute to and use to find out about portability issues they stumble across.

The idea being that we can all then learn from each other and write better, more portable software.

So, I’ve set something up.

I present, the incredibly bare (okay, not quite completely bare) PortaWiki.

Please add whatever stuff you find, you know or anything. No idea how this is going to work – I plan to let it evolve.

(Arjen tells me that Peter Gutmann should receive credit as he thinks he came up with the idea. Kudos to him).

http://www.flamingspork.com/portawiki/

Solaris 10 under QEMU

I’m currently watching a Solaris 10 install under QEMU on my laptop. It seems to be taking a while, but getting there.

(I got a Solaris 10 DVD in my AUUG shwag)

Basically, I want to play with DTrace and see how easy it is to do things with it. Solaris seems to be the requirement. I don’t want to have a partition for it nor run it as a primary OS. So, qemu it is.

I can also then use the funky disk image foo with qemu so that I don’t waste a lot of space (mmm… sparse disk images).

For a 7GB qemu-img created filesystem, used entirely as /, it seems that there’s 128MB overhead for having the file system. The installer is chugging away writing things and this seems to be constant.

So, all in all I should end up using a bit less than 3GB of real disk space for a full Solaris 10 install in a qemu image.

VGA Out and presentations

I can now give presentations from my laptop – yay.

It requires running the ATI binary drivers instead of the open source ones.

Then VGA out works without being squiggly (that’s on my Asus V6V laptop with a Radeon X600 running Ubuntu Breezy). There, that should be enough Google juice.

However, as if being binary only wasn’t crappy enough – suspend doesn’t work. So it’s open source drivers for all other times! I don’t use GL, so that doesn’t worry me. Of course, it may start to worry me what with all the neat cairo stuff and other acceleration coming… but not yet.

This should come in handy for the Melbourne MySQL Users Group meeting tomorrow night!

a funky thing done last week…

still have to talk to people about standards for this sort of thing and all that. But as a first checkin – funkyness++!

mysql> select * from INFORMATION_SCHEMA.DATAFILES;  select * 
from INFORMATION_SCHEMA.TABLESPACES;
Empty set (0.03 sec)

Empty set (0.00 sec)

mysql> CREATE TABLESPACE ts1 ADD DATAFILE 'datafile.dat' USE 
LOGFILE GROUP lg1 INITIAL_SIZE = 12M ENGINE NDB;
Query OK, 0 rows affected (2.35 sec)

mysql> select * from INFORMATION_SCHEMA.DATAFILES;  select * 
from INFORMATION_SCHEMA.TABLESPACES;
+--------------+--------+--------------+----------+------+------------+
| NAME         | ENGINE | PATH         | SIZE     | FREE | TABLESPACE |
+--------------+--------+--------------+----------+------+------------+
| datafile.dat | NDB    | datafile.dat | 12582912 |   11 |            |
+--------------+--------+--------------+----------+------+------------+
1 row in set (0.00 sec)

+------+--------+---------+-------------+-----------------------+
| NAME | ENGINE | VERSION | EXTENT_SIZE | DEFAULT_LOGFILE_GROUP |
+------+--------+---------+-------------+-----------------------+
| ts1  | NDB    |       1 |     1048576 |                     0 |
+------+--------+---------+-------------+-----------------------+
1 row in set (0.00 sec)

mysql> CREATE TABLE t1 
(pk1 int not null primary key auto_increment,
 b int not null, 
c int not null) 
tablespace ts1 storage disk engine ndb; 
Query OK, 0 rows affected (0.62 sec)

mysql> insert into t1 (b,c) values (1,2),(2,3),(3,4),(1,2),(2,3),(3,4),(1,2),(2,3),(3,4),
(1,2),(2,3),(3,4),(1,2),(2,3),(3,4),(1,2),(2,3),(3,4),(1,2),(2,3),(3,4),(1,2),
(2,3),(3,4),(1,2),(2,3),(3,4),(1,2),(2,3),(3,4),(1,2),(2,3),(3,4),(1,2),(2,3),(3,4);
Query OK, 36 rows affected (0.11 sec)
Records: 36  Duplicates: 0  Warnings: 0

mysql> select * from INFORMATION_SCHEMA.DATAFILES;  select * 
from INFORMATION_SCHEMA.TABLESPACES; 
+--------------+--------+--------------+----------+------+------------+ 
| NAME         | ENGINE | PATH         | SIZE     | FREE | TABLESPACE |
+--------------+--------+--------------+----------+------+------------+
| datafile.dat | NDB    | datafile.dat | 12582912 |    9 |            |
+--------------+--------+--------------+----------+------+------------+
1 row in set (0.02 sec)

+------+--------+---------+-------------+-----------------------+
| NAME | ENGINE | VERSION | EXTENT_SIZE | DEFAULT_LOGFILE_GROUP |
+------+--------+---------+-------------+-----------------------+
| ts1  | NDB    |       1 |     1048576 |                     0 |
+------+--------+---------+-------------+-----------------------+
1 row in set (0.00 sec)

mysql> CREATE TABLESPACE ts2 ADD DATAFILE 'datafile2.dat' 
USE LOGFILE GROUP lg1 INITIAL_SIZE = 12M ENGINE NDB;
Query OK, 0 rows affected (2.18 sec)

mysql> select * from INFORMATION_SCHEMA.DATAFILES;  select * 
from INFORMATION_SCHEMA.TABLESPACES;
+---------------+--------+---------------+----------+------+------------+
| NAME          | ENGINE | PATH          | SIZE     | FREE | TABLESPACE |
+---------------+--------+---------------+----------+------+------------+
| datafile2.dat | NDB    | datafile2.dat | 12582912 |   11 |            |
| datafile.dat  | NDB    | datafile.dat  | 12582912 |    9 |            |
+---------------+--------+---------------+----------+------+------------+
2 rows in set (0.02 sec)

+------+--------+---------+-------------+-----------------------+
| NAME | ENGINE | VERSION | EXTENT_SIZE | DEFAULT_LOGFILE_GROUP |
+------+--------+---------+-------------+-----------------------+
| ts1  | NDB    |       1 |     1048576 |                     0 |
| ts2  | NDB    |       1 |     1048576 |                     0 |
+------+--------+---------+-------------+-----------------------+
2 rows in set (0.00 sec)

mysql> ALTER TABLESPACE ts1 ADD DATAFILE 'datafile3.dat' 
INITIAL_SIZE=12M ENGINE NDB;
Query OK, 0 rows affected (1.85 sec)

mysql> select * from INFORMATION_SCHEMA.DATAFILES;  select *
 from INFORMATION_SCHEMA.TABLESPACES;
+---------------+--------+---------------+----------+------+------------+
| NAME          | ENGINE | PATH          | SIZE     | FREE | TABLESPACE |
+---------------+--------+---------------+----------+------+------------+
| datafile2.dat | NDB    | datafile2.dat | 12582912 |   11 |            |
| datafile3.dat | NDB    | datafile3.dat | 12582912 |   11 |            |
| datafile.dat  | NDB    | datafile.dat  | 12582912 |    9 |            |
+---------------+--------+---------------+----------+------+------------+
3 rows in set (0.02 sec)

+------+--------+---------+-------------+-----------------------+
| NAME | ENGINE | VERSION | EXTENT_SIZE | DEFAULT_LOGFILE_GROUP |
+------+--------+---------+-------------+-----------------------+
| ts1  | NDB    |       1 |     1048576 |                     0 |
| ts2  | NDB    |       1 |     1048576 |                     0 |
+------+--------+---------+-------------+-----------------------+
2 rows in set (0.00 sec)

The ‘free’ column is really the number of free extents. Not exactly ideal… maybe… but since that’s the unit of allocation in the data files, it sort of makes sense. The other option is to list number of extents * extent size. Maybe that’s clearer for people… there is the option of denormalising the tables and having extent size in the DATAFILES table too. There is something in my brain that makes that a hard leap though.
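For the output above, the “number of extents * extent size” version works out as follows. The numbers are copied from the final DATAFILES and TABLESPACES results; the dictionary and function names are just for this sketch.

```python
# Rows from INFORMATION_SCHEMA.DATAFILES above: name -> (size in bytes, free extents)
DATAFILES = {
    "datafile.dat": (12582912, 9),
    "datafile2.dat": (12582912, 11),
    "datafile3.dat": (12582912, 11),
}
EXTENT_SIZE = 1048576  # EXTENT_SIZE from INFORMATION_SCHEMA.TABLESPACES, in bytes


def free_bytes(free_extents, extent_size=EXTENT_SIZE):
    """Convert a count of free extents into free bytes."""
    return free_extents * extent_size


report = {name: free_bytes(free) for name, (size, free) in DATAFILES.items()}
```

So a FREE value of 9 is 9437184 bytes, and 11 is 11534336 bytes, out of a 12582912 byte data file.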

Although…. if you’re going to be querying the tables directly and not just using a pretty gui on top of it all, you should probably know what you’re doing anyway.

Although, both a great benefit (and curse) of commoditising the database market is the fact that you get all sorts as users. This is interesting in cluster as it is naturally a bit more complex than a simple client-server RDBMS.

We also need a NODE column, which will probably cause confusion for non-cluster users and the like :)

(For the uninitiated, this is work being done in a branch off the 5.1 tree for NDB disk data. We’ll push it to the main 5.1 tree at some point.) Don’t go thinking this is production ready any time soon (in other words, insert a standard disclaimer).

Melbourne MySQL User Group

I’m getting responses of people wanting to come to the next meeting. This is all good. Looks like we’ll have more people than last time (which is good). So growing it is.

Considering that a lot of people got information on the last one less than a week before the event, and this time it’s about two weeks – I feel good about it.

Full details at: http://mysql.meetup.com/93/

the bastedo blog – replication in mysql 5

the bastedo blog

(I did matching versions [5.x] don’t know how diff versions will work)

Setting up a slave with a newer version of MySQL is quite a common setup. It has a couple of advantages:
– it lets you test a new version before deploying on the master (to test that everything goes smoothly)
– it lets you test new major versions (e.g. 5.0) before they are released GA (helps find bugs that may affect your setup).

I know at least one customer generally has a slave running the latest BK tree – just to be sure that nothing is going to even potentially break for them. Kudos to them :)

Having a slave that you use for backups is a great idea. No extra load on the master (i.e. you can safely stop the db on the slave and back things up quickly – without having locks held on your master!).

Also, if your master suffers a meltdown, you have a recent live backup system ready to take its place!

Microsoft loses in Eolas patent ruling | CNET News.com

Come on Microsoft – join us in the fight against software patents. This clearly hurts the entire industry – be it big vendors like yourself or small ones.

Let’s not all get royally screwed.

MySQL Melbourne Meetup

When:
Tuesday September 13th 7:00pm

Where:
Miro International Pty Ltd
Level 18, 31 Queen Street
Melbourne 3000, Australia

Thanks to Miro for offering their offices for the meeting.

What’s happening?
Stewart Smith will talk about:
– What’s new in MySQL land
– Introduction to MySQL AB the company
  – what it does
  – what it offers
– Graphical tools for MySQL
  – MySQL Administrator
  – MySQL Query Browser

RSVP
Please RSVP via our meetup.com site:
http://mysql.meetup.com/93/

After
We can head to a pub or out for curry.