I've been researching MySQL cluster today and my brain is on a bit of information overload so I thought it would be wise to jot down some key points I have learned thus far.
MySQL cluster differs from a traditional MySQL installment because it is designed to work across multiple machines. In the traditional setup you install MySQL on one node and have several slaves that replicate the data and serve as backup servers in case of failure. In the cluster scenario there are a minimal of 4 machines (technically 3, but if you care at all about your data you will need 4). Machines fall into one of three categories:
1. Management Nodes
Management nodes are important for defining the roles of each of the other nodes in your cluster. Here you setup config files that the other servers read to understand how the cluster is setup. You would need at most two of these, one just sitting there in case the primary fails. If the primary does fail, it will not cause the cluster to stop working.
2. MySQL Client Nodes
Client nodes are the machines that actually perform all of the queries received from clients. You can setup as many of these as you load demands, preferably at least two in case of failure. In MySQL 5.1 these nodes do perform caching, but there are some gotchas that you should read in the documentation.
3. Data Nodes
The nodes use the NDB storage engine to replicate your data across multiple machines. This is layer I have the most questions about and I'm trying to find out more. Prior to 5.1 all database data, including indexes was stored in memory. So if you have large tables you need to make sure your data (and indexes) will fit into the available memory. A new feature of 5.1 is the ability to store only the primary key information in memory and store the rest of the data on disk. This will help a lot of really large data sets.
So if you had 2 management nodes, 2 client nodes, and 2 data nodes any of the nodes in the cluster could be shut down and there would be no loss of uptime or data.
A nice feature added to 5.1 is the ability to replicate clusters. So if you are paranoid and afraid that an entire cluster could go down then you could fail over to the secondary cluster. I didn't read the entire doc on this topic, but it seems complicated and probably overkill for most situations.
It's important to understand how indexes are implemented in the cluster scenario. Primary keys are implemented as a hash index type. They do this so they can partition your data based on the hash. This brings up two key points that must be understood when looking into the clustering:
The following queries will work great and be very fast (from the performance pdf available online):
SELECT * FROM tbl WHERE pk = 5;
SELECT * FROM tbl WHERE pk = 'ABC' or pk = 'XYZ';
The following queries will not use the primary key index:
SELECT * FROM tbl WHERE pk < 100;
SELECT * FROM tbl WHERE pk_part1 = 'ABC';
The last example is especially important to understand. If you have a primary key that uses multiple columns (pk_part1, pk_part2) you must specify the value of both columns in order for the index to be used.
Ordered indexes are implemented using T-tree's instead of B-tree's. This is for memory saving purposes and the need to not have to worry about disk-seek times (everything is in memory). This really should make any difference to the client but I mention here because I thought it was a great idea.
Data retrieval in a clustered scenario is very important to understand. If you do a query based on the primary key, the client node can go directly to the data node that has your row and retrieve it. If you do a query which results in the client having to use an ordered index, the client must query in parallel every data node in the cluster. This may be really fast, but would result in much more network traffic than a primary key lookup. The worst kind of query would be one where the client node must do a full table scan. In this scenario the query must be sent to all nodes. Prior to version 5.0 the "where" clause was not sent to the data nodes so that meant that the entire data set was returned to the client for scanning. If you had 2 million rows, all 2 million would be sent over the network for scanning. Since 5.0 you can enable an option to perform the "where" on each data node and only return the rows that match the where clause.
I think picking a primary key is especially important when implementing a table that will reside in the cluster. Pick the wrong key and I think your queries will be much slower than the traditional database. On that note, it's not uncommon for a clustered database to be slower than a traditional database. If you look at everything that must happen to perform a query in the clustered scenario that makes sense. But I think the clustered database kicks but in the ability to maximize throughput.
As I mentioned earlier, my main questions so far are in regards to the data nodes. I haven't found any documentation that covers how data is spread across the nodes. For instance, if I had a configuration with 50 data nodes, and I have a database with 100 million rows, how much of the data would be on each node? Is the data triple replicated? Double replicated? Which data nodes are safe to shut down at the same time? Does anyone out there know or could point me in the right direction?
