Saturday, April 29, 2006

MySQL Replication and Heartbeat

Velvet, Marisa, Joel, Robert, and Andrew all did great in their respective half marathons and full marathons on Saturday. I certainly have a lot of respect for anyone who can endure all of the pain and agony involved. After we got back to the hotel I lay down for a power nap before dinner (yes, it was tiring watching all of those runners). I woke up to a phone alert letting me know a MySQL server was acting up. I logged into the system and saw all sorts of errors and continuous notices that MySQL was restarting for every new connection. I had never seen this problem before, so I stopped heartbeat so the other master would take over the IP. We run most of our MySQL servers in a master-master configuration, which means either server can immediately take over as the master if the other goes down.

Heartbeat and MySQL replication allowed me to take my time and debug the problem without any service interruption. As for the debugging, I didn't get very far. I spent about an hour looking through log files, moving directories around (including to a ramdisk to make sure it wasn't a hard drive acting weird), etc. I was about to give the order to re-image the machine and start over when I had Bill reboot the machine, because sometimes that magically fixes things. Well, guess what: it did. I didn't check the uptime before the reboot, but according to the other master it had been up about 284 consecutive days without a reboot. Not bad.

After the reboot fixed the restart problem, I decided to take a few safety precautions and assume the data was corrupt. So I moved the existing data aside as a backup copy, dumped everything from the other master, re-imported it, ran several tests, and fired heartbeat back up. Everything looked good, and the whole process took about 4 or 5 hours. The restore took a while because we have some tables with over 9 million rows.
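One test worth running before starting heartbeat again is a replication health check. The following is only a sketch, not the script used that weekend: it assumes the output of MySQL's `SHOW SLAVE STATUS` has already been fetched into a dictionary, and the function name `replication_problems` and the 30-second lag limit are made up for illustration.

```python
from typing import List, Mapping


def replication_problems(status: Mapping[str, object],
                         max_lag_seconds: int = 30) -> List[str]:
    """Return a list of problems found in one SHOW SLAVE STATUS row.

    `status` maps column names to values. An empty list means the
    replica looks healthy. The default lag limit is an arbitrary
    assumption, not a recommended value.
    """
    problems = []

    # Both replication threads must report "Yes" for replication to work.
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("I/O thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")

    # Seconds_Behind_Master is NULL (None here) when lag cannot be measured.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        problems.append("lag is unknown (Seconds_Behind_Master is NULL)")
    elif int(lag) > max_lag_seconds:
        problems.append(
            f"replica is {int(lag)}s behind (limit {max_lag_seconds}s)"
        )

    return problems


if __name__ == "__main__":
    sample = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": "0",
    }
    print(replication_problems(sample))  # prints []
```

In a master-master pair each server replicates from the other, so a check like this would need to pass on both machines before the repaired one is trusted with the shared IP again.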

So my vacation in Nashville turned into a work weekend thanks to a MySQL server going crazy. Luckily, we have the proper monitoring, replication, and failover systems in place. Thank you, Bill and team!