Tuesday, May 09, 2006

Update on MySQL Issues

Last week I wrote about a strange issue with MySQL replication that I thought I wouldn't see again for a long time. Well, around 1:00am this morning another MySQL server (in a different cluster than the previous one) had the exact same issue as the first cluster. This time, my first step was to reboot the machine to see if that would fix the problem and spare me from spending a lot of time exporting and importing data. The reboot stopped the constant restarting of the mysql server, but the replication thread failed to start. In fact, the slave thread was failing to start on both machines. Issuing start slave produced this in the error log of the machine with the problem:


060509 1:32:09 [Note] Slave I/O thread: connected to master 'replxxx@xxx-slave1:3306', replication started in log 'xxx-slave1-bin.004981' at position 79
060509 1:32:09 [ERROR] Error reading packet from server: Could not find first log file name in binary log index file (server_errno=1236)
060509 1:32:09 [ERROR] Got fatal error 1236: 'Could not find first log file name in binary log index file' from master when reading data from binary log
060509 1:32:09 [ERROR] Slave I/O thread exiting, read up to log 'xxx-slave1-bin.004981', position 79
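
For anyone who wants to spot this condition in an error log automatically, here is a minimal Python sketch that pulls the error number, the message, and the last log file and position out of lines shaped like the ones above. The function name and the regular expressions are my own assumptions based only on this excerpt; other MySQL versions may format these lines differently.

```python
import re

# Matches the "Got fatal error NNNN: '...'" line from the excerpt above.
FATAL_RE = re.compile(r"\[ERROR\] Got fatal error (\d+): '([^']*)'")
# Matches the line where the I/O thread reports the last log file and position it read.
EXIT_RE = re.compile(
    r"Slave I/O thread exiting, read up to log '([^']+)', position (\d+)"
)


def summarize_slave_errors(lines):
    """Collect the fatal error and last replication coordinates from log lines.

    Hypothetical helper: fields stay None when no matching line is found.
    """
    summary = {"errno": None, "message": None, "log_file": None, "position": None}
    for line in lines:
        fatal = FATAL_RE.search(line)
        if fatal:
            summary["errno"] = int(fatal.group(1))
            summary["message"] = fatal.group(2)
        exited = EXIT_RE.search(line)
        if exited:
            summary["log_file"] = exited.group(1)
            summary["position"] = int(exited.group(2))
    return summary
```

Feeding it the four log lines above should yield error number 1236 together with the log file name and position from the last line, which are the values you need to have on hand before touching the replication settings.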


This time I was determined to fix the problem without dumping data, so I Googled a bit for the error and found this great post:

http://archives.neohapsis.com/archives/mysql/2004-q1/2000.html

Ah, so after creating the missing file and issuing a reset slave, the problem was fixed. On the other master I simply had to run a 'change master to…' command and start slave, and replication was working in both directions again.
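
As a rough illustration of that sequence, here is a small Python sketch that only builds the statements as strings so they can be reviewed before being run by hand. The helper names, the host name, the log file name, and the position are all hypothetical placeholders; the actual arguments I passed to change master to are not reproduced here, and creating the missing file is a manual step this sketch does not cover.

```python
def quote(value):
    """Wrap a value as a MySQL string literal, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def change_master_statement(host, log_file, log_pos):
    """Build a CHANGE MASTER TO statement from placeholder coordinates."""
    return (
        "CHANGE MASTER TO "
        f"MASTER_HOST={quote(host)}, "
        f"MASTER_LOG_FILE={quote(log_file)}, "
        f"MASTER_LOG_POS={int(log_pos)}"
    )


def recovery_plan(host, log_file, log_pos):
    """Group the statements by the machine they would be run on.

    Assumption: this mirrors the order described in the post, nothing more.
    """
    return {
        # Run on the machine that logged error 1236, after creating the missing file.
        "problem_machine": ["RESET SLAVE"],
        # Run on the other master to point it at the chosen coordinates.
        "other_master": [
            change_master_statement(host, log_file, log_pos),
            "START SLAVE",
        ],
    }


if __name__ == "__main__":
    # Placeholder values only; substitute your own host, log file, and position.
    plan = recovery_plan("db-a.example", "db-a-bin.000001", 4)
    for machine, statements in plan.items():
        for statement in statements:
            print(f"{machine}: {statement};")
```

Keeping the statements as plain strings is deliberate: with replication already broken, I would rather read each one before pasting it into the mysql client than have a script run them for me.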

I think there is definitely a bug in the MySQL code somewhere, but I'm not sure what triggers it. I don't think it's related to the number of days a server has been up, but it might be related to the number of connections that have been opened since it was last restarted. It may also just be a corrupt relay log file.