You’d think that with RAID1 and multiple redundant, distributed backups (hourly, daily, and so on), you’d be safe. Think again. If your backup software lies to you, you may not realize it until it’s way too late. And if your RAID software does not deem a failed disk worth mentioning, what good is it?
Backing up our critical data
At Taodyne, one of our tools is a small Mac Mini server that hosts among other things a Redmine database with contacts and bugs. Well, let’s call them “feature requests”, since there can’t possibly be any bug in our amazing 3D presentation software!
Disks on this server have already caused us some downtime in the past; the machine runs pretty hot inside. Last time a disk died, we recovered from a Time Machine backup, which meant some downtime. So, to speed up recovery the next time we lost a disk, we decided to configure the two internal disks as a RAID1 pair.
The rationale was that when we lost a disk again (which I thought was bound to happen), downtime would be much shorter: we would just rebuild onto the surviving half of the pair. That said, some downtime is unavoidable on this machine, because changing a disk in a Mac Mini is no joke. But with a RAID1 setup, we could at least defer the swap until a convenient time and keep using the machine while it rebuilt the disk.
Why bother mentioning there’s a failing disk?
A first problem with the Apple system software is that it does not seem to think that a failing system disk deserves an e-mail. Our little server bugs me on a regular basis with “this certificate is about to expire” or “there’s a new version of iTunes” (on a server???).
But when one of the disks in the pair failed, there was no mention of it. There was nothing obvious in the logs, no warning when we connected to the server. It was only when I ran Disk Utility for a completely unrelated task that I noticed the RAID1 pair was orange.
My system disk is in “degraded” mode and the system does not tell me? Whaaat? I looked at it, and indeed, one disk was missing. I rebuilt it. A while later, the disk was missing again. Time to replace it.
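Since the OS stays silent, the only defense is to poll the RAID status yourself. Below is a minimal sketch of such a watchdog. It assumes the “Status:” line format printed by `diskutil appleRAID list` (worth double-checking on your OS version), and the actual alerting hook is left as a comment:

```python
import re
import subprocess

def degraded_sets(diskutil_output):
    """Return the status of every RAID set not reported as Online.

    Parses 'Status: ...' lines from `diskutil appleRAID list` output;
    the exact output format is an assumption to verify on your system.
    """
    statuses = re.findall(r"Status:\s*(\S+)", diskutil_output)
    return [s for s in statuses if s.lower() != "online"]

def check_raid():
    """Run diskutil (macOS only) and report any non-Online RAID set."""
    out = subprocess.run(
        ["diskutil", "appleRAID", "list"],
        capture_output=True, text=True,
    ).stdout
    problems = degraded_sets(out)
    if problems:
        # Hook your alerting here: e-mail, push notification, anything
        # noisier than an orange icon buried in Disk Utility.
        print("RAID set(s) not Online:", ", ".join(problems))
    return problems
```

Run something like this from cron or launchd every few minutes; the point is simply that a degraded mirror should interrupt someone, not wait to be discovered by accident.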
Of course, we also have a Time Machine backup, and I knew from the previous experience that we could recover our databases from it, so I was not overly concerned. We waited a bit to buy a new disk. Don’t do that.
Bad luck and saving grace
We had a stroke of bad luck at that point, and, fortunately, a saving grace.
The stroke of bad luck was that the “good” disk, the one we had been running on, failed during the rebuild of the RAID pair. We had put a new disk in the other slot and started the rebuild, and we felt pretty good. Then the system became very slow, alarmingly slow, then stopped completely. Even the mouse was stuck. Force reboot.
At that point, the system would not reboot at all. It turns out the rebuild of the second disk in the pair was not complete, but had apparently gone far enough that the system thought it had a valid boot disk and tried to boot from it. I’ll spare you the interesting error messages.
So we rebooted without the new disk. The system let me log in, then froze again. On the third boot I did not even reach the login prompt. At that point, I knew my “good” disk was bad too.
The saving grace was that the disk that had disconnected earlier turned out to be surprisingly healthy. We were able to rebuild a functioning RAID1 pair from it: I first rebuilt with a new disk and the old one, then rebuilt from that new disk onto another new disk. So in the end, I had a RAID1 pair with two new disks, but with contents that were a bit dated.
No problem, that’s what Time Machine is for.
Time Machine lies about the dates of the backups
That’s where we ran into our second issue. Since that disk had disconnected a while before the final failure, some recent data was missing from it. That was expected: I had hourly backups taken from the other disk, so I thought it would not be a problem.
And indeed, for most documents, recovery happened fairly well. We recovered documents we had saved minutes before shutting down, as well as a few very recent changes that we could really trust (e.g. git commits, which are cryptographically secure). So we knew that the backup we had was recent and good.
Except for one little thing. One of the applications the server runs is an instance of Redmine. It’s a completely encapsulated Bitnami stack, which means it has its own MySQL database, its own Apache server, etc. We had already restored it several times from Time Machine backups, so we thought nothing of it.
But this time, when I restored from backups, the data was two months old. It took me a while to realize that the problem was not with restoring the database, something I had done several times before.
The problem was that the Time Machine backup labelled “March 6, 2:20PM” contained a MySQL database dated January 7. Same thing for the Apache logs; same thing for basically the whole stack. And it was only that directory: all the others had up-to-the-minute backups. That specific directory had somehow been frozen in time for two months, despite being shown as part of a recent backup.
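A cheap way to catch this kind of silent staleness is to compare modification times across the backup tree: in an hourly backup, no top-level directory should have a newest file that is months old. A minimal sketch (the path to the latest Time Machine snapshot is an assumption you would fill in):

```python
import os
import time

def newest_mtime(root):
    """Most recent modification time of any file under root (0 if none)."""
    newest = 0.0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                newest = max(newest, os.path.getmtime(os.path.join(dirpath, name)))
            except OSError:
                pass  # file vanished or unreadable; skip it
    return newest

def stale_dirs(backup_root, max_age_days=2.0):
    """Top-level directories whose newest file is older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    return sorted(
        d for d in os.listdir(backup_root)
        if os.path.isdir(os.path.join(backup_root, d))
        and newest_mtime(os.path.join(backup_root, d)) < cutoff
    )
```

Had something like this run against the latest snapshot, the frozen Bitnami directory would have stood out immediately, instead of surfacing in the middle of a restore.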
Time Machine bug?
Clearly, there is a Time Machine bug involved here, although it has to be a pretty subtle one. Out of several hundred gigabytes of data, only that specific directory and its subdirectories showed the problem. I have no idea why.
As I write this, I’m attempting to rescue data from the failed disk using ddrescue. If that does not work, we’ll have to re-enter a dozen bugs manually (we have e-mails for them, so we know what they are).
But ultimately, it’s really annoying that with a RAID disk and several backup disks, we still managed to lose (a little bit of) data due to a fault in the backup software itself.
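One mitigation that would have saved us here is a periodic application-level dump of the database, independent of any file-level backup tool. A hedged sketch using `mysqldump` (a Bitnami stack ships its own client binaries; the database name, user, and destination path below are illustrative assumptions to adjust for your install):

```python
import datetime
import subprocess

def dump_path(dest_dir, now):
    """Timestamped destination file, e.g. .../redmine-20130306-1420.sql."""
    return "{}/redmine-{:%Y%m%d-%H%M}.sql".format(dest_dir, now)

def dump_redmine_db(dest_dir="/var/backups/redmine"):
    """Write a consistent SQL dump of the Redmine database.

    The database name and user are Bitnami-style guesses; adjust them
    (and point at the stack's own mysqldump binary) for your setup.
    """
    dest = dump_path(dest_dir, datetime.datetime.now())
    with open(dest, "w") as out:
        subprocess.run(
            ["mysqldump", "--single-transaction",
             "-u", "redmine", "bitnami_redmine"],
            stdout=out, check=True,
        )
    return dest
```

Unlike a file-level copy of a live MySQL data directory, a dump is consistent and trivially inspectable, so a backup frozen two months in the past would be obvious from the file names alone.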
Let’s not get used to faulty backup software. Maybe the problem is gone in more modern versions of Mac OS X, but given that this server has already eaten three disks and lost some data, I’m now tempted to switch to Linux.