Monday, March 22, 2010

Replacing the hardware of the NIS+ master

The file system on the NIS+ master had become corrupt.

Not the disks. The root fs and swap were both mirrored with DiskSuite, and metastat reported everything to be in order!

But files and directories all over the shop were "missing" (an I/O error was actually reported on the command line), and a quick look in the messages file revealed the ugly truth.

There are a couple of good sites out on the internet which describe how to recover from this sort of problem:
SUN
Solaris FAQ

Luckily, I have a script which runs several times a day. It copies off the NIS_COLD_START, passwd, shadow and .rootkey files from /etc, executes a nisbackup -a command and tars the whole lot up into a file on a file server. How often this script needs to run depends upon the TTL of the domain. My domain still has the default of 12 hours, so in theory the script only needs to run twice a day, but 3 or 4 times would be better. I'm not sure it is worthwhile keeping all of the resulting tarfiles, but the last couple, maybe.
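For illustration, a backup job along those lines might look something like the sketch below. It is not the original script: the function name backup_nisplus, the NIS_FILES variable, the files/ and nisplus_backup/ layout inside the archive and the file locations are all assumptions to adjust for your own master.

```shell
#!/bin/sh
# Sketch only: names, paths and archive layout are assumptions.

# Files to save alongside the NIS+ dump (locations assumed; override as needed).
NIS_FILES="${NIS_FILES:-/etc/passwd /etc/shadow /etc/.rootkey /var/nis/NIS_COLD_START}"

backup_nisplus() {
    stage=$1      # scratch directory on the master
    tarfile=$2    # absolute path of the archive, e.g. on the file server mount

    # nisbackup will not create its target directory, so make it first.
    mkdir -p "$stage/files" "$stage/nisplus_backup" || return 1

    # Copy the flat files, preserving ownership and modes where possible.
    for f in $NIS_FILES; do
        cp -p "$f" "$stage/files/" || return 1
    done

    # Dump all the NIS+ directory objects this master serves.
    nisbackup -a "$stage/nisplus_backup" || return 1

    # Bundle everything with relative paths so it can be unpacked anywhere.
    (cd "$stage" && tar cf "$tarfile" files nisplus_backup)
}
```

It would be called from cron as root with something like backup_nisplus /var/tmp/nisstage /net/fileserver/backups/nisplus.tar, where both paths are made up for the example.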

As luck would have it, the server crashed while running this script, just after writing the tarfile to the file server. The script also tests the validity of the backup by extracting all the files to a temporary directory. I guess the corrupt filesystem just decided it couldn't handle that.
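The validity test can be as simple as unpacking the tarfile somewhere disposable and checking that something actually came out. A minimal sketch, assuming a mktemp command is available on the host; the function name verify_backup is invented for the example:

```shell
#!/bin/sh
# Sketch only: verify_backup is a hypothetical helper, not the original script.

verify_backup() {
    tarfile=$1                          # absolute path to the archive
    scratch=$(mktemp -d) || return 1    # throwaway extraction directory
    status=1

    if (cd "$scratch" && tar xf "$tarfile"); then
        # An archive that unpacks to no regular files counts as a failure.
        if [ -n "$(find "$scratch" -type f -print)" ]; then
            status=0
        fi
    fi

    rm -rf "$scratch"
    return $status
}
```

A non-zero return from verify_backup is the cue to keep the previous tarfile rather than overwrite it.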

I grabbed hold of an old Sun Blade 150 that was in the store cupboard for just such an eventuality, and changed its identity to be the same as the failed NIS+ master. I changed
/etc/hosts
/etc/hostname.eri0
/etc/nodename
/etc/net/ticlts/hosts
/etc/net/ticots/hosts
/etc/net/ticotsord/hosts
and entered the hostname command.
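With six files to edit by hand it is easy to miss one, so a quick check before moving on is worthwhile. The sketch below just greps each of the files listed above for the master's name and reports any that do not mention it. The function name check_identity and the optional root prefix argument are assumptions added for the example, and eri0 is the interface on this particular machine.

```shell
#!/bin/sh
# Sketch only: check_identity is a hypothetical helper.

check_identity() {
    name=$1          # hostname the replacement machine must answer to
    root=${2:-}      # optional prefix, handy for a dry run on copied files
    missing=0

    for f in /etc/hosts /etc/hostname.eri0 /etc/nodename \
             /etc/net/ticlts/hosts /etc/net/ticots/hosts \
             /etc/net/ticotsord/hosts; do
        # Output goes to /dev/null because not every grep has a quiet flag.
        if ! grep -w "$name" "$root$f" >/dev/null 2>&1; then
            echo "no entry for $name in $root$f"
            missing=1
        fi
    done

    return $missing
}
```

Running check_identity nismaster (with your own master's name in place of the made-up nismaster) prints nothing and returns 0 when every file has been updated.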

I ftp-ed my tarfile backup into /tmp and untar-ed it.

I copied the passwd, shadow and .rootkey files into /etc overwriting any existing files.

And then I entered nisrestore -f -a /tmp/nisplus_backup, ignoring any output.
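Put together, the restore steps above amount to something like the following sketch. It assumes the tarfile unpacks into a files/ directory holding passwd, shadow and .rootkey plus a nisplus_backup/ directory holding the nisbackup output; that layout, the function name restore_nisplus and the optional third argument are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: archive layout and function name are assumptions.

restore_nisplus() {
    tarfile=$1            # absolute path to the backup tarfile
    work=${2:-/tmp}       # where to unpack it
    etc=${3:-/etc}        # target for the saved files; override for a dry run

    # Unpack the archive into the working directory.
    (cd "$work" && tar xf "$tarfile") || return 1

    # Overwrite the replacement machine's own copies.
    for f in passwd shadow .rootkey; do
        cp -p "$work/files/$f" "$etc/$f" || return 1
    done

    # Restore all backed-up NIS+ directories, forcing the restore with -f.
    nisrestore -f -a "$work/nisplus_backup"
}
```

For the case described here the call would be restore_nisplus /tmp/backup.tar /tmp as root, where backup.tar stands in for whatever the tarfile is actually called.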

And then I shut down the server, moved it into the racks and restarted it.

As a test on the new server when it was back up, I ran nisping -C -a and also ran the script which backs up all the data.

The backup command failed!

Aarrgghh!

Luckily the problem was clear from the messages file. The directory you tell nisbackup to write to must exist before you enter the command.

Phew!

And that's that!
