Thursday, July 22, 2010

Solaris Zone down

I made a minor configuration change to a Solaris zone and rebooted the zone.

On the way back up it hung with an error message about booting.

I entered zoneadm list -cv with the following results

root@server01 # zoneadm list -cv

ID NAME STATUS PATH BRAND IP
0 global running / native shared
1 licence3-zone running /export/home/zones/licence3-zone native shared
2 licence2-zone running /export/home/zones/licence2-zone native shared
5 licence-zone running /export/home/zones/licence-zone native shared
7 licence1-zone running /export/home/zones/licence1-zone native shared
21 la-zone running /export/home/zones/la-zone native shared
23 build-machine down /export/home/zones/build-machine native shared
24 build-2 running /export/home/zones/build-2 native shared
- build-1 installed /export/home/zones/build-1 native shared
root@server01#
The build-machine zone's state was shown as down!

I entered
zoneadm -z build-machine halt
but it failed with a message saying the zone's /tmp directory couldn't be unmounted.

Entering
zoneadm -z build-machine reboot
failed with a similar message.

Entering
zoneadm -z build-machine boot
was the same.

So I traversed the zone's filesystem from the global zone. I'd downloaded some files to the zone's /tmp directory. I deleted the entire contents and entered
zoneadm list -cv
Its state was still shown as down!
I entered zoneadm -z build-machine halt, which returned without error. A zoneadm list -cv showed some good news.

root@server01 # zoneadm -z build-machine halt
root@server01 # zoneadm list -cv
ID NAME STATUS PATH BRAND IP
0 global running / native shared
1 licence3-zone running /export/home/zones/licence3-zone native shared
2 licence2-zone running /export/home/zones/licence2-zone native shared
5 licence-zone running /export/home/zones/licence-zone native shared
7 licence1-zone running /export/home/zones/licence1-zone native shared
21 la-zone running /export/home/zones/la-zone native shared
24 build-2 running /export/home/zones/build-2 native shared
- build-machine installed /export/home/zones/build-machine native shared
- build-1 installed /export/home/zones/build-1 native shared
root@server01 #

And it booted successfully. Huzzah!
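For anyone wanting to script the check, here is a minimal sketch. It assumes the colon-separated output of zoneadm list -cp (zoneid:zonename:state:zonepath:...). The function name zone_state is my own invention, and the sample lines are hand-written from the listing above rather than captured from the server. On Solaris 10 you would probably need nawk rather than the old /usr/bin/awk.

```shell
#!/bin/sh
# zone_state NAME
# Reads `zoneadm list -cp` style output on stdin (colon-separated fields:
# zoneid:zonename:state:zonepath:...) and prints the state of zone NAME.
# The function name is hypothetical, not part of any Solaris tool.
zone_state() {
    awk -F: -v zone="$1" '$2 == zone { print $3 }'
}

# Hand-written sample in the parsable format, based on the listing above.
sample='0:global:running:/::native:shared
23:build-machine:down:/export/home/zones/build-machine::native:shared
-:build-1:installed:/export/home/zones/build-1::native:shared'

state=$(printf '%s\n' "$sample" | zone_state build-machine)
echo "build-machine is $state"
```

On the real server the pipeline would be zoneadm list -cp | zone_state build-machine, and a state of "down" is the cue to go looking for whatever is holding the zone's mounts.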

Well, that's that!

Tuesday, July 20, 2010

ClearCase Funny!

Funny as in peculiar!

This morning a user came up to me. He had tried to create a new view and received an error message.

I logged in as him and got the same error.
cleartool mkview -tag mtest /net/server/views/my_view.vws
cleartool: Error: Unable to contact albd_server on host 'server'
cleartool: Error: Cannot bind an admin_server handle on "server": ClearCase object not found.
cleartool: Error: Unable to create view "/net/server/views/my_view.vws".

...

I googled the error message and it showed up a whole bunch of pages with similar discussions to this IBM support document.

However that wasn't the problem.

For various reasons, the user had set the environment variable ATRIAHOME to what amounted to garbage for his shell. That value had then been used by the cleartool mkview command, with the appropriate error.
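A quick sanity check of the variable would have saved me the googling. The sketch below is only an illustration: the function name check_home_var is made up, and all it tests is whether the value, if set, names an existing directory. It knows nothing about what a valid ClearCase installation looks like.

```shell
#!/bin/sh
# check_home_var NAME VALUE
# Reports whether VALUE (the current setting of the variable NAME) is
# unset, an existing directory, or something else. Hypothetical helper.
check_home_var() {
    if [ -z "$2" ]; then
        echo "$1 is not set"
    elif [ -d "$2" ]; then
        echo "$1=$2 is a directory"
    else
        echo "$1=$2 is not a directory"
        return 1
    fi
}

# Check the current shell's setting; an unset variable is reported, not failed.
check_home_var ATRIAHOME "${ATRIAHOME:-}" || echo "ATRIAHOME looks like garbage"
```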

So it goes!

Tuesday, April 13, 2010

Blogging Tools

I've been running this blog with some of Blogger's tools for a while now.

Obviously there is Adwords, but those trawling through the HTML (but why would you these days?) will also pick up that I'm using Google Analytics.

Now I turned on Analytics some months ago and the results have been interesting. People are actually reading my blog. Quite a lot. Generally from the US, UK, Canada, Australia and Germany. I had wondered, because there were very few clicks on the Adwords. Perhaps I should not be surprised by that - I only rarely follow an Adwords link on another Blog. I've almost learnt to phase them out automatically. I think it is only the Adwords on GMail I really ever click. I'm beginning to wonder if anyone really clicks on Adwords links anymore.

Now, Google Analytics has really shown which posts have attracted, and continue to attract, attention. It is the VMware posts which attract the most attention. In fact, for the last month the top 5 are all VMware posts. The highest Solaris post is the one about changing hostids, and that is coming in at no. 7.

So it goes!

Tuesday, April 6, 2010

NIS+ master Backup script

This is the script I mentioned using in my earlier post about replacing the hardware of a NIS+ master.

Having had to use it now, perhaps the only other thing I would add into it would be the root crontab, but that would simply be:
crontab -l >> $log

Obviously, < server > is your server name, i.e. atuin or sunsvr01. And the < off server storage location > is a storage location that is easily accessible when you need it!

# more scripts/nisbackup.sh
#!/bin/sh

log="/var/tmp/nisbackup.log"
date=`date '+%m/%d/%y at %H:%M:%S'`

echo "Starting NIS+ backup on $date" > $log
cp -p /etc/.rootkey /var/nisplus.rootkey.copy
cp -p /var/nis/NIS_COLD_START /var/<server>-NIS_COLD_START
cp -p /etc/shadow /var/<server>-shadow
cp -p /etc/passwd /var/<server>-passwd
/usr/sbin/nisbackup -a /var/nisplus_backup

if [ $? -eq 0 ]; then
  cd /var
  echo "Listing nisplus_backup" >> $log
  ls -l /var/nisplus_backup >> $log
  echo "Creating tar" >> $log
  tar cvf /<off server storage location>/nisplus/backup/nisplusbackup.tar nisplus_backup nisplus.rootkey.copy <server>-NIS_COLD_START <server>-shadow <server>-passwd >> $log
  echo "Checking validity of tar:" >> $log
  tar tvf /<off server storage location>/nisplus/backup/nisplusbackup.tar >> $log
else
  echo "nisbackup failed!!!" >> $log
fi

/usr/bin/mailx -s "NIS+ backup" server_admins < $log

#
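The validity check in the script only lists the archive into the log. A slightly stricter variation, sketched below, fails if a named member is missing. The function name tar_has_members is hypothetical and is not part of the script above. It relies only on tar tf and awk.

```shell
#!/bin/sh
# tar_has_members ARCHIVE MEMBER...
# Succeeds only if every MEMBER appears in the archive listing, either as
# an exact entry or as a directory prefix (MEMBER/...). Hypothetical helper.
tar_has_members() {
    archive=$1
    shift
    listing=$(tar tf "$archive") || return 1
    for member in "$@"; do
        printf '%s\n' "$listing" | awk -v m="$member" '
            $0 == m || index($0, m "/") == 1 { found = 1 }
            END { exit !found }' || {
            echo "missing from $archive: $member" >&2
            return 1
        }
    done
}

# Example call, using the member names from the script above:
#   tar_has_members nisplusbackup.tar nisplus_backup nisplus.rootkey.copy
```

The return code could then decide whether the mail subject says the backup is good or not.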

Monday, April 5, 2010

New Solaris Resource

Whilst searching for information on a specific feature of IP-Filter, I came across a new resource of Solaris Information.

My only worry, other than that Solaris will wither on the vine now that Oracle is taking Sun over, is that a lot of this information may be quite old, if I were to judge the site solely from the Solaris logo!

I'm going to add the site to my useful links. And that's that!

Monday, March 22, 2010

Replacing the hardware of the NIS+ master

The file system on the NIS+ master had become corrupt.

Not the disks. The root fs and swap were both mirrored with disksuite. And metastat reported everything to be in order!

But files and directories all over the shop were "missing" - actually an I/O error was reported on the command line - and a quick look in the messages file revealed the ugly truth.

There are a couple of good sites out on the internet which describe how to recover from this sort of problem:
SUN
Solaris FAQ

Luckily, I have a script which runs several times a day. It copies off the passwd, shadow and .rootkey files from /etc and the NIS_COLD_START file from /var/nis, executes a nisbackup -a command and tars the whole lot up into a file on a file server. The frequency with which this script is run depends upon the TTL of the domain. My domain still has the default of 12 hours, so in theory the script only needs to run twice a day. But 3 or 4 times would be better. I'm not sure it is worthwhile keeping all of these, but the last couple, maybe.

As luck would have it, the server crashed while running this script, just after writing the tarfile to the fileserver. The script also tests the validity of the backup by extracting all the files to a temporary directory. I guess the corrupt filesystem just decided it couldn't handle that.

I grabbed hold of an old Sun Blade 150 that was in the store cupboard for just such an eventuality, and changed its identity to be the same as the failed NIS+ master. I changed
/etc/hosts
/etc/hostname.eri0
/etc/nodename
/etc/net/ticlts/hosts
/etc/net/ticots/hosts
/etc/net/ticotsord/hosts
and entered hostname

I ftp-ed my tarfile backup into /tmp and untar-ed it.

I copied the passwd, shadow and .rootkey files into /etc overwriting any existing files.

And then I entered nisrestore -f -a /tmp/nisplus_backup ignoring any output.

And then I shut down the server, moved it into the racks and restarted it.

As a test on the new server when it was back up, I ran nisping -C -a and also ran the script which backs up all the data.

The backup command failed!

Aarrgghh!

Luckily the problem was clear from the messages file. The directory you tell nisbackup to write to must exist before you enter the command.
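One way to stop that biting again is to have the script create the directory first. This is only a sketch: run_backup_into is a name I have made up, and it simply appends the directory as the last argument of whatever command it is given.

```shell
#!/bin/sh
# run_backup_into DIR COMMAND [ARGS...]
# Creates DIR if it does not exist, then runs COMMAND ARGS... DIR.
# Hypothetical wrapper, not part of the NIS+ tools.
run_backup_into() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    "$@" "$dir"
}

# On the NIS+ master the call would look like:
#   run_backup_into /var/nisplus_backup /usr/sbin/nisbackup -a
```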

Phew!

And that's that!

Sunday, March 21, 2010

Symantec EndPoint Protection

My company uses Symantec Endpoint Protection on all the Windows servers. I've known for some time that there was a Linux client, but over the last week Nessus security scans were run against some really old legacy Solaris servers, the Linux servers and also the Windows servers.

Now the Windows servers were protected by EndPoint and received a clean bill of health.

The Linux servers all have iptables firewalls and SELinux in enforcing mode, and so generated a few false positives, but were generally clean. The worst was that a few web servers hadn't had the TraceEnable Off parameter added to their configuration.

The Solaris servers fared worse. Simply due to their age and the fact that their purpose had been in a development environment.

The thing about EndPoint which I hadn't previously realised was that it detected attempted intrusions and refused further connections from the hosts originating the attacks. In this way it seemed to be operating much like one of the modes that it was possible to configure into PortSentry. (It is really surprising to think that the last release of PortSentry is almost seven years old now!) Consequently, I began lobbying for additional budget to purchase licences for the additional platforms.

The ability to have a single "management station" control the security protection across a heterogeneous server environment is incredible.

That's that for now!

Saturday, March 20, 2010

Replacing a disk in a Sun D2 Array

The Sun D2 Array is hot swap capable. As we control the D2s using Disksuite, a.k.a. SVM, it is possible to replace a disk in a D2 without requiring that the server be shut down.

When you receive notification that a disk has died, first run metastat to determine which disk has been swapped out:
# metastat
d0: Mirror
    Submirror 0: d10
      State: Okay        
    Submirror 1: d20
      State: Okay        
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 18876726 blocks


d10: Submirror of d0
    State: Okay        
    Size: 18876726 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c1t0d0s0                   0     No    Okay        




d20: Submirror of d0
    State: Okay        
    Size: 18876726 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c1t1d0s0                   0     No    Okay        




d1: Mirror
    Submirror 0: d11
      State: Okay        
    Submirror 1: d21
      State: Okay        
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 33555735 blocks


d11: Submirror of d1
    State: Okay        
    Size: 33555735 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c1t0d0s1                   0     No    Okay        




d21: Submirror of d1
    State: Okay        
    Size: 33555735 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c1t1d0s1                   0     No    Okay        




d3: Mirror
    Submirror 0: d30
      State: Okay        
    Submirror 1: d31
      State: Okay        
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 355604121 blocks


d30: Submirror of d3
    State: Okay        
    Hot spare pool: hsp000
    Size: 355604121 blocks
    Stripe 0: (interlace: 64 blocks)
        Device              Start Block  Dbase State        Hot Spare
        c2t0d0s0                   0     No    Okay        
        c2t1d0s0                2889     No    Okay        
        c2t2d0s0                2889     No    Okay        
        c2t3d0s0                2889     No    Okay        
        c2t4d0s0                2889     No    Okay        




d31: Submirror of d3
    State: Okay        
    Hot spare pool: hsp000
    Size: 355604121 blocks
    Stripe 0: (interlace: 64 blocks)
        Device              Start Block  Dbase State        Hot Spare
        c2t5d0s0                   0     No    Okay         c2t12d0s0
        c2t8d0s0                2889     No    Okay        
        c2t9d0s0                2889     No    Okay        
        c2t10d0s0               2889     No    Okay        
        c2t11d0s0               2889     No    Okay        




hsp000: 2 hot spares
        c2t12d0s0               In use          71124291 blocks
        c2t13d0s0               Available       71124291 blocks


#

OK, c2t5d0s0 is the sixth disk from the left in the array - or to put that another way it is the left one of the two middle disks!
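In a listing that long, the only clue is the entry in the Hot Spare column. A small filter can pick it out. This is a sketch under assumptions: the function name hot_spared is invented, and it assumes component lines have exactly the five columns shown above (Device, Start Block, Dbase, State, Hot Spare). On Solaris, nawk may be needed in place of awk.

```shell
#!/bin/sh
# hot_spared
# Reads metastat output on stdin and prints each component slice that has
# a hot spare standing in for it. Component lines are recognised by a
# cXtXdXsX device name followed by a numeric start block; a fifth column,
# when present, is the hot spare. Hypothetical helper.
hot_spared() {
    awk '$1 ~ /^c[0-9]+t[0-9]+d[0-9]+s[0-9]+$/ && $2 ~ /^[0-9]+$/ && NF == 5 {
        print $1 " is covered by hot spare " $5
    }'
}

# A few lines copied from the metastat output above.
sample='        c2t5d0s0                   0     No    Okay         c2t12d0s0
        c2t8d0s0                2889     No    Okay
hsp000: 2 hot spares
        c2t12d0s0               In use          71124291 blocks
        c2t13d0s0               Available       71124291 blocks'

printf '%s\n' "$sample" | hot_spared
```

On the server itself this would simply be metastat | hot_spared.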

You can run format->analyse->read which will determine whether the disk is really dead.
# format
Searching for disks...done




AVAILABLE DISK SELECTIONS:
       0. c1t0d0
          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w21000004cffde452,0
       1. c1t1d0

          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w21000004cffde367,0
       2. c2t0d0

          /pci@8,600000/pci@1/scsi@4/sd@0,0
       3. c2t1d0

          /pci@8,600000/pci@1/scsi@4/sd@1,0
       4. c2t2d0

          /pci@8,600000/pci@1/scsi@4/sd@2,0
       5. c2t3d0

          /pci@8,600000/pci@1/scsi@4/sd@3,0
       6. c2t4d0

          /pci@8,600000/pci@1/scsi@4/sd@4,0
       7. c2t5d0

          /pci@8,600000/pci@1/scsi@4/sd@5,0
       8. c2t8d0

          /pci@8,600000/pci@1/scsi@4/sd@8,0
       9. c2t9d0

          /pci@8,600000/pci@1/scsi@4/sd@9,0
      10. c2t10d0

          /pci@8,600000/pci@1/scsi@4/sd@a,0
      11. c2t11d0

          /pci@8,600000/pci@1/scsi@4/sd@b,0
      12. c2t12d0

          /pci@8,600000/pci@1/scsi@4/sd@c,0
      13. c2t13d0

          /pci@8,600000/pci@1/scsi@4/sd@d,0
Specify disk (enter its number): 7
selecting c2t5d0
[disk formatted]


FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> anal


ANALYZE MENU:
        read     - read only test   (doesn't harm SunOS)
        refresh  - read then write  (doesn't harm data)
        test     - pattern testing  (doesn't harm data)
        write    - write then read      (corrupts data)
        compare  - write, read, compare (corrupts data)
        purge    - write, read, write   (corrupts data)
        verify   - write entire disk, then verify (corrupts data)
        print    - display data buffer
        setup    - set analysis parameters
        config   - show analysis parameters
        !<cmd>   - execute <cmd>, then return
        quit
analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? y

        pass 0
   24619/26/53 

        pass 1
   24619/26/53 

Total of 0 defective blocks repaired.
analyze> q


FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> q
#

The excerpt above didn't show any repairs, but there may be some when you run the command.



Raise a call for the new disk under your support contract. Or failing that search the server room for a compatible disk - Good Luck!

As long as a hot spare has jumped in, the dead disk can just be removed from the array and the new one inserted. The disk will spin up immediately.

Wait for the green light to come on. And Bob's your uncle!

Run format to apply a disk label. You won't be able to format the disk until you do.
# format
Searching for disks...done

c2t5d0: configured with capacity of 33.92GB


AVAILABLE DISK SELECTIONS:
       0. c1t0d0

          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w21000004cffde452,0
       1. c1t1d0

          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w21000004cffde367,0
       2. c2t0d0

          /pci@8,600000/pci@1/scsi@4/sd@0,0
       3. c2t1d0

          /pci@8,600000/pci@1/scsi@4/sd@1,0
       4. c2t2d0

          /pci@8,600000/pci@1/scsi@4/sd@2,0
       5. c2t3d0

          /pci@8,600000/pci@1/scsi@4/sd@3,0
       6. c2t4d0

          /pci@8,600000/pci@1/scsi@4/sd@4,0
       7. c2t5d0

          /pci@8,600000/pci@1/scsi@4/sd@5,0
       8. c2t8d0

          /pci@8,600000/pci@1/scsi@4/sd@8,0
       9. c2t9d0

          /pci@8,600000/pci@1/scsi@4/sd@9,0
      10. c2t10d0

          /pci@8,600000/pci@1/scsi@4/sd@a,0
      11. c2t11d0

          /pci@8,600000/pci@1/scsi@4/sd@b,0
      12. c2t12d0

          /pci@8,600000/pci@1/scsi@4/sd@c,0
      13. c2t13d0

          /pci@8,600000/pci@1/scsi@4/sd@d,0
Specify disk (enter its number): 7
selecting c2t5d0
[disk formatted]
Disk not labeled.  Label it now? y


FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> q
#


Enter a prtvtoc/fmthard command combination to ensure the disk has the same slices as the replaced disk. In the command below I use the disk that is in the equivalent position on the other side of the mirror as the source of the configuration.
# prtvtoc /dev/rdsk/c2t0d0s0 | fmthard -s - /dev/rdsk/c2t5d0s0
fmthard:  New volume table of contents now in place.
#
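If you want more reassurance than the fmthard message, the two tables can be compared. The sketch below assumes that the comment lines in prtvtoc output start with an asterisk, and that those comments (which name the device) are the only lines expected to differ. The function names vtoc_rows and same_vtoc are my own.

```shell
#!/bin/sh
# vtoc_rows FILE
# Prints the slice rows of a saved prtvtoc listing, dropping the comment
# lines (those beginning with "*"). Hypothetical helper.
vtoc_rows() {
    grep -v '^\*' "$1"
}

# same_vtoc FILE1 FILE2
# Succeeds if the two saved listings describe identical slices.
same_vtoc() {
    [ "$(vtoc_rows "$1")" = "$(vtoc_rows "$2")" ]
}

# Example use on the server:
#   prtvtoc /dev/rdsk/c2t0d0s0 > /tmp/source.vtoc
#   prtvtoc /dev/rdsk/c2t5d0s0 > /tmp/new.vtoc
#   same_vtoc /tmp/source.vtoc /tmp/new.vtoc && echo "slices match"
```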



Check the disk format:
# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c1t0d0

          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w21000004cffde452,0
       1. c1t1d0

          /pci@9,600000/SUNW,qlc@2/fp@0,0/ssd@w21000004cffde367,0
       2. c2t0d0

          /pci@8,600000/pci@1/scsi@4/sd@0,0
       3. c2t1d0

          /pci@8,600000/pci@1/scsi@4/sd@1,0
       4. c2t2d0

          /pci@8,600000/pci@1/scsi@4/sd@2,0
       5. c2t3d0

          /pci@8,600000/pci@1/scsi@4/sd@3,0
       6. c2t4d0

          /pci@8,600000/pci@1/scsi@4/sd@4,0
       7. c2t5d0

          /pci@8,600000/pci@1/scsi@4/sd@5,0
       8. c2t8d0

          /pci@8,600000/pci@1/scsi@4/sd@8,0
       9. c2t9d0

          /pci@8,600000/pci@1/scsi@4/sd@9,0
      10. c2t10d0

          /pci@8,600000/pci@1/scsi@4/sd@a,0
      11. c2t11d0

          /pci@8,600000/pci@1/scsi@4/sd@b,0
      12. c2t12d0

          /pci@8,600000/pci@1/scsi@4/sd@c,0
      13. c2t13d0

          /pci@8,600000/pci@1/scsi@4/sd@d,0
Specify disk (enter its number): 7
selecting c2t5d0
[disk formatted]


FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> p


PARTITION MENU:
        0      - change `0' partition
        1      - change `1' partition
        2      - change `2' partition
        3      - change `3' partition
        4      - change `4' partition
        5      - change `5' partition
        6      - change `6' partition
        7      - change `7' partition
        select - select a predefined table
        modify - modify a predefined partition table
        name   - name the current table
        print  - display the current table
        label  - write partition map and label to the disk
        !<cmd> - execute <cmd>, then return
        quit
partition> p
Current partition table (original):
Total disk cylinders available: 24620 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       0 - 24618       33.91GB    (24619/0/0) 71124291
  1 unassigned    wu       0                0         (0/0/0)            0
  2     backup    wu       0 - 24619       33.92GB    (24620/0/0) 71127180
  3 unassigned    wu       0                0         (0/0/0)            0
  4 unassigned    wu       0                0         (0/0/0)            0
  5 unassigned    wu       0                0         (0/0/0)            0
  6 unassigned    wu       0                0         (0/0/0)            0
  7 unassigned    wm   24619 - 24619        1.41MB    (1/0/0)         2889

partition> q


FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> q

#


Replace the hot spare with the new disk in the original position. Note that the replacement command works on the mirror, not on the submirror which actually has the failed disk!
# metareplace -e d3 c2t5d0s0
#


Use metastat | grep to check when the mirror has finished resync-ing
#  metareplace -e d3 c2t5d0s0
d3: device c2t5d0s0 is enabled
# metastat
d0: Mirror
    Submirror 0: d10
      State: Okay        
    Submirror 1: d20
      State: Okay        
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 18876726 blocks

d10: Submirror of d0
    State: Okay        
    Size: 18876726 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c1t0d0s0                   0     No    Okay        


d20: Submirror of d0
    State: Okay        
    Size: 18876726 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c1t1d0s0                   0     No    Okay        


d1: Mirror
    Submirror 0: d11
      State: Okay        
    Submirror 1: d21
      State: Okay        
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 33555735 blocks

d11: Submirror of d1
    State: Okay        
    Size: 33555735 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c1t0d0s1                   0     No    Okay        


d21: Submirror of d1
    State: Okay        
    Size: 33555735 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c1t1d0s1                   0     No    Okay        


d3: Mirror
    Submirror 0: d30
      State: Okay        
    Submirror 1: d31
      State: Resyncing   
    Resync in progress: 0 % done
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 355604121 blocks

d30: Submirror of d3
    State: Okay        
    Hot spare pool: hsp000
    Size: 355604121 blocks
    Stripe 0: (interlace: 64 blocks)
        Device              Start Block  Dbase State        Hot Spare
        c2t0d0s0                   0     No    Okay        
        c2t1d0s0                2889     No    Okay        
        c2t2d0s0                2889     No    Okay        
        c2t3d0s0                2889     No    Okay        
        c2t4d0s0                2889     No    Okay        


d31: Submirror of d3
    State: Resyncing   
    Hot spare pool: hsp000
    Size: 355604121 blocks
    Stripe 0: (interlace: 64 blocks)
        Device              Start Block  Dbase State        Hot Spare
        c2t5d0s0                   0     No    Resyncing   
        c2t8d0s0                2889     No    Okay        
        c2t9d0s0                2889     No    Okay        
        c2t10d0s0               2889     No    Okay        
        c2t11d0s0               2889     No    Okay        


hsp000: 2 hot spares
        c2t12d0s0               Available       71124291 blocks
        c2t13d0s0               Available       71124291 blocks

# metastat d3 | grep "Resync in progress"
    Resync in progress: 5 % done
# metastat d3 | grep "Resync in progress"
    Resync in progress: 5 % done
# metastat d3 | grep "Resync in progress"
    Resync in progress: 6 % done
# metastat d3 | grep "Resync in progress"
    Resync in progress: 6 % done
# metastat d3 | grep "Resync in progress"
    Resync in progress: 8 % done
# metastat d3 | grep "Resync in progress"
    Resync in progress: 51 % done
# metastat d3 | grep "Resync in progress"
    Resync in progress: 60 % done
#



Just repeat that last metastat command until you get no output and the resync-ing will have completed.
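If you would rather not sit there pressing the up arrow, the repetition can be left to a loop. This is a sketch only: wait_for_resync is a made-up name, and it treats "no matching output" as "finished", exactly as described above.

```shell
#!/bin/sh
# wait_for_resync SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly, printing any "Resync in progress" line it
# produces and sleeping SECONDS between runs, until no such line appears.
# Hypothetical helper.
wait_for_resync() {
    interval=$1
    shift
    while "$@" | grep "Resync in progress"; do
        sleep "$interval"
    done
}

# On the server, checking the d3 mirror once a minute:
#   wait_for_resync 60 metastat d3
```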


And that's that!

Sunday, March 14, 2010

Further Good Links/Tools

In a previous post, I described some tools that I had found useful for documenting the system environment.

A pretty comprehensive system description of many versions of the Windows OS can be generated by SIW - System Information for Windows, which was written by Gabriel Topala. It is only free for personal use. The pricing model for businesses is in the right sort of ball park.

If you are a Legato Networker user and you need to generate reports for either your own benefit or the benefit of others, then the Networker Reporting Utility is well worth a look. The download mechanism is rather painful, but it is worthwhile. There is a product from Legato which does a similar job, but for our environment we were looking at having to spend upwards of £30,000 to £40,000! That decision was pretty much a no-brainer!

And that's that for now!

Friday, March 12, 2010

What a PITA!!

This week I've been configuring a couple of new HP servers, ProLiant DL 360 G6s to be precise. They were configured to have 8 internal disks so the internal CD/DVD-ROM had to be sacrificed.

So to configure them, I temporarily attached a USB CD-ROM drive, keyboard and mouse and attached a monitor.

Having installed the latest CentOS x86_64 Linux from CD on the first one, I removed the CD-ROM and rebooted. It hung coming up configuring the USB storage driver!

So I spent a day googling for help. I re-installed Linux half a dozen or so times. I downloaded the DVD and followed the instructions to make that available via both HTTP and NFS and tried using the first CD to boot from and then installing across the network. Actually that was considerably faster than the CD method!


Nothing changed the hang during boot. Always on the USB storage driver installation.

As a punt with nothing else to lose, I unplugged the USB keyboard and mouse and rebooted.

The b*$#*^d box booted all the way.

I plugged the keyboard in and it was recognized without fuss and was immediately useful.

So it goes! Some days are just like that.

Sunday, February 21, 2010

Disaster Recovery 101 Part 2

So in part 1, I promised to discuss some of the "soft" requirements to be considered when preparing for a recovery from disaster.


The following documentation should be held electronically. These days it is probably more difficult not to hold documentation electronically.


Documentation of your hardware maintenance contracts
The purpose of this is not to fix your lost equipment, but to immediately take the lost servers off the contract and to later update it with their replacements. This might not seem like a priority, but the larger the company, the larger the cost saving from performing this action. Should you manage to recover some of your servers, you'll need to know this information to raise support calls.

This action presupposes that you have negotiated your contract so that you can add and remove items during the life of the contract. If you haven't already, you should start doing so from your next renewal date.


Your policies & procedures.
Some might argue that a clean slate is the perfect opportunity to start again. As creating these documents isn't the most fun activity in the known universe, I have some sympathy with this idea. However it really should be resisted. It will have taken you some considerable time to develop those policies & procedures. Some policies might need tweaking, some might need obsoleting, but they are a gold mine of information about your environment.


Good Supplier Relationships
Another "soft" requirement. If it isn't obvious why this is a requirement for disaster recovery, consider that on that Sunday morning I mentioned in part 1 that by 11:00am one of our two main hardware suppliers had:
  • opened their offices
  • provided us with internet access
  • provided us with hardware that had been purchased by and for someone else!
  • provided us with lab space; phones; electricity; etc
and were arranging with their security company to allow us to stay through the night whilst we worked at building and recovering our backup system from the backup tapes and some installation media.

(Obviously, we did later pay for the hardware. Whether the original purchasers were ever told, I do not know.)

Of course, a good supplier relationship is not something that can be magic-ed out of a hat first thing on the morning of your disaster. Good supplier relationships are an ongoing concern. That doesn't mean that you overpay for goods and services. That isn't a good relationship. That is being a doormat. It also doesn't mean screwing them over on every deal. It does mean being open with them, and working with them over a long time so that they understand your requirements: that not every quote leads to a purchase, and that company rules require that you get quotes from other suppliers too!


Offsite storage for the backup tapes and all the other documentation above
Offsite storage for your backup tapes is fairly standard. But how frequently do the tapes go offsite? It needs to be daily during the week! If your company is large enough to be able to afford weekend shifts, then you might also want to investigate weekend pickups as well.

Every month on the first day of the month my admin server sends me an email. That email reminds me to burn all the latest documentation from the intranet site down onto a DVD. That DVD then stays in my laptop bag until the next month.

Some of the documentation is stored in a number of Lotus Notes databases. These are even more ideal from a DR perspective. (It is a shame that IBM have made a hash of marketing Lotus Notes - some of its features are ideal for enterprises of any size. But that is a story for another blog post.) You can just make a local replica of the database onto your PC. Whatever the contents of that database. And it can be kept in step via replication, which can be as often as you like. Or never after the initial replication.

At a time when my company operated a campus of multiple buildings, and indeed multiple sites, a firesafe in one of the other buildings was considered offsite.


How to recover your environment.
Given your backup tapes and an empty room, would you know where to start/how to start to rebuild your environment? This is a question Joel Spolsky covered quite cogently in a post just before Christmas. Doing the backup is part of the bread and butter of the job. But so should be the restore.

Even with all the knowledge of your environment that you should have documented, the answer to the question of which servers to restore first will be similar to the start-up order of your datacentre. Similar, but unlikely to be exactly the same.

Of course, the shutdown and startup orders will be part of the documentation listed under "Description of Inter-relationships" described in part 1.
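
If those inter-relationships are written down as "server X depends on servers Y and Z", a first draft of the restore order can be worked out mechanically. What follows is only a minimal sketch in Python. The server names and dependencies are made up for illustration, and a real restore order will still need a human to adjust it (as I said, similar to the start-up order, but unlikely to be exactly the same).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: each server maps to the servers it depends on.
# These names are placeholders for illustration, not a real environment.
DEPENDS_ON = {
    "dns": [],
    "directory": ["dns"],
    "backup-server": ["dns"],
    "database": ["dns", "directory"],
    "mail-relay": ["dns", "directory"],
    "payroll": ["database", "mail-relay"],
}


def restore_order(depends_on):
    """Return an order in which every server comes after the servers it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for position, server in enumerate(restore_order(DEPENDS_ON), start=1):
        print(position, server)
```

If two servers depend on each other, the sorter raises a CycleError, which is itself worth knowing about before the day you need to restore them.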

Most "old lags" in IT will have a good idea of which systems need to be restored first and how to do so. Hopefully, there will have been exercises over time in how to restore individual servers and systems.

In part 1 of this series of posts on disaster recovery, I listed some/most of the information you should keep for each server. However, I missed some items out:
It is just about essential to list what is backed up & how to recover the server with that dataset.
And when you have recovered your server, how do you know it is recovered? How can you prove it has been recovered successfully?

Document a series of tests that will exercise the functionality of the server/system fully or at least to some acceptable (to you and justifiable to others) level of completeness. Generally this information is referred to as return to service information (RTS).
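
One way to keep RTS information honest is to write the tests down as something that can be run. Below is a minimal sketch in Python, assuming each test can be expressed as a small function that returns true or false. The function names are my own invention for this example, and a TCP port check is only one kind of test. Real RTS tests would exercise the application itself.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_rts_checks(checks):
    """Run each (description, check) pair and return a list of (description, passed).

    A check that raises an exception is recorded as failed rather than
    stopping the run, so one broken test does not hide the others.
    """
    results = []
    for description, check in checks:
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results.append((description, passed))
    return results


def all_passed(results):
    """Return True only if every check in the results passed."""
    return all(passed for _, passed in results)
```

A check list for a mail relay might then contain an entry such as ("SMTP port answers", lambda: port_is_open("mailrelay.example.com", 25)), where the host name is a placeholder. Saving the printed results alongside the server's documentation gives you something to show when asked to prove the recovery worked.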


Knowledge of the company's insurance policy
This might not be regarded as an IT responsibility. In some companies this might be a site services or facilities management responsibility or a Financial or Legal Dept. concern. In smaller companies, the office manager might be responsible.
In fact, I'd agree it isn't an IT responsibility, or shouldn't be. But if you are responsible for the company's IT infrastructure, you should make yourself aware of whether your company actually has Critical Incident Insurance or whether your company is large enough to carry the risk itself.

The answer will help you prepare. If the replacement cost of your infrastructure is US$2 million and your company has no insurance, then the business should know that up to that amount will have to be found/provided in a disaster.

If the company does have insurance, then it is necessary to keep that policy up to date with the value of the company's infrastructure.
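
The arithmetic is simple enough to keep next to your server inventory. A minimal sketch in Python follows. The inventory figures are invented for illustration, and the insured value would come from whoever holds the policy.

```python
def coverage_shortfall(replacement_costs, insured_value):
    """Return how much of the total replacement cost is not covered by insurance.

    replacement_costs maps an item name to its replacement cost.
    insured_value is the amount the policy would pay (0 if uninsured).
    The result is 0 when the policy covers the full replacement cost.
    """
    total = sum(replacement_costs.values())
    return max(0, total - insured_value)


if __name__ == "__main__":
    # Hypothetical figures in US dollars, for illustration only.
    inventory = {
        "servers": 1_200_000,
        "storage": 500_000,
        "network": 300_000,
    }
    print(coverage_shortfall(inventory, insured_value=0))
    print(coverage_shortfall(inventory, insured_value=1_500_000))
```

Re-running this whenever the inventory changes is a cheap way of noticing that the policy has fallen behind the infrastructure.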


Multiple sites
In theory, having multiple sites should enable you to provide resilience through replication of information to the other sites. Whether you implement replication depends upon your level of risk and the budget available to you.

But it is possible to mitigate a lot of risk through data replication between sites. At one stage, there were only the UNIX utilities rdist and then rsync to accomplish the task. But they work at the file level. Then a lot of companies worked out how to accomplish this task at the block level. NetApp were possibly the first, or the first I was aware of anyway. But it now appears to be a common facility in every vendor's repertoire.

Both free and paid-for DBMSs offer varieties of replication: master/slave and master/master. One of the best database replication mechanisms seems to be that used by Lotus Notes. But Lotus Notes isn't suitable for all applications. Plus IBM doesn't seem to have known how to market it. Actually IBM frequently doesn't appear to know how to market anything. Anyway, it should be possible to set up your database applications to be location independent.


Well, that was part 2. After part 1, I stated there would be an additional two parts. Whilst finalising this part, I realised there were some issues I had overlooked. There may well be a part 4. It depends upon how part 3 goes.

Saturday, February 20, 2010

Downsides of Virtualisation?

This post possibly falls into the "stating the bleeding obvious" category! However...

Reducing server count isn't just about reducing the amount of hardware. But if you have virtualised a bunch of servers, will you be so keen to look into combining their functions?
There still can be value in combining the functions of two virtual servers. There are the management benefits of fewer servers to manage and if you have software that is licensed on a per server basis, then there will be financial benefits too.

Eventually, you will develop an architectural dependence on servers that are virtualised. If your mail relay is virtualised and your payroll system isn't, then should your payroll system need to send email, a lot of the infrastructure needed for your virtualisation environment will need to be up and running too.
In most cases the virtual environment will be up and running. However, there are periods when you may be recovering from a disaster when that isn't the case.

Are these reasons not to virtualise?

Of course not!
But they are factors to consider and factor in to your decision making.

Tuesday, February 16, 2010

Some good links

I came across a blog just today by following a link on Scott Lowe's bookmarks: Linux Performance Tuning

This is really useful information. All pulled together in one place. Some of it might seem common sense, but as has frequently been said it is surprising how uncommon common sense can be.


As I wrote in my previous post about recovering from a disaster, documentation on your environment is vital. And these two links below provide that.


I've only just noticed that RVTools has been considerably updated. This is an excellent tool and the more so for being free. I feel a bit guilty about not being in a position to donate some of my employer's money via PayPal.

Actually, the new feature that I'm most excited by is one that has been there for several versions now. Specifically, the ability to export all the information in csv format. And to do that from the command line!


I was browsing sunfreeware and came across sys_diag, a script written by Todd Jobson. For documenting the state of a Solaris server, it looks damn comprehensive. I intend to use it on the Solaris systems at my work and save the results into our database of server information.