
Replacing a failed bootdisk


In the following example, the host has a failed bootdisk (c0t0d0). Fortunately, the system is using DiskSuite, with a mirror at c0t1d0. The following sequence of steps can be used to restore the system to full redundancy.


  • System fails to boot 
  • Boot from mirror 
  • Check extent of failures 
  • Replace failed disk and restore redundancy 




System fails to boot


When the system attempts to boot, it cannot find a valid device at the path that boot-device resolves to through the device alias "disk". It then falls back to booting from the network:


screen not found.

Can't open input device.

Keyboard not present.  Using ttya for input and output.


Sun Ultra 30 UPA/PCI (UltraSPARC-II 296MHz), No Keyboard

OpenBoot 3.27, 512 MB memory installed, Serial #9377973.

Ethernet address 8:0:20:8f:18:b5, Host ID: 808f18b5.




Initializing Memory

Timeout waiting for ARP/RARP packet

Timeout waiting for ARP/RARP packet

Timeout waiting for ARP/RARP packet

...

                 


Boot from mirror


At this point, the administrator realizes that the boot disk has failed and queries the device aliases to find the one corresponding to the DiskSuite mirror:


ok devalias

mirrordisk               /pci@1f,4000/scsi@3/disk@1,0

bootdisk                 /pci@1f,4000/scsi@3/disk@0,0

net                      /pci@1f,4000/network@1,1

disk                     /pci@1f,4000/scsi@3/disk@0,0

cdrom                    /pci@1f,4000/scsi@3/disk@6,0:f

...

                 

The administrator then boots the system from the mirror, using the device alias "mirrordisk":


ok boot mirrordisk

                 

The system starts booting from the mirror. However, only two of the original four state database replicas are available, so a quorum is not achieved: DiskSuite needs more than half of the replicas to boot unattended. The system therefore drops to maintenance mode, where the two failed state database replicas must be removed by hand, as the console output later in this section shows.


Note: Starting with DiskSuite 4.2.1, an optional /etc/system parameter allows DiskSuite to boot with just 50% of the state database replicas online. For example, if one of the two boot disks were to fail, just two of the four state database replicas would be available. Without this /etc/system parameter (or with older versions of DiskSuite), the system would complain of "insufficient state database replicas", and manual intervention would be required on bootup. To enable the "50% boot" behaviour with DiskSuite 4.2.1, execute the following command:

# echo "set md:mirrored_root_flag=1" >> /etc/system
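The quorum rule described above can be written down as a small check. This is only a sketch of the rule as stated in this article, not DiskSuite's own logic: the function name can_boot_unattended and its arguments are hypothetical, and it assumes a POSIX shell (ksh or /usr/xpg4/bin/sh on Solaris) rather than the legacy Bourne shell.

```shell
# Sketch of the replica quorum rule; names are illustrative, not part of DiskSuite.
# usage: can_boot_unattended TOTAL AVAILABLE [MIRRORED_ROOT_FLAG]
# Returns 0 if the boot would proceed without manual intervention, 1 otherwise.
can_boot_unattended() {
    total=$1
    avail=$2
    flag=${3:-0}            # value of md:mirrored_root_flag, 0 if unset

    # A strict majority of replicas is always sufficient.
    if [ $((avail * 2)) -gt "$total" ]; then
        return 0
    fi

    # With the flag set, exactly half of the replicas is also accepted.
    if [ "$flag" -eq 1 ] && [ "$avail" -gt 0 ] && [ $((avail * 2)) -eq "$total" ]; then
        return 0
    fi

    return 1
}
```

For the host in this example, can_boot_unattended 4 2 fails, while can_boot_unattended 4 2 1 succeeds, which matches the behaviour the note describes.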


Boot device: /pci@1f,4000/scsi@3/disk@1,0  File and args:

SunOS Release 5.8 Version Generic_108528-07 64-bit

Copyright 1983-2001 Sun Microsystems, Inc.  All rights reserved.

WARNING: md: d10: /dev/dsk/c0t0d0s0 needs maintenance

WARNING: forceload of misc/md_trans failed

WARNING: forceload of misc/md_raid failed

WARNING: forceload of misc/md_hotspares failed

configuring IPv4 interfaces: hme0.

Hostname: pegasus

metainit: pegasus: stale databases


Insufficient metadevice database replicas located.


Use metadb to delete databases which are broken.

Ignore any "Read-only file system" error messages.

Reboot the system when finished to reload the metadevice database.

After reboot, repair any broken database replicas which were deleted.


Type control-d to proceed with normal startup,

(or give root password for system maintenance): ******


single-user privilege assigned to /dev/console.

Entering System Maintenance Mode


Oct 17 19:11:29 su: 'su root' succeeded for root on /dev/console

Sun Microsystems Inc.   SunOS 5.8       Generic February 2000


# metadb -i

        flags           first blk       block count

    M     p             unknown         unknown         /dev/dsk/c0t0d0s5

    M     p             unknown         unknown         /dev/dsk/c0t0d0s6

     a m  p  lu         16              1034            /dev/dsk/c0t1d0s5

     a    p  l          16              1034            /dev/dsk/c0t1d0s6

o - replica active prior to last mddb configuration change

u - replica is up to date

l - locator for this replica was read successfully

c - replica's location was in /etc/lvm/mddb.cf

p - replica's location was patched in kernel

m - replica is master, this is replica selected as input

W - replica has device write errors

a - replica is active, commits are occurring to this replica

M - replica had problem with master blocks

D - replica had problem with data blocks

F - replica had format problems

S - replica is too small to hold current data base

R - replica had device read errors



# metadb -d c0t0d0s5 c0t0d0s6

metadb: pegasus: /etc/lvm/mddb.cf.new: Read-only file system


# metadb -i

        flags           first blk       block count

     a m  p  lu         16              1034            /dev/dsk/c0t1d0s5

     a    p  l          16              1034            /dev/dsk/c0t1d0s6

o - replica active prior to last mddb configuration change

u - replica is up to date

l - locator for this replica was read successfully

c - replica's location was in /etc/lvm/mddb.cf

p - replica's location was patched in kernel

m - replica is master, this is replica selected as input

W - replica has device write errors

a - replica is active, commits are occurring to this replica

M - replica had problem with master blocks

D - replica had problem with data blocks

F - replica had format problems

S - replica is too small to hold current data base

R - replica had device read errors


# reboot -- sds-mirror
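The replicas deleted above were picked out by eye from the metadb -i listing: they are the ones carrying an upper-case error flag (W, M, D, F, S or R). A minimal sketch of the same selection done with awk follows. The function name failed_replicas is hypothetical, and the parsing assumes the column layout shown in this transcript (flags, then first block, block count and device path as the last three fields).

```shell
# Sketch: list replica slices whose flags include an error letter.
# Reads "metadb -i" output on stdin; prints one slice name per line.
failed_replicas() {
    awk '$NF ~ /^\/dev\/dsk\// {
        # Everything before the last three fields is the flags column.
        flags = ""
        for (i = 1; i <= NF - 3; i++) flags = flags $i
        if (flags ~ /[WMDFSR]/) {
            name = $NF
            sub(/^\/dev\/dsk\//, "", name)
            print name
        }
    }'
}

# Typical use (prints the slices, does not delete anything):
#   metadb -i | failed_replicas
```

The output can be reviewed and then passed to metadb -d by hand. Keeping the deletion a manual step is deliberate, since removing the wrong replicas can leave the system without a usable state database.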


                 


Check extent of failures


Once the reboot is complete, the administrator logs into the system and checks the status of the DiskSuite metadevices with metastat. Not only have the state database replicas failed, but every submirror located on c0t0d0 (d10, d11 and d14) is in the "Needs maintenance" state. Clearly the disk has failed completely.


pegasus console login: root

Password:  ******

Oct 17 19:14:03 pegasus login: ROOT LOGIN /dev/console

Last login: Thu Oct 17 19:02:42 from rambler.wakefie

Sun Microsystems Inc.   SunOS 5.8       Generic February 2000


# metastat

d0: Mirror

    Submirror 0: d10

      State: Needs maintenance

    Submirror 1: d20

      State: Okay        

    Pass: 1

    Read option: roundrobin (default)

    Write option: parallel (default)

    Size: 13423200 blocks


d10: Submirror of d0

    State: Needs maintenance

    Invoke: metareplace d0 c0t0d0s0 <new device>

    Size: 13423200 blocks

    Stripe 0:

        Device              Start Block  Dbase State        Hot Spare

        c0t0d0s0                   0     No    Maintenance 



d20: Submirror of d0

    State: Okay        

    Size: 13423200 blocks

    Stripe 0:

        Device              Start Block  Dbase State        Hot Spare

        c0t1d0s0                   0     No    Okay        



d1: Mirror

    Submirror 0: d11

      State: Needs maintenance

    Submirror 1: d21

      State: Okay        

    Pass: 1

    Read option: roundrobin (default)

    Write option: parallel (default)

    Size: 2100000 blocks


d11: Submirror of d1

    State: Needs maintenance

    Invoke: metareplace d1 c0t0d0s1 <new device>

    Size: 2100000 blocks

    Stripe 0:

        Device              Start Block  Dbase State        Hot Spare

        c0t0d0s1                   0     No    Maintenance 



d21: Submirror of d1

    State: Okay        

    Size: 2100000 blocks

    Stripe 0:

        Device              Start Block  Dbase State        Hot Spare

        c0t1d0s1                   0     No    Okay        



d4: Mirror

    Submirror 0: d14

      State: Needs maintenance

    Submirror 1: d24

      State: Okay        

    Pass: 1

    Read option: roundrobin (default)

    Write option: parallel (default)

    Size: 2100000 blocks


d14: Submirror of d4

    State: Needs maintenance

    Invoke: metareplace d4 c0t0d0s4 <new device>

    Size: 2100000 blocks

    Stripe 0:

        Device              Start Block  Dbase State        Hot Spare

        c0t0d0s4                   0     No    Maintenance 



d24: Submirror of d4

    State: Okay        

    Size: 2100000 blocks

    Stripe 0:

        Device              Start Block  Dbase State        Hot Spare

        c0t1d0s4                   0     No    Okay        
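On a host with many metadevices, the "Invoke:" hints in the metastat output above are the quickest way to see what needs repair. The sketch below extracts them. The function name maintenance_components is hypothetical, and it assumes the "Invoke: metareplace <mirror> <slice> <new device>" line format shown in this transcript.

```shell
# Sketch: extract "mirror slice" pairs from metastat output on stdin.
maintenance_components() {
    awk '$1 == "Invoke:" && $2 == "metareplace" { print $3, $4 }'
}

# Typical use: print (not run) the in-place re-enable commands.
#   metastat | maintenance_components | while read -r mirror slice; do
#       echo "metareplace -e $mirror $slice"
#   done
```

The loop only echoes the commands, so they can be checked against the replaced disk before anything is executed.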


                 


Replace failed disk and restore redundancy


The administrator replaces the failed disk with a new disk of the same geometry. Depending on the system model, the disk replacement may require that the system be powered down. The replacement disk is then partitioned identically to the mirror (prtvtoc piped to fmthard), a boot block is installed on its root slice (installboot), and new state database replicas are created on it (metadb -a). Finally, the metareplace -e command resynchronizes each submirror, copying the data from the mirror to the replacement disk and restoring redundancy to the system.


# prtvtoc /dev/rdsk/c0t1d0s2 | fmthard -s - /dev/rdsk/c0t0d0s2

fmthard:  New volume table of contents now in place.


# installboot /usr/platform/sun4u/lib/fs/ufs/bootblk /dev/rdsk/c0t0d0s0


# metadb -f -a /dev/dsk/c0t0d0s5


# metadb -f -a /dev/dsk/c0t0d0s6


# metadb -i

        flags           first blk       block count

     a        u         16              1034            /dev/dsk/c0t0d0s5

     a        u         16              1034            /dev/dsk/c0t0d0s6

     a m  p  luo        16              1034            /dev/dsk/c0t1d0s5

     a    p  luo        16              1034            /dev/dsk/c0t1d0s6

o - replica active prior to last mddb configuration change

u - replica is up to date

l - locator for this replica was read successfully

c - replica's location was in /etc/lvm/mddb.cf

p - replica's location was patched in kernel

m - replica is master, this is replica selected as input

W - replica has device write errors

a - replica is active, commits are occurring to this replica

M - replica had problem with master blocks

D - replica had problem with data blocks

F - replica had format problems

S - replica is too small to hold current data base

R - replica had device read errors


# metareplace -e d0 c0t0d0s0

d0: device c0t0d0s0 is enabled


# metareplace -e d1 c0t0d0s1

d1: device c0t0d0s1 is enabled


# metareplace -e d4 c0t0d0s4

d4: device c0t0d0s4 is enabled
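The same sequence applies to other disk and slice layouts. As a rough sketch, the function below prints the commands for a given pair of disks so they can be reviewed first; it runs nothing itself. The name print_recovery_plan and its argument format are hypothetical, the sun4u bootblk path is taken from the example above and will differ on other platforms, and a POSIX shell is assumed.

```shell
# Sketch: print the recovery commands for review; nothing is executed.
# usage: print_recovery_plan GOOD_DISK NEW_DISK "REPLICA_SLICES" "MIRROR:SLICE ..."
print_recovery_plan() {
    good=$1
    new=$2
    replicas=$3
    pairs=$4

    # Copy the partition table from the surviving disk to the new one.
    echo "prtvtoc /dev/rdsk/${good}s2 | fmthard -s - /dev/rdsk/${new}s2"
    # Make the new disk bootable (sun4u path as in the example above).
    echo "installboot /usr/platform/sun4u/lib/fs/ufs/bootblk /dev/rdsk/${new}s0"
    # Recreate the state database replicas.
    for s in $replicas; do
        echo "metadb -f -a /dev/dsk/${new}s${s}"
    done
    # Re-enable each failed submirror component, which starts the resync.
    for p in $pairs; do
        echo "metareplace -e ${p%%:*} ${new}s${p##*:}"
    done
}
```

For the host in this article the call would be print_recovery_plan c0t1d0 c0t0d0 "5 6" "d0:0 d1:1 d4:4", which reproduces the seven commands shown in the transcript.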


               

Once the resync process is complete, operating system redundancy has been restored.
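Resync progress can be followed by rerunning metastat. The sketch below summarises it per mirror. It assumes that a resyncing mirror reports a line of the form "Resync in progress: N % done", which does not appear in the transcripts above and should be checked against the installed DiskSuite version; the function name resync_progress is hypothetical.

```shell
# Sketch: summarise resync progress from metastat output on stdin.
# Assumes "dN: Mirror" headers and "Resync in progress: N % done" lines.
resync_progress() {
    awk '
        /^d[0-9]+: Mirror/      { m = $1; sub(/:$/, "", m) }
        /Resync in progress:/   { n++; print m, $4 "%" }
        END                     { if (n == 0) print "no resync in progress" }
    '
}

# Typical use:
#   metastat | resync_progress
```

When the function reports that no resync is in progress and metastat shows every submirror in the Okay state, the mirrors are fully redundant again.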