- May 2008
Copyright 2008 by Robert Forsman <raid@thoth.purplefrog.com>
I have a house file server. I am routinely adding hard drives to it as I fill up the others. The main consumer of space is fingerprint-proof backups of my DVD collection. I'd rather not expose the disks to the hazards of handling*, so I rip them to hard drive and the originals go to live on the shelf.
* - You can find articles where people claim that a DVD can be damaged by the bending it endures when being removed from an unreasonably tight DVD case. There are also claims of DVD rot where the reflective layer corrodes, probably due to contaminants from the manufacturing process. I'm hoping to be able to enjoy my purchases long after the original physical media fails.
Since I really don't want to go to the hassle of re-ripping the collection when one of the hard drives fails (and it is a question of "when," not "if"), I use RAID5. To allow my filesystems to grow and move, I use LVM on top of the RAIDs. Recent Linux kernels can grow a RAID5 in place, but my house file server isn't running anything that recent.
[diagram: partitioning]
In the days of 300G drives, 64G chunks seemed like a good building block. Now that 750G drives are common and 1TB drives are on the shelf at Fry's, 128G seems a little more plausible. You can still fit 7 of those in a 1TB drive, which seems a little excessive, but reconstructing a 128G slice of a RAID already takes a couple of hours, and I don't have a fast enough machine to make me comfortable moving to 256G chunks.
In this scenario there are four RAID5 arrays (/dev/md[3567]), each made of three 64G chunks. There are also two mirrors (/dev/md[01]). These mirrors are physical volumes allocated to the mg20 volume group.
[diagram: allocation]
For educational purposes, this is how you could replicate this setup:
mdadm --create -l 1 -n 3 /dev/md0 /dev/sdb9 /dev/sdc9 missing
mdadm --create -l 1 -n 3 /dev/md1 /dev/sdb10 /dev/sdc10 missing
mdadm --create -l 5 -n 3 /dev/md3 /dev/sd[abc]5
mdadm --create -l 5 -n 3 /dev/md5 /dev/sd[abc]6
mdadm --create -l 5 -n 3 /dev/md6 /dev/sd[abc]7
mdadm --create -l 5 -n 3 /dev/md7 /dev/sd[abc]8
vi /etc/mdadm.conf
pvcreate /dev/md[013567] /dev/sda9 /dev/sdb11 /dev/sdc11 /dev/sdc12
vgcreate mg20 /dev/md[013567]
vgcreate vg20 /dev/sda9 /dev/sdb11 /dev/sdc11 /dev/sdc12
I use mirrors for data that I really don't want to lose (mail archives, CVS repository). I use RAID5 for stuff that would be inconvenient to lose (DVD backups). Even mirrors will not protect your data from an errant rm -rf or a software error. Make backups to another machine. If that machine can be in another state, even better. The mg20 volume group is all-RAID. Any volume created in that VG will be resistant to hardware failure.
The vg20 volume group has no RAID components. If one of the drives dies, any logical volume with extents on the failed drive will be destroyed. I use vg20 for backups of other computers in the house. If a drive fails, the other computer is probably still fine.
By creating two 128G partitions on the new 750G drive, I have seriously cramped my style. Incorporating them into a RAID where the other partitions are 64G would be a massive waste of space. However, since hard drives are getting ridiculously large, the 64G chunk size is becoming unwieldy. I will be buying more drives, they will be 750G or larger, and so I will have more 128G partitions.
I want the finished product to have the same amount of mirrored space. I mostly want to expand the RAID5 space so I can rip the next DVDs I buy. Let's look at the goal:
[diagram: goal]
You'll notice that MD1 and MD0 need to be shuffled around to make room for the MD2 RAID5. Also, all the old 3-drive RAID5s are being replaced by 4-drive RAID5s. The fact that /dev/sdd[56] are 128G kind of damages the aesthetics, but defective aesthetics are nothing new in the world of computers.
Achieving this goal requires a significant amount of gymnastics, which is the point of this article.
The rough sequence you will find useful is this: move elements of the existing mirrors out of the way to free up partitions for a new RAID5, then pvmove the data off the old RAID5s and recycle their partitions into new, larger RAID5s.
mdadm /dev/md0 -a /dev/sdd13
mdadm /dev/md1 -a /dev/sdd12
At this point you need to leave the computer alone for an hour or three. It will be busy reading data from the old partitions in each mirror and copying it to the new ones. The kernel is smart enough to do one reconstruction at a time. If you make the mistake of doing pvmoves at the same time as a RAID reconstruction, you will cause the operating system to thrash the heads on the hard drive as it deals with two separate subsystems each imposing massive IO loads on different sections of the disk. That's great for triggering infant mortality (if you believe in that for modern hard drives), but it will turn a 2-hour operation into a 20-hour operation.
A simple while sleep 60; do cat /proc/mdstat; done should keep you apprised of the progress of the mirror "reconstruction". The kernel is even kind enough to provide you with an estimated time of completion for each in-progress reconstruction.
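Spelled out, that loop is just the following; watch -n 60 cat /proc/mdstat is an equivalent alternative if you have watch installed:
while sleep 60; do
    cat /proc/mdstat
done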
[diagram: intermediate state]
Once your mirrors are fully active on 3 drives, you can deactivate the old partitions.
mdadm /dev/md0 -f /dev/sdc9 -r /dev/sdc9
mdadm /dev/md1 -f /dev/sdc10 -r /dev/sdc10
We had a choice of deactivating /dev/sdc9 or /dev/sdb9 from MD0. We intend to put MD0 on /dev/sdc10. If we had deactivated /dev/sdb9 then we would be copying data from sdd13 and sdc9 onto sdc10. Copying data from sdc9 to sdc10 would cause head thrashing.
[diagram: intermediate state]
mdadm /dev/md0 -a /dev/sdc10
Once sdc10 has finished reconstructing, you can retire sdb9:
mdadm /dev/md0 -f /dev/sdb9 -r /dev/sdb9
[diagram: intermediate state]
mdadm --create -l 5 -n 3 /dev/md2 /dev/sd[bcd]9
vi /etc/mdadm.conf
[diagram: intermediate state]
Reviewing our TODO list, we find that we still need to replace MD3, 5, 6, and 7 with 4-drive RAID5s. If I had a modern version of Linux, I might be able to expand them in place, but since I do not, I have an opportunity to exercise the LVM layer of my setup. Since the new MD2 is the same size as MD3, we can easily ask LVM to relocate all the data.
pvcreate /dev/md2
vgextend mg20 /dev/md2
pvmove /dev/md3 /dev/md2
If there were not enough room on /dev/md2 to hold all of the /dev/md3 data, pvmove has a syntax for relocating only a subset of the extents. We will touch on that later in the process.
Once the pvmove from /dev/md3 is complete we can deallocate those partitions and build the replacement /dev/md4 RAID.
vgreduce mg20 /dev/md3
pvremove /dev/md3
mdadm --stop /dev/md3
mdadm --misc --zero-superblock /dev/sd[abc]5
mdadm --create -l 5 -n 4 /dev/md4 /dev/sd[abc]5 /dev/sdd11
vi /etc/mdadm.conf
pvcreate /dev/md4
vgextend mg20 /dev/md4
Be super-careful with the --zero-superblock command. If you use it on the wrong partitions, bad things will happen.
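One way to double-check is to examine each partition first and confirm that it reports the UUID of the array you just stopped; mdadm --examine prints the RAID superblock of a component, or complains if there is none:
mdadm --examine /dev/sda5
mdadm --examine /dev/sdb5
mdadm --examine /dev/sdc5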
[diagram: intermediate state]
Now that /dev/md4 is part of the mg20 volume group we can relocate /dev/md5's data onto it and then recycle MD5 into MD8.
pvmove /dev/md5 /dev/md4
vgreduce mg20 /dev/md5
pvremove /dev/md5
mdadm --stop /dev/md5
mdadm --misc --zero-superblock /dev/sd[abc]6
mdadm --create -l 5 -n 4 /dev/md8 /dev/sd[abc]6 /dev/sdd10
vi /etc/mdadm.conf
pvcreate /dev/md8
vgextend mg20 /dev/md8
[diagram: intermediate state]
Now we are in an interesting position. MD4 was larger than MD5, so it will have some unallocated physical extents. You can see how many extents are free with the pvdisplay command. Here's a sample output from my laptop:
File descriptor 4 left open
  --- Physical volume ---
  PV Name               /dev/sda6
  VG Name               vg80
  PV Size               227.37 GB / not usable 825.00 KB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              58206
  Free PE               38119
  Allocated PE          20087
  PV UUID               cPQtZ4-GloJ-FRMg-3B6j-42C8-Cpo0-X2ZH8K
Notice the Free PE. Those are the units of allocation for LVM. On this PV they are 4096K chunks. I am not sure what would happen if you had volumes with different sizes for the PEs. Since the MD5 RAID had 128G of space and the MD4 RAID had 192G, even if MD5 had been full there would still be 64G of space remaining on MD4, which works out to more than 16,000 extents reported under Free PE. If you wanted to "compact" your PVs, you could issue the following command to fill up the rest of MD4 with extents from MD6.
pvmove /dev/md6:0-16300 /dev/md4
Just replace 16300 with the number of free extents on /dev/md4 minus 1. That should fill /dev/md4 (unless MD4 had some holes in its allocation map, but that's a topic for advanced readers). You can then move the other half of MD6 into the fresh MD8.
pvmove /dev/md6 /dev/md8
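Incidentally, rather than reading the Free PE number off pvdisplay by eye when constructing such an extent range, you can pull it out with awk (this assumes the Free PE line format shown in the sample output above):
pvdisplay /dev/md4 | awk '/Free PE/ {print $3}'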
Now deallocate MD6 and build MD9.
vgreduce mg20 /dev/md6
pvremove /dev/md6
mdadm --stop /dev/md6
mdadm --misc --zero-superblock /dev/sd[abc]7
mdadm --create -l 5 -n 4 /dev/md9 /dev/sd[abcd]7
vi /etc/mdadm.conf
pvcreate /dev/md9
vgextend mg20 /dev/md9
[diagram: intermediate state]
The only step left is to empty MD7 and recycle it into MD10. Even though MD8 got the last half of MD6's data, it probably still has enough space to absorb all of MD7.
pvmove /dev/md7 /dev/md8
Again we do the RAID5 recycle dance:
vgreduce mg20 /dev/md7
pvremove /dev/md7
mdadm --stop /dev/md7
mdadm --misc --zero-superblock /dev/sd[abc]8
mdadm --create -l 5 -n 4 /dev/md10 /dev/sd[abcd]8
vi /etc/mdadm.conf
pvcreate /dev/md10
vgextend mg20 /dev/md10
[diagram: intermediate state]
pvcreate /dev/sdd14 /dev/sdd[56]
vgextend vg20 /dev/sdd14 /dev/sdd[56]
We've added over 250G to vg20 (the non-RAID volume group) and have 384G of new usable RAID space (plus 128G of checksum).
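A quick sanity check with vgdisplay should now show the enlarged totals in both volume groups:
vgdisplay mg20
vgdisplay vg20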
The only guidance I can offer you is that when you create mirrors, don't put them on partitions with the same number. That will make it easier to create a RAID5 that is "pretty".
Then again, you might not care about "pretty".
To be fair, this dance probably took about 1 day, with about an hour of it requiring the operator's attention. This is probably why people pay big money to Netapp and EMC. I assume their software handles crap like this automatically.
In the free software world I'm sure someone somewhere has at least an experimental system for managing this kind of thing automatically and by the time you get around to reading this article it might be ready for early adopters.
If you have a kernel recent enough to grow a RAID5 in place, much of the dance above could be replaced by expanding each array directly; for MD3 it would look something like this:
mdadm /dev/md3 -a /dev/sdd11
mdadm --grow /dev/md3 --raid-devices=4
pvresize /dev/md3
This would save you the trouble of destroying the RAID arrays and building replacements with more disk. You would still have to use the pvresize command to make the physical volumes fill out their enlarged RAID devices.
Another result would be fragmentation, which is uninteresting to sysadmins of modern operating systems. If you object to fragmentation on aesthetic grounds, you can use pvmove to make your logical volumes occupy contiguous extents.
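If you are curious how fragmented a particular logical volume ended up, lvdisplay can show its segment-by-segment mapping onto physical volumes; the volume name here is just a placeholder:
lvdisplay --maps /dev/mg20/some_volume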
alexandria thoth # fdisk -l /dev/sda

Disk /dev/sda: 300.0 GB, 300069052416 bytes
255 heads, 63 sectors/track, 36481 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1           2       16033+  83  Linux
/dev/sda2               3          35      265072+  82  Linux swap / Solaris
/dev/sda3              36         166     1052257+  83  Linux
/dev/sda4             167       36481   291700237+   5  Extended
/dev/sda5             167        8325    65537136   fd  Linux raid autodetect
/dev/sda6            8326       16484    65537136   fd  Linux raid autodetect
/dev/sda7           16485       24643    65537136   fd  Linux raid autodetect
/dev/sda8           24644       32803    65545168+  fd  Linux raid autodetect
/dev/sda9           32804       36481    29543503+  8e  Linux LVM
I think it is possible to resize a partition containing a PV without data loss, but I have not experimented with it. You would then use the pvresize tool to expand into the new space. Resizing RAID is covered later in this document and is probably not worth the hassle.
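As a sketch (assuming the partition you enlarged were /dev/sda9, the LVM partition from the listing above), the follow-up would be a single command:
pvresize /dev/sda9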
You will also want to deactivate anything that lives on the disk after the partition you will be altering. This often means unmounting all logical volumes and deactivating all volume groups (vgchange -a n). Also stop any RAIDs that have pieces on that disk (mdadm --stop /dev/mdwhatever).
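For this particular box that boils down to something like the following; the mount point is just an example, and you would stop whichever arrays actually have pieces on the disk being repartitioned:
umount /mnt/dvds            # example mount point; unmount whatever sits on affected logical volumes
vgchange -a n mg20
vgchange -a n vg20
mdadm --stop /dev/md7       # repeat for every array with a partition on this disk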
Following that, use fdisk to edit the partition table, creating a replacement for /dev/sda5 whose start matches the old sda5 and whose end matches the old sda6. Before you write the partition table to disk, print it out and make sure it looks exactly how you want it to. Partitions that are not being consolidated may end up with different numbers, but their start and end should be the same.
Command (m for help): p

Disk /dev/sda: 300.0 GB, 300069052416 bytes
255 heads, 63 sectors/track, 36481 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1           2       16033+  83  Linux
/dev/sda2               3          35      265072+  82  Linux swap / Solaris
/dev/sda3              36         166     1052257+  83  Linux
/dev/sda4             167       36481   291700237+   5  Extended
/dev/sda5           16485       24643    65537136   fd  Linux raid autodetect
/dev/sda6           24644       32803    65545168+  fd  Linux raid autodetect
/dev/sda7           32804       36481    29543503+  8e  Linux LVM
/dev/sda8             167       16484    65537136   fd  Linux raid autodetect
If the fact that the partitions are out of order bothers you, go ahead and delete and recreate the partitions that come after the resized one. Initial experiments indicate that LVM and mdadm do not care: both identify their pieces by UUIDs stored on the partitions themselves, so they can find them no matter how the devices are numbered.
You can now write the partition table to disk, but it is very common for the operating system to be unable to accept the new partition table right away.
Anything that references partitions that have been renumbered (sda7, 8, and 9 became 5, 6, and 7, while the old sda5 should not be referenced by anything any more) will have to be updated. If you're using these partitions purely for LVM, then I don't know of any files that reference partitions by name: LVM scans the hard drives and picks up anything that has the right partition ID and a PV header. For RAIDs, your /etc/mdadm.conf should be identifying arrays with a UUID=, which means no partition names appear there. If your mdadm.conf does use partition names, you will have to adjust them. If you are using any of the partitions for a regular filesystem, you'll have to update /etc/fstab.
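If you do need to rebuild the ARRAY lines, mdadm can print UUID-based ones for every array it currently has running; review the output before merging it into /etc/mdadm.conf:
mdadm --detail --scan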
If you need to reboot you can do it now. The freshly booted kernel will read your new partition table. If you screwed anything up, be ready to do some hardcore troubleshooting. A printout of your old partition table will probably save your butt.
(If the repartitioned disk only had LVM and RAID on it, you probably do not need a reboot. A reboot is only required if the kernel had a locked copy of the partition table in RAM for its own safety, which I think only happens if there are normal filesystems currently mounted from that drive.)
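If you want to try skipping the reboot, you can ask the kernel to re-read the table once everything on the disk has been deactivated. Either of the following should work (blockdev ships with util-linux, partprobe with parted), and both will complain rather than do anything dangerous if something still holds a partition open:
blockdev --rereadpt /dev/sda
partprobe /dev/sda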
The problem is that the RAID superblock lives at the end of each partition (which makes it easier for LILO and GRUB to boot from a RAID1 mirror; they don't even realize it is part of a RAID). As a result, each partition resize wrecks that element of the RAID (because the superblock ends up in the middle of the expanded partition, not at the end) and it must be reconstructed. During this reconstruction phase you are vulnerable to a disk failure, at the same time that you are thrashing the bejeezus out of several hard drives.
Let us imagine a RAID5 called /dev/md0 built from /dev/sda5, b5, and c5. Let us also imagine that you have emptied out the partitions after each of them in preparation for expansion.
# mdadm --stop /dev/md0
# fdisk /dev/sda
expand /dev/sda5.
# mdadm --assemble /dev/md0
# mdadm /dev/md0 -a /dev/sda5
# while grep recovery /proc/mdstat; do sleep 60; done
this should take a while. Large RAIDs can take hours.
# cat /proc/mdstat
review this output to make sure the RAID is in a good state.
# mdadm --stop /dev/md0
# fdisk /dev/sdb
expand /dev/sdb5.
# mdadm --assemble /dev/md0
# mdadm /dev/md0 -a /dev/sdb5
# while grep recovery /proc/mdstat; do sleep 60; done
# cat /proc/mdstat
# mdadm --stop /dev/md0
# fdisk /dev/sdc
expand /dev/sdc5.
# mdadm --assemble /dev/md0
# mdadm /dev/md0 -a /dev/sdc5
# while grep recovery /proc/mdstat; do sleep 60; done
# cat /proc/mdstat
# mdadm --stop /dev/md0
# mdadm --assemble /dev/md0
# mdadm --grow /dev/md0 -z max
the array must be assembled and running for --grow to take effect.
# cat /proc/mdstat
Have I talked you out of it yet?