Battle Against Any Raid Five/Four
My primary motivation for writing this is to get the idea off my chest, and then let someone point out the original implementation of what I have independently reinvented.
I am considering the following independent layers of storage:
layer | examples
---|---
filesystem | ext3, reiser, xfs, zfs
LVM | lv/vg/pv
RAID | striping, mirroring, checksums
disk partitions |
The separation of the layers simplifies the code of each layer, makes it easier to test, reduces the number of dark corners in which bugs will hide, and makes it easier to do something new with each layer.
However, there are inefficiencies that come from these nice clean separations. When you perform a pvmove in the LVM layer, it will move blocks that the filesystem is not using; strictly speaking, this is a waste of time and spindle-bearing life. When you add a drive to a RAID (either to replace a failed drive or to move partitions around), it computes checksums for blocks that are not in use by the LVM layer.
The simplicity afforded by the separation of layers has performance costs.
I routinely (once or twice a year) add a hard drive to my house file server to expand its capacity. Integrating the new drive means performing operations at several layers of the storage system (mostly the LVM and RAID layers). I think that merging LVM and RAID would simplify this expansion.
(Diagram: five drives, sda through sde. The beginning of each drive holds RAID5, mirror, and stripe extents, the middle is an unused grey region, and the end holds plain extents.)
I had originally considered separating mirrors and RAID5, even placing mirrors at the end of the disk, but decided that probably wouldn't be a good match with most use cases.
If you decide you need more RAID5 (maybe you bought the boxed set of Stargate SG-1) you just ask the system to allocate some more RAID5 extents from the beginning of the grey area.
If it is the beginning of the month and you are allocating a new monthly backups partition, you just allocate some extents from the end of the grey area.
If you start a new art project and need to expand your home directory with more mirrored space, allocate from the beginning of the grey area.
You can think of it as a highly granular mix of RAID and LVM: you create RAIDs that occupy a small subset of a larger partition, and each partition holds a mix of RAID5, mirror, stripe, and plain extents.
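A minimal sketch of that allocation policy in Python, with entirely invented names: redundant (RAID5/mirror) extents are carved from the low end of the free grey region and plain extents from the high end, so the two kinds grow toward each other.

```python
# Hypothetical sketch of the two-ended extent allocator described above.
# Block ranges are half-open (start, end) tuples; all names are invented.

class GreyRegion:
    """Free region between the RAID extents (low end) and plain extents (high end)."""

    def __init__(self, start_block, end_block):
        self.low = start_block    # next free block for RAID5/mirror extents
        self.high = end_block     # one past the last free block for plain extents

    def free_blocks(self):
        return self.high - self.low

    def alloc_raid(self, blocks):
        """Allocate a RAID5/mirror extent from the beginning of the grey area."""
        if blocks > self.free_blocks():
            raise ValueError("not enough contiguous free space")
        start = self.low
        self.low += blocks
        return (start, self.low)

    def alloc_plain(self, blocks):
        """Allocate a plain extent from the end of the grey area."""
        if blocks > self.free_blocks():
            raise ValueError("not enough contiguous free space")
        self.high -= blocks
        return (self.high, self.high + blocks)


# Example: more RAID5 for the Stargate box set, then a new monthly backup partition.
grey = GreyRegion(start_block=10_000, end_block=90_000)
sg1_extent = grey.alloc_raid(5_000)      # grows from the low end
backup_extent = grey.alloc_plain(2_000)  # grows from the high end
```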
I consider it unlikely that I would use pure striping for any of my data.
(Diagram: six drives, sda through sdf, after adding sdf. The beginning of each drive holds RAID5 and mirror extents, the middle is unused grey space, and the end holds striped and plain extents.)
Considering just the RAID section of the disk, we have a couple of alternatives:
If there were a RAID5 extent that was blocks 0-99 of sd[abcde], then after the expansion it would occupy blocks 0-83 of sd[abcdef]: the same 500 raw blocks spread across six drives works out to 84 blocks per drive (500/6, rounded up). If there were a RAID10 array occupying blocks 100-119 of sd[abcde], it would be rewritten in blocks 84-100 of sd[abcdef] (100 raw blocks over six drives is 17 per drive, rounded up).
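Under the assumption (consistent with the numbers above, but my own reading) that a repacked extent keeps its total raw block count and is simply re-striped across the larger set of drives, the arithmetic is a single ceiling division. A minimal sketch in Python, with invented names:

```python
import math

def repack(start, end, old_disks, new_disks):
    """Re-stripe an extent's raw blocks across a larger set of disks.

    The extent occupies blocks start..end (inclusive) on each of old_disks
    drives.  After adding drives it keeps the same raw block count spread
    over new_disks drives, so it needs ceil(total / new_disks) blocks per
    drive.  Returns the new per-drive block count.
    """
    per_disk = end - start + 1
    total = per_disk * old_disks
    return math.ceil(total / new_disks)

# RAID5 extent on blocks 0-99 of sd[abcde] -> blocks 0-83 of sd[abcdef]
r5 = repack(0, 99, old_disks=5, new_disks=6)      # 84 blocks per drive
# RAID10 extent on blocks 100-119 of sd[abcde] -> packed right after the RAID5
r10 = repack(100, 119, old_disks=5, new_disks=6)  # 17 blocks per drive
print(f"RAID5: 0-{r5 - 1}, RAID10: {r5}-{r5 + r10 - 1}")  # RAID5: 0-83, RAID10: 84-100
```

Note that because a six-drive RAID5 gives up a smaller fraction of its space to checksums than a five-drive one, the repacked RAID5 extent would actually expose a little more usable space than it did before.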
Recent Linux kernels already have the ability to add a drive to a RAID5. It is not a stretch to imagine code that could reshape these more granular RAIDs.
(Diagram: six drives, sda through sdf. A RAID5 region spans all six drives at the beginning, followed by a mirror region, followed by unused space.)
The other alternative is to fracture the RAID5 extent into two smaller RAID5 extents rather than widening it. Where this breakpoint is would be a matter of taste, but probably the smallest breakpoint for RAID5 would be the conversion from 5 to 6 drives. Before the fracture you would have 4+1, and after the fracture you would have 2+1 & 2+1. The annoying part about this fracture operation is that it doesn't actually expand the available data space (2+2=4), because the new drive is consumed by a new checksum (from +1 to +1&+1).
(Diagram: six drives, sda through sdf, after a fracture. Two separate RAID5 regions sit side by side, one on the first three drives and one on the last three, with unused space below.)
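To make the capacity arithmetic concrete, here is a tiny sketch (my own, with made-up names) comparing widening a five-drive RAID5 to six drives against fracturing it into two three-drive RAID5s:

```python
def raid5_data_disks(total_disks):
    """A RAID5 across n drives holds n-1 drives' worth of data plus one of checksums."""
    return total_disks - 1

before = raid5_data_disks(5)                           # 4 (the original 4+1)
widened = raid5_data_disks(6)                          # 5 (a single 5+1)
fractured = raid5_data_disks(3) + raid5_data_disks(3)  # 4 (2+1 and 2+1)

print(before, widened, fractured)  # 4 5 4 -- the fracture gains no data space
```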
The important thing motivating a fracture is that the number of logical blocks affected by a potential hard drive failure would be reduced: lose a drive in a single six-drive RAID5 and every block in the extent is running degraded, whereas after a fracture into two three-drive RAID5s only about half of them are. Actually planning the block copies for a fracture of a live extent is an interesting problem; I have not given it much thought.
If you need more RAID space than you can allocate from the contiguous grey area, you can shuffle some plain extents from the beginning of the green (plain) section onto the end of sdf.
If you are not in immediate need of disk space, you could just leave the plain extents where they are and allow new plain extents to be allocated at the tail end of sdf. This strategy works best if you are routinely allocating new plain extents (backups, periodic archives, whatever).
The RAID chunks at the beginning of the disks would be roughly analogous to physical volumes (PVs) in an LVM system. Each would include a list of the disks involved and the block ranges used on each. The metadata would also have to be capable of representing both expand+pack and fracture transition states in a journalled fashion; a fracture would result in the creation of two PVs.
Since RAID5 is designed to protect against single-drive failure, you could keep copies of the metadata on just two drives; if both of those drives fail you are boned anyway. Alternatively, disk is relatively cheap, so you might as well keep copies on all of the involved disks, if only to document your boned-ness.
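A rough sketch of what such a PV-like metadata record might carry; every name here is hypothetical, but the essential pieces are the RAID personality, the per-disk block ranges, and a journalled record of any in-flight expand+pack or fracture:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DiskRange:
    disk: str    # e.g. "sdb"
    start: int   # first block of this PV on that disk
    end: int     # last block of this PV on that disk

@dataclass
class Transition:
    kind: str                  # "expand+pack" or "fracture"
    target: List[DiskRange]    # the block ranges being written to
    journal_cursor: int = 0    # how far the block copies have progressed

@dataclass
class RaidPV:
    name: str                                   # e.g. "pvr5_6"
    level: str                                  # "raid5", "mirror", "stripe", "plain"
    ranges: List[DiskRange] = field(default_factory=list)
    in_flight: Optional[Transition] = None      # journalled transition state

    def member_disks(self):
        return [r.disk for r in self.ranges]

# A fracture would retire one RaidPV record and create two new ones; copies of
# the records would live on at least two (or all) of the member disks.
```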
The end of each disk can be imagined as its own PV with logical volumes allocating chunks from within each PV.
Creating a new 300 GB RAID5 PV across the five drives and putting an ext3 home filesystem on it might look something like this:

```
# pvqcreate -l 5 -n pvr5_6 -L 300G --vg mg00 /dev/sd[abcde]5
# lvqcreate -n lv_homes5 --all mg00 pvr5_6
# mkfs.ext3 /dev/mg00/lv_homes5
# mount /dev/mg00/lv_homes5 /homes5
```
Extending an existing volume by allocating a new 50 GB RAID10 PV from the same drives might look like this:

```
# pvqcreate -l 1+0 -n pvr1_7 -L 50G --vg mg00 /dev/sd[abcde]5
# lvqextend --all pvr1_7 mg00/lv_homes2
# resize2fs /dev/mg00/lv_homes2
```
Adding the new drive sdf and expanding the existing PVs onto it might look like this:

```
# pvqextend mg00 --old /dev/sd[abcde]5 --new /dev/sdf5 --pvs pvr1_0 pvr1_1 pvr5_2 pvr5_3 pvr5_4 pvr5_5 pvr5_6 pvr1_7
```
All of the PVs listed at the end of that command which have space on each of sd[abcde]5 would be expanded to include sdf5.
The syntax seems complicated because I have to consider the case that some PVs will not span all five drives; those should not be expanded onto /dev/sdf by that command.
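The selection rule itself could be as simple as the following sketch (hypothetical names, and the example PV layouts are invented): only PVs that already occupy every one of the old partitions get expanded onto the new one.

```python
def expandable_pvs(pvs, old_partitions, requested_names):
    """Pick the PVs named on the command line that span every old partition.

    pvs: mapping of PV name -> list of partitions the PV currently occupies
    old_partitions: e.g. ["sda5", "sdb5", "sdc5", "sdd5", "sde5"]
    requested_names: the PV names listed at the end of the pvqextend command
    """
    old = set(old_partitions)
    return [name for name in requested_names if old.issubset(pvs[name])]

pvs = {
    "pvr5_6": ["sda5", "sdb5", "sdc5", "sdd5", "sde5"],  # spans all five: expand
    "pvr1_7": ["sda5", "sdb5", "sdc5", "sdd5", "sde5"],  # spans all five: expand
    "pvr1_0": ["sda5", "sdb5"],                          # only two drives: skip
}
old = ["sda5", "sdb5", "sdc5", "sdd5", "sde5"]
print(expandable_pvs(pvs, old, ["pvr1_0", "pvr1_7", "pvr5_6"]))  # ['pvr1_7', 'pvr5_6']
```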
The following command shuffles some plain extents (the LVs whose names begin with bk_) onto the new drive:
```
# pvqmove --pack-plain mg00 /dev/sd[abcdef]5 -- `lvqdisplay -c | awk -F: '{print $1}' | grep /bk_`
```
The --pack-plain wizard of the pvqmove command would find all plain PVs (the non-RAID extents, which live at the ends of the disks) in mg00 on the six drives and, one by one, relocate the one closest to the beginning of a disk into a free chunk near the end of any disk. It would stop as soon as a relocation would cause the relocated chunk to start earlier than its existing location.
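A sketch of that greedy loop (again with invented names and a deliberately simplified model of free space), packing single-disk plain PVs toward the ends of the drives:

```python
def pack_plain(plain_pvs, free_chunks):
    """Greedy relocation loop behind a hypothetical `pvqmove --pack-plain`.

    plain_pvs:   {name: (disk, start_block, size)} for single-disk plain PVs
    free_chunks: list of (disk, start_block, size) free regions near disk ends
    Returns the relocations performed as (name, old_location, new_location).
    """
    moves = []
    while plain_pvs:
        # The plain PV that currently starts earliest on its disk.
        name = min(plain_pvs, key=lambda n: plain_pvs[n][1])
        disk, start, size = plain_pvs[name]

        # The free chunk that starts latest on any disk and can hold the PV
        # (a stand-in for "closest to the end of a disk").
        candidates = [c for c in free_chunks if c[2] >= size]
        if not candidates:
            break
        tgt = max(candidates, key=lambda c: c[1])

        # Stop as soon as a relocation would not move the chunk later.
        if tgt[1] <= start:
            break

        moves.append((name, (disk, start), (tgt[0], tgt[1])))
        free_chunks.remove(tgt)
        free_chunks.append((disk, start, size))   # the old location becomes free
        del plain_pvs[name]
    return moves

# Example: one plain backup PV at the front of sda5, free space at the end of sdf5.
pvs = {"bk_monthly": ("sda5", 0, 1_000)}
free = [("sdf5", 90_000, 5_000)]
print(pack_plain(pvs, free))  # [('bk_monthly', ('sda5', 0), ('sdf5', 90000))]
```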
While there are commercial products which do this for large organizations right now, it might be hard to accomplish with off-the-shelf parts. I imagine a tower case with 8 drives inside, plus four one-lane PCI Express eSATA cards with four ports each; those sixteen ports could support eight Sans Digital TR8M chassis with 8 drives apiece, giving you 72 drives in total. Start filling up the plain PCI slots on your motherboard and you could break 100.
In my personal situation I have a TR4U with a total of 1500 GB of disk inside it. I could replace the entire unit today with a single hard drive that costs less than the empty chassis.
The point at which it's cheaper to retire a disk drive than buy more chassis will vary based on how fast you buy disk. For me the horizon is probably 8 disks.
I would recommend that you not make a volume group that spans two chassis. If a power supply or cable failed on one, it would be quite inconvenient to have to leave otherwise healthy disks off-line because they were part of a RAID5 that included the crippled chassis.