Battle Against Any Raid Five/Four
My primary motivation for writing this is to get the idea off my chest, and then let someone point out the original implementation of what I have independently reinvented.
I am considering the following independent layers of storage:
layer | examples
---|---
filesystem | ext3, reiser, xfs, zfs
LVM | lv/vg/pv
RAID | striping, mirroring, checksums
disk partitions |
The separation of the layers simplifies the code of each layer, makes it easier to test, reduces the number of dark corners in which bugs will hide, and makes it easier to do something new with each layer.
However, there are inefficiencies that come from these nice clean separations. When you perform a pvmove in the LVM layer, it will move blocks that the filesystem is not using; strictly speaking, this is a waste of time and spindle-bearing life. When you add a drive to a RAID (either to replace a failed drive or to move partitions around), it computes checksums for blocks that are not in use by the LVM layer.
The simplicity afforded by the separation of layers has performance costs.
I routinely (once or twice a year) add a hard drive to my house file server to expand its capacity. Integrating the new drive means performing operations at several layers of the storage system (mostly the LVM and RAID layers). I think that merging LVM and RAID would simplify this expansion.
(Diagram: five drives, sda through sde. The beginning of each drive holds RAID5, mirror, and stripe extents, the middle is an unused grey region, and the end holds plain extents.)
I had originally considered separating mirrors and RAID5, even placing mirrors at the end of the disk, but decided that probably wouldn't be a good match with most use cases.
If you decide you need more RAID5 (maybe you bought the boxed set of Stargate SG-1) you just ask the system to allocate some more RAID5 extents from the beginning of the grey area.
If it is the beginning of the month and you are allocating a new monthly backups partition, you just allocate some extents from the end of the grey area.
If you start a new art project and need to expand your home directory with more mirrored space, allocate from the beginning of the grey area.
You can think of it as a highly granular mix of RAID and LVM: you create RAIDs that occupy a small subset of a larger partition, and each partition holds a mix of RAID5, mirror, stripe, and plain extents.
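A minimal sketch of that allocation policy in Python, with entirely invented names: redundant (RAID5/mirror) extents are carved from the low end of the free grey region and plain extents from the high end, so the two kinds grow toward each other.

```python
# Hypothetical sketch of the two-ended extent allocator described above.
# Block ranges are half-open (start, end) tuples; all names are invented.

class GreyRegion:
    """Free region between the RAID extents (low end) and plain extents (high end)."""

    def __init__(self, start_block, end_block):
        self.low = start_block    # next free block for RAID5/mirror extents
        self.high = end_block     # one past the last free block for plain extents

    def free_blocks(self):
        return self.high - self.low

    def alloc_raid(self, blocks):
        """Allocate a RAID5/mirror extent from the beginning of the grey area."""
        if blocks > self.free_blocks():
            raise ValueError("not enough contiguous free space")
        start = self.low
        self.low += blocks
        return (start, self.low)

    def alloc_plain(self, blocks):
        """Allocate a plain extent from the end of the grey area."""
        if blocks > self.free_blocks():
            raise ValueError("not enough contiguous free space")
        self.high -= blocks
        return (self.high, self.high + blocks)


# Example: more RAID5 for the Stargate box set, then a new monthly backup partition.
grey = GreyRegion(start_block=10_000, end_block=90_000)
sg1_extent = grey.alloc_raid(5_000)      # grows from the low end
backup_extent = grey.alloc_plain(2_000)  # grows from the high end
```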
I consider it unlikely that I would use pure striping for any of my data.
(Diagram: six drives, sda through sdf, after adding sdf. The beginning of each drive holds RAID5 and mirror extents, the middle is unused grey space, and the end holds striped and plain extents.)
Considering just the RAID section of the disk, we have a couple of alternatives:
If there were a RAID5 extent that was blocks 0-99 of sd[abcde], then after the expansion it would occupy blocks 0-83 of sd[abcdef]: the same 500 raw blocks spread across six drives works out to 84 blocks per drive (500/6, rounded up). If there were a RAID10 array occupying blocks 100-119 of sd[abcde], it would be rewritten in blocks 84-100 of sd[abcdef] (100 raw blocks over six drives is 17 per drive, rounded up).
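Under the assumption (consistent with the numbers above, but my own reading) that a repacked extent keeps its total raw block count and is simply re-striped across the larger set of drives, the arithmetic is a single ceiling division. A minimal sketch in Python, with invented names:

```python
import math

def repack(start, end, old_disks, new_disks):
    """Re-stripe an extent's raw blocks across a larger set of disks.

    The extent occupies blocks start..end (inclusive) on each of old_disks
    drives.  After adding drives it keeps the same raw block count spread
    over new_disks drives, so it needs ceil(total / new_disks) blocks per
    drive.  Returns the new per-drive block count.
    """
    per_disk = end - start + 1
    total = per_disk * old_disks
    return math.ceil(total / new_disks)

# RAID5 extent on blocks 0-99 of sd[abcde] -> blocks 0-83 of sd[abcdef]
r5 = repack(0, 99, old_disks=5, new_disks=6)      # 84 blocks per drive
# RAID10 extent on blocks 100-119 of sd[abcde] -> packed right after the RAID5
r10 = repack(100, 119, old_disks=5, new_disks=6)  # 17 blocks per drive
print(f"RAID5: 0-{r5 - 1}, RAID10: {r5}-{r5 + r10 - 1}")  # RAID5: 0-83, RAID10: 84-100
```

Note that because a six-drive RAID5 gives up a smaller fraction of its space to checksums than a five-drive one, the repacked RAID5 extent would actually expose a little more usable space than it did before.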
Recent Linux kernels already have the ability to add a drive to a RAID5. It is not a stretch to imagine code that could reshape these more granular RAIDs.
(Diagram: six drives, sda through sdf. A RAID5 region spans all six drives at the beginning, followed by a mirror region, followed by unused space.)
The other alternative is to fracture the RAID5 extent into two smaller RAID5 extents rather than widening it. Where this breakpoint is would be a matter of taste, but probably the smallest breakpoint for RAID5 would be the conversion from 5 to 6 drives. Before the fracture you would have 4+1, and after the fracture you would have 2+1 & 2+1. The annoying part about this fracture operation is that it doesn't actually expand the available data space (2+2=4), because the new drive is consumed by a new checksum (from +1 to +1&+1).
(Diagram: six drives, sda through sdf, after a fracture. Two separate RAID5 regions sit side by side, one on the first three drives and one on the last three, with unused space below.)
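To make the capacity arithmetic concrete, here is a tiny sketch (my own, with made-up names) comparing widening a five-drive RAID5 to six drives against fracturing it into two three-drive RAID5s:

```python
def raid5_data_disks(total_disks):
    """A RAID5 across n drives holds n-1 drives' worth of data plus one of checksums."""
    return total_disks - 1

before = raid5_data_disks(5)                           # 4 (the original 4+1)
widened = raid5_data_disks(6)                          # 5 (a single 5+1)
fractured = raid5_data_disks(3) + raid5_data_disks(3)  # 4 (2+1 and 2+1)

print(before, widened, fractured)  # 4 5 4 -- the fracture gains no data space
```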
The important thing motivating a fracture is that the number of logical blocks affected by a potential hard drive failure would be reduced: lose a drive in a single six-drive RAID5 and every block in the extent is running degraded, whereas after a fracture into two three-drive RAID5s only about half of them are. Actually planning the block copies for a fracture of a live extent is an interesting problem; I have not given it much thought.
If you need more RAID space than you can allocate from the contiguous grey area, you can shuffle some plain extents from the beginning of the green (plain) section onto the end of sdf.
If you are not in immediate need of disk space, you could just leave the plain extents where they are and allow new plain extents to be allocated at the tail end of sdf. This strategy works best if you are routinely allocating new plain extents (backups, periodic archives, whatever).
The RAID chunks at the beginning of the disks would be roughly analogous to physical volumes (PVs) in an LVM system. Each would include a list of the disks involved and the block ranges used on each. The metadata would also have to be capable of representing both expand+pack and fracture transition states in a journalled fashion; a fracture would result in the creation of two PVs.
Since RAID5 is designed to protect against single-drive failure, you could keep copies of the metadata on just two drives; if both of those drives fail you are boned anyway. Alternatively, disk is relatively cheap, so you might as well keep copies on all of the involved disks, if only to document your boned-ness.
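A rough sketch of what such a PV-like metadata record might carry; every name here is hypothetical, but the essential pieces are the RAID personality, the per-disk block ranges, and a journalled record of any in-flight expand+pack or fracture:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DiskRange:
    disk: str    # e.g. "sdb"
    start: int   # first block of this PV on that disk
    end: int     # last block of this PV on that disk

@dataclass
class Transition:
    kind: str                  # "expand+pack" or "fracture"
    target: List[DiskRange]    # the block ranges being written to
    journal_cursor: int = 0    # how far the block copies have progressed

@dataclass
class RaidPV:
    name: str                                   # e.g. "pvr5_6"
    level: str                                  # "raid5", "mirror", "stripe", "plain"
    ranges: List[DiskRange] = field(default_factory=list)
    in_flight: Optional[Transition] = None      # journalled transition state

    def member_disks(self):
        return [r.disk for r in self.ranges]

# A fracture would retire one RaidPV record and create two new ones; copies of
# the records would live on at least two (or all) of the member disks.
```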
The end of each disk can be imagined as its own PV with logical volumes allocating chunks from within each PV.
Creating a new 300 GB RAID5 PV across the five drives and putting an ext3 home filesystem on it might look something like this:

```
# pvqcreate -l 5 -n pvr5_6 -L 300G --vg mg00 /dev/sd[abcde]5
# lvqcreate -n lv_homes5 --all mg00 pvr5_6
# mkfs.ext3 /dev/mg00/lv_homes5
# mount /dev/mg00/lv_homes5 /homes5
```
Extending an existing volume by allocating a new 50 GB RAID10 PV from the same drives might look like this:

```
# pvqcreate -l 1+0 -n pvr1_7 -L 50G --vg mg00 /dev/sd[abcde]5
# lvqextend --all pvr1_7 mg00/lv_homes2
# resize2fs /dev/mg00/lv_homes2
```
Adding the new drive sdf and expanding the existing PVs onto it might look like this:

```
# pvqextend mg00 --old /dev/sd[abcde]5 --new /dev/sdf5 --pvs pvr1_0 pvr1_1 pvr5_2 pvr5_3 pvr5_4 pvr5_5 pvr5_6 pvr1_7
```
All of the PVs listed at the end of that command which have space on each of sd[abcde]5 would be expanded to include sdf5.
The syntax seems complicated because I have to consider the case that some PVs will not span all five drives; those should not be expanded onto /dev/sdf by that command.
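The selection rule itself could be as simple as the following sketch (hypothetical names, and the example PV layouts are invented): only PVs that already occupy every one of the old partitions get expanded onto the new one.

```python
def expandable_pvs(pvs, old_partitions, requested_names):
    """Pick the PVs named on the command line that span every old partition.

    pvs: mapping of PV name -> list of partitions the PV currently occupies
    old_partitions: e.g. ["sda5", "sdb5", "sdc5", "sdd5", "sde5"]
    requested_names: the PV names listed at the end of the pvqextend command
    """
    old = set(old_partitions)
    return [name for name in requested_names if old.issubset(pvs[name])]

pvs = {
    "pvr5_6": ["sda5", "sdb5", "sdc5", "sdd5", "sde5"],  # spans all five: expand
    "pvr1_7": ["sda5", "sdb5", "sdc5", "sdd5", "sde5"],  # spans all five: expand
    "pvr1_0": ["sda5", "sdb5"],                          # only two drives: skip
}
old = ["sda5", "sdb5", "sdc5", "sdd5", "sde5"]
print(expandable_pvs(pvs, old, ["pvr1_0", "pvr1_7", "pvr5_6"]))  # ['pvr1_7', 'pvr5_6']
```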
The following command shuffles some plain extents (the LVs whose names begin with bk_) onto the new drive:
```
# pvqmove --pack-plain mg00 /dev/sd[abcdef]5 -- `lvqdisplay -c | awk -F: '{print $1}' | grep /bk_`
```
The --pack-plain wizard of the pvqmove command would find all plain PVs (the non-RAID extents, which live at the ends of the disks) in mg00 on the six drives and, one by one, relocate the one closest to the beginning of a disk into a free chunk near the end of any disk. It would stop as soon as a relocation would cause the relocated chunk to start earlier than its existing location.
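A sketch of that greedy loop (again with invented names and a deliberately simplified model of free space), packing single-disk plain PVs toward the ends of the drives:

```python
def pack_plain(plain_pvs, free_chunks):
    """Greedy relocation loop behind a hypothetical `pvqmove --pack-plain`.

    plain_pvs:   {name: (disk, start_block, size)} for single-disk plain PVs
    free_chunks: list of (disk, start_block, size) free regions near disk ends
    Returns the relocations performed as (name, old_location, new_location).
    """
    moves = []
    while plain_pvs:
        # The plain PV that currently starts earliest on its disk.
        name = min(plain_pvs, key=lambda n: plain_pvs[n][1])
        disk, start, size = plain_pvs[name]

        # The free chunk that starts latest on any disk and can hold the PV
        # (a stand-in for "closest to the end of a disk").
        candidates = [c for c in free_chunks if c[2] >= size]
        if not candidates:
            break
        tgt = max(candidates, key=lambda c: c[1])

        # Stop as soon as a relocation would not move the chunk later.
        if tgt[1] <= start:
            break

        moves.append((name, (disk, start), (tgt[0], tgt[1])))
        free_chunks.remove(tgt)
        free_chunks.append((disk, start, size))   # the old location becomes free
        del plain_pvs[name]
    return moves

# Example: one plain backup PV at the front of sda5, free space at the end of sdf5.
pvs = {"bk_monthly": ("sda5", 0, 1_000)}
free = [("sdf5", 90_000, 5_000)]
print(pack_plain(pvs, free))  # [('bk_monthly', ('sda5', 0), ('sdf5', 90000))]
```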
While there are commercial products which do this for large organizations right now, it might be hard to accomplish with off-the-shelf parts. I imagine a tower case with 8 drives inside, plus four one-lane PCI Express eSATA cards with four ports each; those sixteen ports could support eight Sans Digital TR8M chassis with 8 drives apiece, giving you 72 drives in total. Start filling up the plain PCI slots on your motherboard and you could break 100.
In my personal situation I have a TR4U with a total of 1500 GB of disk inside it. I could replace the entire unit today with a single hard drive that costs less than the empty chassis.
The point at which it's cheaper to retire a disk drive than buy more chassis will vary based on how fast you buy disk. For me the horizon is probably 8 disks.
I would recommend that you not make a volume group that spans two chassis. If a power supply or cable failed on one, it would be quite inconvenient to have to leave otherwise healthy disks off-line because they were part of a RAID5 that included the crippled chassis.