paulgorman.org/technical

Linux Software RAID with UEFI Boot

(Sept 2016)

Software RAID has become in many ways superior to hardware RAID, but UEFI firmware doesn’t understand software RAID. With Linux md RAID, we can RAID individual partitions rather than entire physical disks, without any performance penalty. So, as long as we use relatively simple RAID configurations (e.g. a RAID 1 mirror), it’s practical to manually maintain redundancy for the non-RAID EFI boot partitions.


UPDATE: Mirrored UEFI Boot Partition

Although installers don’t (or didn’t) make it easy, md can mirror an ESP. The catch is that the firmware itself may write to the partition behind the OS’s back, which would make the mirror inconsistent. Most OS installers refuse to put /boot/efi on an md mirror because of that possibility.
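If we accept that risk, the usual trick is to create the mirror with the old 1.0 metadata format, which stores the md superblock at the end of the partition, so the firmware sees an ordinary FAT filesystem. A minimal sketch, assuming the ESPs are /dev/sda1 and /dev/sdb1 and the array is /dev/md1:

# mdadm --create /dev/md1 --level=mirror --raid-devices=2 --metadata=1.0 /dev/sda1 /dev/sdb1
# mkfs.vfat -F 32 /dev/md1
# mount /dev/md1 /boot/efi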

https://www.reddit.com/r/linuxadmin/comments/8mggnd/what_is_the_state_of_the_art_for_software/
https://unix.stackexchange.com/questions/265368/why-is-uefi-firmware-unable-to-access-a-software-raid-1-boot-efi-partition


Partitioning Disks

Say we have a couple of disks that we want to mirror.

The disks must be GPT partitioned, of course, not MBR. (In parted: do mklabel gpt.) The machine should be booted in EFI mode, not “compatibility” mode.

If using parted, start it with parted -a optimal to ensure alignment with the disk topology. Where we use set 1 or set 2 below, the numbers refer to partition numbers, which we can verify in parted with print. Get device names with print devices, and select a device like select /dev/sdb.

Note that we can specify units in parted like unit MiB, which we may want to do before printing partition or device info.

  1. Create a 400-500 MB partition at the start of both drives, and set its partition type to “EFI boot partition” (one of the filesystem type options in the Debian installer). In parted, the commands would be mkpart ESP fat32 1MiB 512MiB and set 1 boot on. One of these partitions (e.g. /dev/sda1) will get mounted at /boot/efi.

  2. Create a RAID partition on the rest of each disk. The Debian installer helps with creating a RAID. With parted, do mkpart primary 513MiB -0 (a negative number counts back from the end of the device), then set 2 raid on. (See the scripted equivalent after this list.)
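For reference, the whole partitioning can be scripted non-interactively. A sketch, assuming /dev/sda (repeat for each disk; 100% is equivalent to -0, and mkfs.vfat comes from dosfstools):

# parted -s -a optimal /dev/sda mklabel gpt
# parted -s /dev/sda mkpart ESP fat32 1MiB 512MiB
# parted -s /dev/sda set 1 boot on
# parted -s /dev/sda mkpart primary 513MiB 100%
# parted -s /dev/sda set 2 raid on
# mkfs.vfat -F 32 /dev/sda1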

To actually create the RAID, either use the installer tool, or:

# mdadm --create --verbose /dev/md0 --level=mirror --raid-devices=2 /dev/sda2 /dev/sdb2
# mdadm --detail /dev/md0
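On Debian, make sure the new array gets assembled at boot by recording it in mdadm.conf and rebuilding the initramfs. A sketch:

# mdadm --detail --scan >> /etc/mdadm/mdadm.conf
# update-initramfs -u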

Create an EFI boot entry for the other ESP partition(s). Note that the loader path must be quoted, or the shell will eat the backslashes:

# efibootmgr --create --disk /dev/sdb --part 1 -w --label linux-sdb --loader '\EFI\debian\grubx64.efi'

See boot entries:

# efibootmgr -v

Note that efibootmgr is only one way to create boot entries in the machine firmware. Most machines include a utility in the firmware “BIOS” to create such entries, so we could create the boot entries directly in the firmware rather than using efibootmgr.
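A few other efibootmgr operations come in handy when maintaining entries (the entry numbers here are hypothetical; check the efibootmgr output for the real ones):

# efibootmgr                  (list entries and the boot order)
# efibootmgr -b 0003 -B       (delete entry Boot0003)
# efibootmgr -o 0001,0002     (set the boot order)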

Syncing EFI Boot Partitions

Set up a cron job to run the sync script periodically. Something like:

0  18 *   *   7     /root/bin/mirror-efi-partition.sh

Syncing the EFI partitions only becomes important after something (like a kernel update) changes /boot/efi. It may be possible to create a hook in our package manager to handle this, but the cron job is safe enough for the lazy.
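On Debian, for instance, an APT hook could trigger the sync script; a sketch, in a hypothetical /etc/apt/apt.conf.d/99-sync-esp:

DPkg::Post-Invoke { "/root/bin/mirror-efi-partition.sh"; };

This is crude (it fires after every dpkg invocation, not only kernel updates), but harmless.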

The mirror-efi-partition.sh script itself:

#!/bin/sh
# Regenerate the GRUB configuration, reinstall GRUB to the primary ESP,
# then clone the primary ESP (sda1) onto the secondary one (sdb1).
/usr/sbin/update-grub > /dev/null
/usr/sbin/grub-install --recheck --bootloader-id linux > /dev/null
dd if=/dev/sda1 of=/dev/sdb1 bs=1M

Or a root cron job, like:

  0  18 *   *   7     /usr/sbin/grub-install --recheck --bootloader-id linux-sdb /dev/sdb > /dev/null ; /usr/sbin/grub-install --recheck --bootloader-id linux-sda /dev/sda > /dev/null ; /usr/sbin/update-grub > /dev/null

grub-install should create the EFI boot entries itself. We can double-check this like:

#  efibootmgr -v

Example with RAID10

Assume we have a machine with four drives. Each drive starts with a small non-RAID ESP, and a large second partition belonging to an md RAID 10.
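Creating the RAID 10 itself would look something like this sketch (device names assumed):

# mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2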

When we first set up the box, and perhaps periodically or after a system update that affects GRUB, we run a script to make each drive bootable, with its own EFI boot menu entry. Something like:

#!/bin/sh
set -euf

# Look up the PARTUUID of each drive's ESP.
d=$(/bin/lsblk /dev/sdd1 -o PARTUUID -n)
c=$(/bin/lsblk /dev/sdc1 -o PARTUUID -n)
b=$(/bin/lsblk /dev/sdb1 -o PARTUUID -n)
a=$(/bin/lsblk /dev/sda1 -o PARTUUID -n)

# Take the last eight characters of each PARTUUID as a short,
# unique suffix for the EFI boot entry labels.
cd=$(echo "$d" | /usr/bin/rev | /usr/bin/cut --characters=-8 | /usr/bin/rev)
cc=$(echo "$c" | /usr/bin/rev | /usr/bin/cut --characters=-8 | /usr/bin/rev)
cb=$(echo "$b" | /usr/bin/rev | /usr/bin/cut --characters=-8 | /usr/bin/rev)
ca=$(echo "$a" | /usr/bin/rev | /usr/bin/cut --characters=-8 | /usr/bin/rev)

mkdir -p /boot/efi
# Unmount whatever ESP is currently mounted; "|| true" keeps set -e
# from aborting the script when nothing was mounted.
umount /boot/efi || true

# Mount each ESP in turn, and install GRUB to it under its own boot entry.
mount PARTUUID="$d" /boot/efi
/usr/sbin/grub-install --recheck --bootloader-id=debian-"$cd" PARTUUID="$d"
umount /boot/efi

mount PARTUUID="$c" /boot/efi
/usr/sbin/grub-install --recheck --bootloader-id=debian-"$cc" PARTUUID="$c"
umount /boot/efi

mount PARTUUID="$b" /boot/efi
/usr/sbin/grub-install --recheck --bootloader-id=debian-"$cb" PARTUUID="$b"
umount /boot/efi

# Finish with sda1, leaving the usual ESP mounted at /boot/efi.
mount PARTUUID="$a" /boot/efi
/usr/sbin/grub-install --recheck --bootloader-id=debian-"$ca" PARTUUID="$a"

/usr/sbin/update-grub

Finding a Failed Drive

Find the serial number of the failed drive in the RAID, and match it to the physical drive.

# mdadm --detail /dev/md0
# hdparm -i /dev/sda

(Also, mdadm --detail returns a nonzero exit status for an array with any type of problem. By default, on Debian, /etc/cron.daily/mdadm runs mdadm --monitor --scan --oneshot, and reports errors by email.)
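To see serial numbers for all drives at once, something like the following works (smartctl -i, from smartmontools, shows similar per-drive detail):

# lsblk -o NAME,MODEL,SERIAL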

A Refresher on Partition Alignment

Misaligned partitions can significantly degrade performance.

Traditional spinning disks contain multiple platters stacked on a spindle, with a dedicated read/write head for each platter surface. Each platter is divided into many concentric circles, called tracks. Each track is subdivided into a number of sectors, each sector a small slice of the circular track. Traditionally, sectors held 512B of data, although many newer drives use 4KB sectors. All the tracks at the same position on each platter (i.e., the same distance from the spindle) are conceptually grouped as a cylinder.

A sector is the minimum amount of data that can be read/written from/to a disk.

Although solid state drives don’t have physical platters, etc., we still discuss them in terms of sectors.

How does the OS address storage? At one time, CHS (cylinder, head, sector) information, including complications such as variable numbers of sectors per track, was exposed all the way up to the OS; this type of addressing was also called “geometry-based access”. These days, drive firmware abstracts the storage details from the OS as Logical Block Addressing (LBA). With LBA, the OS sees a storage device only as a contiguous series of fixed-size blocks. The various disk bus protocols, such as SATA and SCSI, all communicate with drives using an LBA scheme.

So, until recently, most drives used a sector size of 512 bytes. The newer Advanced Format standard accommodates today’s much larger storage capacities with a standard sector size of 4096 bytes. The physical sector size of a disk is also called its natural block size (NBS). Confusingly, some new disks with a 4KB physical sector size report a 512B logical sector size for compatibility (so-called “512e” drives).
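We can query a drive for both sizes; for example:

# blockdev --getss /dev/sda      (logical sector size)
# blockdev --getpbsz /dev/sda    (physical sector size)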

The block size defined for volumes created by storage subsystems, like iSCSI LUNs, need not match the block size of the underlying physical storage. For LUNs, select a block size based on the application (e.g., 4-8KB for online transaction processing, or 64-128KB for sequential/streaming workloads).

An OS groups blocks of underlying physical storage into partitions (called “slices” on some OS’s). The old MBR partition scheme supports up to four primary partitions per drive, and defines each partition in terms of its start location and length (which are, in turn, defined by Cylinder-Head-Sector addresses). Note that MBR counts starting from one rather than zero. The newer GPT partition scheme supports more partitions than MBR, and partitions larger than 2TB. GPT uses LBA rather than CHS addressing to define partition locations, and counts from zero rather than one.

The key idea is for the start of each partition/slice to align with the start of a physical sector (the NBS) of the disk. This gets slightly more complicated as we add storage layers; for example, a RAID stripe size should align with the native block size.

A further complication is that many SSDs only erase data in 256KB or 512KB blocks.

Many modern OS’s side-step these complications by simply starting the first partition at a 1MiB boundary (i.e., 4096B times 256, or 512B times 2048), which accommodates both 512B and 4KB sectors, plus various SSD idiosyncrasies.

On a misaligned partition, reads and writes unnecessarily cross sector boundaries, which can effectively double I/O load and halve real-world performance.

Check partition alignment with:

# fdisk -lu
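parted can also verify the alignment of a particular partition; for example, to check the first partition on /dev/sda:

# parted /dev/sda align-check optimal 1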

See also the output of:

$ cat /sys/block/sda/queue/optimal_io_size
$ cat /sys/block/sda/alignment_offset
$ cat /sys/block/sda/queue/physical_block_size

where (optimal_io_size + alignment_offset) / physical_block_size should give an ideal starting sector for the first partition. (Note that many devices report an optimal_io_size of 0.)
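As a quick sketch of that arithmetic (falling back to a 1MiB boundary when optimal_io_size is 0, as many plain disks report):

#!/bin/sh
# Suggest a starting sector for the first partition on /dev/sda.
opt=$(cat /sys/block/sda/queue/optimal_io_size)
off=$(cat /sys/block/sda/alignment_offset)
pbs=$(cat /sys/block/sda/queue/physical_block_size)
if [ "$opt" -gt 0 ]; then
    echo $(( (opt + off) / pbs ))
else
    echo $(( 1048576 / pbs ))    # fall back to a 1MiB boundary
fi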

The parted utility is pretty good about automatically aligning partitions, particularly when started with the -a optimal (--align optimal) flag.