ZFS on FreeBSD

https://www.freebsd.org/doc/handbook/zfs.html

The ZFS file system originated on Solaris, and is now well supported on FreeBSD. It combines the roles of file system and volume manager.

ZFS design goals:

Data integrity: every block is checksummed, so silent corruption is detected and, given redundancy, repaired.
Pooled storage: disks are aggregated into a pool, and file systems (datasets) draw space from the pool as needed.
Performance: caching mechanisms (the ARC, L2ARC, and ZIL) improve read and synchronous write performance.

Basic Setup and Examples

In /etc/rc.conf:

zfs_enable="YES"

Start ZFS:

# service zfs start

Create a simple, non-redundant single-disk pool:

# zpool create mypool /dev/da0

This gets mounted as /mypool (which we can see in df or mount). We can create files on it, but we’re not really getting any ZFS benefits yet.

(It’s not best practice to create files directly in the pool. We should keep files in datasets.)
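
For example, a minimal dataset to hold general files (the dataset name here is just an illustration):

# zfs create mypool/data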

Show stats and status:

# zpool list
# zpool status

Create a dataset with compression:

# zfs create mypool/compress
# zfs set compression=gzip mypool/compress

ZFS automatically compresses files written to /mypool/compress/.
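
To see whether compression is paying off, check the read-only compressratio property:

# zfs get compression,compressratio mypool/compress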

Turn off compression:

# zfs set compression=off mypool/compress

Unmount and mount like:

# zfs umount mypool/compress
# zfs mount mypool/compress
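
Where a dataset mounts is itself a per-dataset property; for example (the path here is just an illustration):

# zfs set mountpoint=/archive mypool/compress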

Many ZFS features can be set per dataset. For example, we can keep redundant copies of files in a dataset:

# zfs create mypool/veryimportantfiles
# zfs set copies=2 mypool/veryimportantfiles

Note that mypool/compress and mypool/veryimportantfiles share the same storage pool; df shows each dataset’s own Used but the same Avail for both, because free space is shared pool-wide.
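
zfs list shows the per-dataset space accounting drawn from the shared pool:

# zfs list -r mypool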

Destroy both file systems, then destroy the pool:

# zfs destroy mypool/compress
# zfs destroy mypool/veryimportantfiles
# zpool destroy mypool

ZFS supports snapshotting:

# zfs snapshot mypool/foo@2015-10-30

…where the @ (at sign) is a delimiter separating the file system from the snapshot name. This snapshot shows up under /mypool/foo/.zfs/snapshot/.
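
List existing snapshots like:

# zfs list -t snapshot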

File systems can be rolled back to a snapshot:

# zfs rollback mypool/foo@2015-10-30

Delete snapshots to free space like:

# zfs destroy mypool/foo@2015-10-30
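
A dry run (-n, with -v for verbose output) shows what a destroy would remove and roughly how much space it would free, which is worth doing before deleting snapshots:

# zfs destroy -nv mypool/foo@2015-10-30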

Show IO statistics (average since boot):

% zpool iostat -v

Show IO statistics (current, 5 second intervals):

% zpool iostat -v 5

Pool and vdev Types

A storage pool is made of one or more vdevs. A vdev can be a single disk or a group of disks. Types of vdevs:

disk:    a single physical disk (or partition).
file:    a regular file (useful only for testing).
mirror:  two or more disks holding identical copies of the data.
raidz1, raidz2, raidz3:  RAID-Z with one, two, or three parity disks per vdev.
spare:   a hot spare, held in reserve until a disk in another vdev fails.
log:     a separate device for the ZFS Intent Log (ZIL).
cache:   a device for the L2ARC read cache.

Note that we can’t easily grow a vdev after we create it short of backup/destroy/recreate. Of course, we can create a new vdev and add it to a pool, but the new vdev will need its own redundancy.

Because ZFS stripes data across vdevs in the pool (with the obvious exception of spares, log, and cache vdevs), a pool will not survive the death of a vdev (i.e. redundancy happens at the vdev level, not the pool level).
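
For example, a pool striped across two mirror vdevs, later grown by adding a third mirror (device names are illustrative):

# zpool create mypool mirror /dev/da0 /dev/da1 mirror /dev/da2 /dev/da3
# zpool add mypool mirror /dev/da4 /dev/da5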

The larger a vdev (the size of its disks and, for RAID-Z, the number of disks), the longer it takes to resilver (rebuild), because a resilver has to read back the vdev’s live data to reconstruct the replaced disk.

In most use cases, mirrors will be faster than RAID-Z configurations.

RAID-Z

RAID-Z requires three or more disks. (Sun recommends 3-9 disks per RAID-Z group. For 10+ disks, break the pool into smaller RAID-Z groups. For redundancy across just two disks, use a ZFS mirror.) See Pool and vdev Types above.

RAID-Z comes in three flavors — RAID-Z(1), RAID-Z2, RAID-Z3 — where the number indicates the count of parity disks in the vdev.

Number of disks. Use a power of two for the number of data disks, plus a number of parity disks equal to the RAID-Z level. For example:

RAID-Z:  3, 5, 9,  17, 33 drives
RAID-Z2: 4, 6, 10, 18, 34 drives
RAID-Z3: 5, 7, 11, 19, 35 drives
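
For example, a six-disk RAID-Z2 vdev (four data disks plus two parity; device names are illustrative):

# zpool create mypool raidz2 /dev/da0 /dev/da1 /dev/da2 /dev/da3 /dev/da4 /dev/da5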

(I’ve read that this guideline for the number of disks doesn’t apply when using the large_blocks feature with 1M records rather than 4K sectors, but I don’t know why that would be true.)

(Sun recommended not using more than nine disks for a vdev, but I’m not sure why or if this would still be true. Maybe for performance reasons — see next paragraph.)

Generally, there is a trade-off between IOPS (random I/O speed) and disk space. For better IOPS, use fewer disks per vdev (and, correspondingly, more space lost to parity).
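
For example, twelve disks could form one wide RAID-Z2 vdev (more usable space) or two six-disk RAID-Z2 vdevs (roughly double the IOPS, more space spent on parity). A sketch of the latter, with illustrative device names:

# zpool create mypool raidz2 da0 da1 da2 da3 da4 da5 raidz2 da6 da7 da8 da9 da10 da11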

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

RAID-Z Configuration Trade-Offs

             Space
               ^
              / \
             /   \
Performance <-----> Redundancy

Cache Types

ARC. Adaptive Replacement Cache. The first level of caching stores the most recently used and most frequently used items. The ARC resides in RAM. The ideal amount of RAM depends on the size and busyness of the storage pool, but more is better; a sizing rule of thumb is at least 1 GB of RAM for each TB of storage pool (5 GB of RAM per TB with deduplication), with an absolute minimum of 4 GB (and perhaps a practical minimum of 8 GB) just for ZFS, beyond what the OS and applications need.
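
On FreeBSD the current ARC size is visible via sysctl, and the ARC can be capped with a loader tunable; the exact names vary a little between FreeBSD/OpenZFS versions (vfs.zfs.arc_max on older releases, vfs.zfs.arc.max on newer ones):

# sysctl kstat.zfs.misc.arcstats.size
# sysctl vfs.zfs.arc_max

To cap the ARC at, say, 8 GB, set the tunable (value in bytes) in /boot/loader.conf:

vfs.zfs.arc_max="8589934592"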

L2ARC. The second-level adaptive replacement cache is disk-based (e.g. SSD) overflow for the ARC. If all our drives are SSDs anyhow, there’s not much point. It’s a read cache, so device mirroring isn’t critical for redundancy. L2ARC is a nice option, but it’s still better to have more RAM for the top-level ARC.

# zpool add data cache /dev/ada0p1

ZIL. ZFS Intent Log. For synchronous/confirmed writes (e.g. databases, NFS, some VM workloads); asynchronous/unconfirmed writes never hit the ZIL. A write cache, so device mirroring for redundancy is advisable. A good candidate for a pair of SSDs if the bulk of the pool is spinning rust.

# zpool add data log /dev/ada0p2

(UPDATE: Although it might be best to mirror the ZIL, it’s probably not critical. ZFS only writes to the ZIL for synchronous writes. Even then, the ZIL itself is a failsafe; the data stays in RAM even after being written to the ZIL. In the normal course of events, those writes get written from RAM. The ZIL is only read when an exception, like a power failure, compromises the data in RAM. Sure, if it’s a mission-critical production database, mirror the ZIL; otherwise, it’s not absolutely necessary, especially if the box is on a UPS.)
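
To mirror the log device instead of using a single one (partition names are illustrative):

# zpool add data log mirror /dev/ada0p2 /dev/ada1p2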

Transaction Groups

ZFS holds writes in memory, and flushes them all to disk as part of one transaction group. This happens roughly every five seconds. ZFS does this to minimize slow, random I/O by grouping writes into faster sequential writes.
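
On FreeBSD the flush interval is tunable; the sysctl below is the usual name, though it may differ by version:

# sysctl vfs.zfs.txg.timeout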

Hardware RAID

Don’t use hardware RAID under ZFS. ZFS likes raw disk access, and takes care of its own redundancy.

ECC Memory

Use of ECC RAM is strongly recommended for machines running ZFS, because ZFS assumes it’s getting good data from RAM during writes. There is some dispute about how critical ECC memory is for ZFS, but:

Is ZFS any more susceptible to the perils of non-ECC RAM than other filesystems? Maybe slightly. There are various catastrophic failure scenarios floating around, but the only totally convincing factor I’ve heard is that ZFS holds writes in RAM (transaction groups) slightly longer than many other filesystems, which might make corruption a little more likely.