ZFS Performance Focused Parameters

We've recently gotten some significantly larger storage systems, and after running some 50T pools with basically all the defaults it felt like time to dig into which options are commonly used to chase performance. The intended use for these systems is ultimately CIFS/NFS targets for scientists who are running simulations that generate small (1M) to large (100G) files. I'm not being scientific or offering any benchmarks, just digging into documented performance parameters and explaining the rationale.

Alignment Shift:
Alignment Shift needs to be set properly for each vdev in a pool.

Configuring ashift correctly is important because partial sector writes incur a penalty: the sector must be read into a buffer, modified, and then written back in full.

You should query your drives to find out whether they are Advanced Format (4K, 8K) and set the ashift parameter to 12 or 13 respectively, since ashift is the base-2 logarithm of the sector size (2^12 = 4096, 2^13 = 8192).

cat /sys/class/block/sdX/queue/physical_block_size
cat /sys/class/block/sdX/queue/logical_block_size
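As a rough sketch of how the reported block size maps to an ashift value, the helper below computes the base-2 logarithm of a size in bytes. The function name ashift_for_block_size is my own invention and not part of any ZFS tooling, and it assumes the drive reports its physical sector size honestly, which not all drives do.

```shell
# Hypothetical helper (not a ZFS command): print the ashift value, i.e. the
# base-2 logarithm, for a physical block size given in bytes.
ashift_for_block_size() {
    size=$1
    shift_bits=0
    # Reject anything that is not a plain positive integer.
    case $size in
        ''|*[!0-9]*) echo "not a block size: $1" >&2; return 1 ;;
    esac
    if [ "$size" -le 0 ]; then
        echo "not a block size: $1" >&2
        return 1
    fi
    while [ "$size" -gt 1 ]; do
        # A sector size is a power of two, so it must halve evenly each time.
        if [ $((size % 2)) -ne 0 ]; then
            echo "not a power of two: $1" >&2
            return 1
        fi
        size=$((size / 2))
        shift_bits=$((shift_bits + 1))
    done
    echo "$shift_bits"
}

ashift_for_block_size 512    # prints 9
ashift_for_block_size 4096   # prints 12
ashift_for_block_size 8192   # prints 13
```

On a real system you would feed it the sysfs value, for example ashift_for_block_size "$(cat /sys/class/block/sdX/queue/physical_block_size)", and treat the result as a lower bound rather than gospel.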

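You can find many reports of significant performance impact from people who ignored ashift or ended up with a value too small for their drives (for example ashift=9 on 4K drives). Note that ashift=0, the default, asks ZFS to auto-detect the sector size, which goes wrong when a drive misreports it. There is an overhead to setting this parameter: you will have a reduction in raw available storage from each device that is advanced format.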

Spares and Autoreplace:
If you add spares to the pool you can tell ZFS to automatically resilver onto those spares. There is some discussion of how this works in ZoL.

RAIDZ(1,2,3), Mirrored RAIDZ(1,2,3), or Striped Mirrors (RAID10):
There are some good write-ups that get into comparative performance, but the major items to consider here are:

With growth, you have to anticipate your ability to provision more vdevs for a system. It is possible to have heterogeneous vdevs within a pool (however I'm unsure of the use case for this, as ZFS will favor the larger vdev and your pool will become imbalanced), but generally you'd increase a pool by adding vdevs. If we assume homogeneous vdevs, then the vdev size is your unit of growth. Choosing 12 drive vdevs means that growth requires you to purchase 12 drives. In a professional environment this isn't a terrible pressure, but in personal use I've rarely exceeded 8 drive vdevs because I'm then forced to purchase 8 more drives if I want to grow or replace the pool.

With resilvering, you have to consider how tolerant your ecosystem is to downtime. It can take a long time for a large raidz(1,2,3) vdev to resilver after a device failure. If you're not prepared to tolerate 36-48 hours of degraded performance or downtime, you should not be wading into raidz.

For these large systems we're initially provisioning 12 drives with two hot spares. We're pretty unwilling to tolerate downtime in excess of a single day and we're generally planning to grow by 2-12 drive increments. Based on this we're going to create a striped mirror as it offers us the best possible read performance, scrub/resilver performance, and simplest growth strategy.
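To make the striped mirror layout concrete, here is a small sketch that turns a flat list of disks into the "mirror A B mirror C D ..." arguments that zpool create and zpool add expect. The helper name mirror_vdev_args is hypothetical, it pairs disks purely in the order given, and it assumes device paths without spaces.

```shell
# Hypothetical helper (not a ZFS command): pair up a list of disks into
# mirror vdev arguments, in the order the disks are given.
mirror_vdev_args() {
    if [ $(( $# % 2 )) -ne 0 ]; then
        echo "need an even number of disks, got $#" >&2
        return 1
    fi
    args=""
    while [ $# -gt 0 ]; do
        # Each pair of disks becomes one two-way mirror vdev.
        args="$args mirror $1 $2"
        shift 2
    done
    # Strip the leading space left by the first append.
    echo "${args# }"
}

mirror_vdev_args /dev/mapper/disk0 /dev/mapper/disk1 /dev/mapper/disk2 /dev/mapper/disk3
# prints: mirror /dev/mapper/disk0 /dev/mapper/disk1 mirror /dev/mapper/disk2 /dev/mapper/disk3
```

With 12 drives this yields six mirror vdevs striped together, and growing the pool later means adding one more pair at a time.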

Compression:
Compression is a sort of "free lunch" in ZFS. There are some discussions online where compression can become a complication in relation to distributing parity for raidz schemes; however, I've not found a strong case against generally using compression, and instead you find a deluge of recommendations for it. LZ4 is the default compression algorithm in OpenZFS (it is what compression=on selects) and it is incredibly fast.

Normalization:
This property indicates whether a file system should perform a Unicode normalization of file names whenever two file names are compared, and which normalization algorithm should be used. There are performance benefits to be gained when ZFS is doing comparisons that involve Unicode equivalence. This is a less documented parameter that shows up often in discussion. The mailing list for ZoL has some discussion about it, however I've been unable to dig up anything better than Richard Laager's reply. It appears that most people are selecting formD so we'll throw our lot in with that as a more heavily tested pathway.

Prevent Mounting of the root dataset:
You generally never want to use the root dataset to store anything, because some dataset options can never be disabled once set. If you enable such an option on the root dataset and you have files there, you'd have to destroy the entire pool to clear it, instead of destroying just a dataset. You can prevent this by specifying -O canmount=off. If the canmount property is set to off, the file system cannot be mounted by using the zfs mount or zfs mount -a commands. Setting this property to off is similar to setting the mountpoint property to none, except that the dataset still has a normal mountpoint property that can be inherited.

Extended File Attributes:
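ZFS can greatly benefit from setting the xattr property to sa, which stores extended attributes as system attributes alongside the file's metadata rather than in hidden per-file directories. There is some discussion here of its implementation in ZoL. Further discussion here with a good quote: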

The issue here appears to be that -- under the hood -- Linux doesn't have a competent extended attribute engine yet. It's using getfactl/getxattr/setfacl/setattr; the native chown/chmod utilities on Linux don't even support extended attributes yet. Andreas Gruenbacher -- the author of the utilities -- clearly did a great job implementing something that doesn't have great support in the kernel.

Access Time Updates:
Some people turn atime off. We will instead leverage relatime to make ZFS behave more like ext4/xfs, where access time is only updated if it is earlier than the modified or changed time, or if it has not been updated in the past 24 hours.
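The relatime rule is easier to see as a decision function. This is only a sketch of the rule as the zfsprops documentation words it (update when the old access time is earlier than the modify or change time, or is more than 24 hours old). The function name is mine, timestamps are whole epoch seconds, and the real kernel code may behave differently right at the boundaries.

```shell
# Sketch of the documented relatime rule, not the actual kernel logic.
# Arguments are epoch seconds: atime mtime ctime now.
# Returns 0 (true) when the access time would be updated.
relatime_would_update() {
    atime=$1 mtime=$2 ctime=$3 now=$4
    # The previous access is older than the last modify or change.
    if [ "$atime" -lt "$mtime" ] || [ "$atime" -lt "$ctime" ]; then
        return 0
    fi
    # The previous access is at least 24 hours (86400 seconds) old.
    if [ $((now - atime)) -ge 86400 ]; then
        return 0
    fi
    return 1
}

# File written after it was last read: atime gets updated.
relatime_would_update 998000 999000 999000 1000000 && echo "update"
# File already read since the last write, and recently: atime is left alone.
relatime_would_update 999500 999000 999000 1000000 || echo "skip"
# File untouched, but the last read was more than a day ago: atime gets updated.
relatime_would_update 900000 800000 800000 1000000 && echo "update"
```

The practical upshot is that repeated reads of the same unchanged file cost at most one atime write per day instead of one per read.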

Recordsize:
Most of the discussion for recordsize focuses on tuning for database interactions. We're providing this storage to users who will be writing primarily large files. The idea with recordsize is you want to set it as high as possible while simultaneously avoiding two things:

  • read-modify-write cycles: where ZFS has to write a chunk that is smaller than its recordsize. For example, if your ZFS recordsize is set to 128k and InnoDB wants to write 16k, ZFS will have to read the entire 128k, modify some 16k within that 128k, then write back the entire 128k.
  • write amplification: when you're doing read-modify-write you're writing far more than you intended to. For the above example it's a factor of eight (128k written for 16k of changes).
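The arithmetic behind that factor of eight can be sketched as below. The helper name write_amplification is made up for illustration. It works in whole KiB with integer division, and it assumes the write starts on a record boundary and ignores compression, so treat the result as a ballpark figure.

```shell
# Hypothetical helper: how many times more data gets rewritten than the
# application asked to write. Arguments: recordsize and write size, in KiB.
# Assumes the write starts on a record boundary and ignores compression.
write_amplification() {
    recordsize_k=$1
    write_k=$2
    # Round up to whole records, since records are rewritten in full.
    records=$(( (write_k + recordsize_k - 1) / recordsize_k ))
    # Integer division: the result is truncated to a whole number.
    echo $(( records * recordsize_k / write_k ))
}

write_amplification 128 16   # prints 8
write_amplification 16 16    # prints 1
```

Matching recordsize to the application's write size (the second call) is what removes the amplification, which is why the database tuning guides care so much about this property.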

There is a good discussion about this with Allan Jude. recordsize should really be set per dataset, as datasets should be created for specific purposes. There is a discussion here, from which I used the following:

find . -type f -print0 \
 | xargs -0 ls -l \
 | awk '{ n = int(log($5)/log(2));   # power-of-two bucket for the file size
          if (n < 10) n = 10;        # lump everything under 1k into the 1k bucket
          size[n]++ }
      END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' \
 | sort -n \
 | awk 'function human(x) { x[1] /= 1024;      # scale down by 1024 per unit step
                            if (x[1] >= 1024) { x[2]++;
                                                human(x) } }
        { a[1] = $1;
          a[2] = 0;
          human(a);
          printf("%3d%s: %6d\n", a[1], substr("kMGTPEZY", a[2]+1, 1), $2) }'

Which produced this for a dataset:

  1k:  77526
  2k:  26490
  4k:  26354
  8k:  35760
 16k:  15681
 32k:  12206
 64k:   8606
128k:   8740
256k:  12421
512k:  19919
  1M:  15813
  2M:  10342
  4M:  13070
  8M:   7604
 16M:   2981
 32M:    988
 64M:   1062
128M:    560
256M:    711
512M:    498
  1G:    107
  2G:     17
  4G:     17
  8G:      6
 16G:      3
 32G:      4
 64G:      2

In this case I'm going to leave recordsize at the default 128k and pray that compression saves me.
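To put a number on a histogram like the one above, a small awk wrapper can report what share of files falls in buckets below a given size. This is only a sketch: share_below is a name I made up, it expects lines in the "128k:   8740" format on stdin, it only understands the k, M, G and T suffixes, and it counts files rather than bytes.

```shell
# Hypothetical helper: read histogram lines such as "128k:   8740" on stdin
# and print the percentage of files in buckets below the given limit.
# The limit uses the same labels as the histogram, e.g. 128k.
share_below() {
    awk -v limit="$1" '
        # Convert a label like "64k" or "2M" into a number of KiB.
        function to_kib(label,    n, unit) {
            n = label + 0                          # leading digits
            unit = substr(label, length(label))    # trailing unit letter
            return n * 1024 ^ (index("kMGT", unit) - 1)
        }
        {
            sub(/:$/, "", $1)                      # drop the colon after the label
            total += $2
            if (to_kib($1) < to_kib(limit)) below += $2
        }
        END { if (total > 0) printf("%.1f\n", 100 * below / total) }
    '
}

printf '%s\n' ' 64k:   300' '128k:   100' '  1M:   100' | share_below 128k   # prints 60.0
```

Fed the output above with a limit of 128k, this comes out at a little over two thirds of the files by count, so most files here are smaller than a single default record even though the big files hold most of the bytes.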

Final Result:
You specify pool options with -o and root dataset options with -O:

zpool create -o ashift=12 -o autoreplace=on -O canmount=off -O mountpoint=/tank -O normalization=formD -O compression=lz4 -O xattr=sa -O relatime=on tank mirror /dev/mapper/disk0 /dev/mapper/disk1

Expanding that out:

zpool create \
-o ashift=12 \
-o autoreplace=on \
-O canmount=off \
-O compression=lz4 \
-O normalization=formD \
-O mountpoint=/tank \
-O xattr=sa \
-O relatime=on \
tank mirror /dev/mapper/disk0 /dev/mapper/disk1

Adding another mirror vdev:

zpool add -o ashift=12 tank mirror /dev/mapper/disk2 /dev/mapper/disk3

Adding a hot spare:

zpool add -o ashift=12 tank spare /dev/mapper/disk4