This how-to will show you how to:
- Install ZFS via the ZFS on Linux project
- Create and administer your ZFS data pools
- Monitor disk health
Build considerations & preparation
Hardware plays a large role in the performance and integrity of your ZFS file server. Although ZFS will function on a variety of commodity hardware, you should consider the following before proceeding:
ECC RAM
The question of whether non-ECC RAM is acceptable gets asked again and again, but the bottom line is that you do need ECC. ZFS does its best to protect your data and ensure its integrity, but it cannot do so if the memory it uses cannot be trusted. ZFS is an advanced filesystem that can self-heal your files when silent bit rot occurs (bit flips on the disk from bad sectors or cosmic rays). When such an error is discovered, ZFS can attempt to self-heal the file. But what if the information on disk is OK and an undetected bit flip occurs in your RAM instead? ZFS could attempt to "self-heal" and actually corrupt your data, because the information it received from RAM was incorrect.
ZFS will run just fine without ECC RAM, but you run the risk of silent data corruption and (although very unlikely) losing the zpool entirely if your metadata gets corrupted in RAM and is then subsequently written to disk. The chance of random bit flips is small, but if your RAM stick is going bad and is riddling your filesystem with errors, you do not want to run the risk of catching that too late and losing everything.
Keep in mind that in order to use ECC RAM, you must buy a motherboard AND a CPU that both support it. There are also buffered (also known as registered) DIMMs and unbuffered DIMMs. Buffered DIMMs tend to be slower and more expensive but scale much better (e.g. a single board could support up to 192GB RAM), while unbuffered ECC RAM tends to be less expensive and performs better but doesn't scale as high (a maximum of 32GB RAM on most current boards).
A more detailed analysis on this topic is available in this FreeNAS forum post.
Sufficient RAM for ARC cache
Conventional wisdom is that you should plan to allocate 1GB of RAM per TB of usable disk space in your ZFS filesystem. ZFS will run on far less (e.g. 4GB), but then you have little space available for your ARC cache and your read performance may suffer. Plan ahead and buy enough RAM from the start, or be sure that you'll be able to get your hands on additional DIMMs if you plan on adding more disks later.
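The rule of thumb above can be sketched as a trivial calculation (the numbers here are placeholders; the extra headroom for the OS is my own assumption, not part of the 1GB/TB rule):

```shell
# Rule-of-thumb sketch: ~1 GB of RAM per TB of usable pool space for the ARC.
# usable_tb is a placeholder -- substitute your own pool size.
usable_tb=8
arc_target_gb=$(( usable_tb ))            # 1 GB of ARC per TB of usable space
min_system_gb=$(( arc_target_gb + 4 ))    # assumption: ~4 GB headroom for OS/services
echo "Plan for at least ${min_system_gb} GB RAM (${arc_target_gb} GB ARC target)"
```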
Choosing a pool type
ZFS offers RAID modes similar to RAID 0, RAID 1, RAID 5 and RAID 6, which ZFS calls stripe, mirror, RAIDZ1 and RAIDZ2 respectively. It also offers a new type, RAIDZ3, which one-ups RAID 6 and can tolerate three disk failures.
If you are unsure which pool type you would like to use, there is a very good and detailed comparison here. As the article points out, if you can afford it, striped mirrors (mirrored disks combined into a pool - effectively a RAID 0 of several groups of 2 disks in RAID 1) offer the best performance. However, you'll lose 50% of your usable disk capacity at a minimum, and 66% if you want to be able to sustain two drive losses (which I highly recommend you do).
If you don't mind limiting performance to the equivalent of a single disk, RAIDZ2 is your best choice. It offers at worst a 40% loss in usable disk capacity, and that number shrinks as you add more disks. A RAIDZ2 with 6 disks, for example, only loses 2 of 6 disks to parity (33%). Always remember that RAID is redundancy, not a backup!
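The capacity math for a RAIDZ2 vdev is easy to sketch (disk count and size below are placeholders; this ignores metadata and padding overhead):

```shell
# Sketch: usable capacity of a RAIDZ2 vdev (2 disks' worth lost to parity),
# assuming equal-sized disks and ignoring metadata/padding overhead.
disks=6
disk_tb=4
usable_tb=$(( (disks - 2) * disk_tb ))
overhead_pct=$(( 2 * 100 / disks ))
echo "${usable_tb} TB usable, ${overhead_pct}% lost to parity"
```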
Unrecoverable Read Error (URE)
Consumer hardware has become extremely inexpensive for the capacity it offers, but it's not perfect. All hard disks are manufactured with a specified mean time between failures (MTBF) and non-recoverable bit error rate. MTBF is nothing to worry about, as we can simply swap the disk out for a functioning one when it fails. The point of interest here is the non-recoverable bit error rate, which for consumer disks is typically 1 out of every 10^14 bits read. This means that if you read 10^14 bits from your disk, on average one bit will be unrecoverably unreadable and irreparably lost.
This is a significant problem with modern disk sizes: if a drive in a RAID were to fail and be replaced, several TB of data would be read from multiple disks during the reconstruction process, and there's a significant chance (often above 50% - calculator here) that a single URE will be encountered. In a traditional RAID setup, the controller cannot proceed and reconstruction ends. Your data is lost.
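As a back-of-the-envelope check, the odds of at least one URE during a rebuild can be estimated with a quick calculation (the 12 TB figure is a placeholder, and `ure_odds` is a helper name of my own; it assumes the typical consumer spec of 1 error per 10^14 bits and a Poisson approximation):

```shell
# Sketch: probability of hitting at least one URE while reading `tb` terabytes,
# assuming an error rate of 1 per 1e14 bits (typical consumer-disk spec).
ure_odds() {
  awk -v tb="$1" 'BEGIN {
    bits = tb * 1e12 * 8            # terabytes -> bits
    p = 1 - exp(-bits / 1e14)       # Poisson approximation
    printf "%.0f%%\n", p * 100
  }'
}
ure_odds 12   # reading 12 TB during a rebuild -> roughly 62%
```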
However, because ZFS controls both the filesystem and the disks in a software RAIDZ, it can degrade gracefully should you encounter a URE: it knows exactly where that bit fell. Instead of dropping your array, it simply notifies you which file was lost and moves on with the reconstruction. ZFS is also aware of free space, so it doesn't waste time reconstructing the free space on a replacement disk.
Avoid hardware RAID
Although your hardware may support RAID, do not use it. RAIDZ is a software RAID implementation that works best when ZFS has direct control over your disks. Running ZFS on top of a hardware RAID array eliminates some of the advantages of ZFS, such as being able to gracefully recover from an Unrecoverable Read Error (URE), as described above.
If you want to add additional disks and are looking to buy a PCIe add-in card, ensure that you purchase an HBA (Host Bus Adapter) that will present the disks as JBOD and not a RAID-only controller. An excellent HBA card is the IBM M1015 cross-flashed to IT mode which offers excellent performance for the price.
Optimizing the number of disks
In addition to the above, consider that the number of disks you choose to use in your pool can also have an impact on performance. Adam Nowacki posted this helpful data on the freebsd-fs mailing list (emphasis mine):
Free space calculation is done with the assumption of 128k block size.
Each block is completely independent so sector aligned and no parity
shared between blocks. This creates overhead unless the number of disks
minus raidz level is a power of two. Above that is allocation overhead
where each block (together with parity) is padded to occupy the multiple
of raidz level plus 1 (sectors). Zero overhead from both happens at
raidz1 with 2, 3, 5, 9 and 17 disks and raidz2 with 3, 6 or 18 disks.
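The quoted disk counts can be sanity-checked with a short shell sketch (assumptions: 128k blocks as 256 512-byte sectors, and the two conditions described in the quote - data disks a power of two, and total block-plus-parity sectors divisible by the RAIDZ level plus 1; `zero_overhead` is a helper name of my own):

```shell
# For a given raidz parity level p, list disk counts with zero allocation
# overhead: (disks - p) must be a power of two, and the block plus its parity
# must occupy a multiple of (p + 1) sectors. Assumes 128k blocks = 256 sectors.
zero_overhead() {
  p=$1; sectors=256; out=""
  for n in $(seq 2 18); do
    d=$(( n - p ))                             # number of data disks
    [ "$d" -lt 1 ] && continue
    [ $(( d & (d - 1) )) -ne 0 ] && continue   # power-of-two check
    [ $(( sectors % d )) -ne 0 ] && continue   # stripe divides evenly
    total=$(( sectors + p * sectors / d ))     # data + parity sectors
    [ $(( total % (p + 1) )) -eq 0 ] && out="$out $n"
  done
  echo "raidz$p:$out"
}
zero_overhead 1   # -> raidz1: 2 3 5 9 17
zero_overhead 2   # -> raidz2: 3 6 18
```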
Personally, I recommend RAIDZ2 with 6 disks - it offers a very nice balance between the cost of disks, performance and redundancy.
Install ZFS
sudo yum localinstall --nogpgcheck http://archive.zfsonlinux.org/fedora/zfs-release$(rpm -E %dist).noarch.rpm
sudo yum install zfs
Reboot your machine and you should be ready to create a zpool.
Create the zpool
Now that ZFS is installed, creating the zpool is relatively straightforward. The ArchLinux Wiki ZFS page details several zpool creation examples.
zpool create -f [poolname] [type] [disks]
zfs set atime=off [poolname]
zfs set compression=lz4 [poolname]
Replace [poolname] with the name of your zpool (e.g. "data" or "tank"), [type] with the ZFS pool type (e.g. raidz2) and finally [disks] with the disks you wish to use to create the zpool. There are several ways to specify the disks; see the ZFS on Linux FAQ for how best to choose device names.
Note that the contents of these disks will be erased and ZFS will assume control over the partition table & disk data.
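Putting the three commands together, a six-disk RAIDZ2 pool might look something like this (the pool name "tank" and the by-id device paths are placeholders only - substitute your own device IDs):

```shell
# Hypothetical example -- substitute your own pool name and device IDs.
# Using /dev/disk/by-id paths keeps device naming stable across reboots.
zpool create -f tank raidz2 \
  /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
  /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4 \
  /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6
zfs set atime=off tank
zfs set compression=lz4 tank
zpool status tank   # verify the pool came up healthy
```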
Create one or more datasets
ZFS datasets (or "filesystems") behave like multiple filesystems on a disk would, except they are all backed by the same storage pool. You can divide your pool into several filesystems, each with different options and mountpoints, and the free space is shared among all filesystems on the pool.
zfs create -o casesensitivity=mixed -o mountpoint=/[poolname]/[dataset] [poolname]/[dataset]
Scrub the pool regularly
To ensure all disks are synchronized and to proactively detect any bit rot, you can automatically scrub the disks overnight once a week:
cat << EOF > /etc/cron.d/zfs-check
0 0 * * 0 root /usr/sbin/zpool scrub [poolname]
EOF
Remember to replace [poolname] as per above. Use zpool status -v to get the pool status and display any scrub errors.
Receiving email notifications
Installing an MTA
All of ZFS's fancy data protection features are useless if we cannot respond quickly to a problem. Since Fedora 20 does not include a Mail Transfer Agent (MTA) by default, install one now to ensure we can receive email notifications when a disk goes bad:
yum install postfix
cat << EOF >> /etc/postfix/main.cf
myhostname = yourname.dyndns.org
relayhost = mailserver.com:port
EOF
systemctl enable postfix
systemctl start postfix
You need to configure myhostname to be something valid; in this case, I have chosen a free DynDNS hostname. Most ISPs block port 25, so you will need to use their mail server coordinates for relayhost; alternatively, you can always set up a free GMail account and use it as your relay on an alternate port (e.g. 587).
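As a sketch, relaying through GMail on port 587 would look something like the following in /etc/postfix/main.cf (the password-map location is a common convention, not mandatory; you would also need to create that file with your credentials and run postmap on it):

```
# Hypothetical GMail relay settings for /etc/postfix/main.cf
relayhost = [smtp.gmail.com]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_security_level = encrypt
```

The /etc/postfix/sasl_passwd file would then contain a line like `[smtp.gmail.com]:587 user@gmail.com:password`, hashed with `postmap /etc/postfix/sasl_passwd`.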
Monitoring SMART disk health information
The smartd daemon can monitor your disks' health and notify you immediately should an error turn up.
yum install smartmontools
systemctl enable smartd
Edit /etc/smartmontools/smartd.conf and change the -m root flag to point to the desired email address, for example -m email@example.com. To test whether notifications are working correctly, add the line DEVICESCAN -H -m firstname.lastname@example.org -M test to the configuration and then restart smartd:
systemctl restart smartd
References
- FreeNAS Forums
- ArchLinux Wiki article on ZFS
- Solaris Internals' ZFS Best Practices guide
Comments
zfs record size
There is no need to reduce the record size for lots of small files, as it actually refers to the maximum record size, not the actual record size. In your example, 32k files would still have 32k records. Capping the record size is of more use when you have, say, a database writing in 8k blocks, but multiple blocks at a time. The initial write is fine at 128k, as the 16x8k blocks go into one 128k record. However, if the db wants the 3rd 8k chunk, the whole 128k needs to be read. If that 3rd block is modified, then the whole 128k is read again and then written out again. Not good. So unless the user is updating lots of these 32k files simultaneously, it makes little sense to reduce the record size for the reason you stated.
Thanks for the correction - I've updated my post to remove the recordsize adjustments.
altering the record size is fine, as long as it's done for the correct reasons