How I learned to stop worrying and love dual parity RAID
It’s been assumed for a long time, by myself and others, that RAID 1+0 is the obvious choice for enterprise storage from a performance and reliability point of view. I’ve recently had cause to review this opinion and (re)read some articles about the subject.
Read on for links to some interesting articles and my conclusions.
First a quick word about terminology. Dual parity RAID is often implemented as RAID 6, which uses two disks' worth of parity rather than the single disk's worth in RAID 5. I’m going to look at articles regarding two vendor implementations of dual parity RAID that differ from traditional RAID 6 in a number of ways: NetApp’s RAID-DP and Sun’s raidz2.
I came across this article from one of Sun’s engineers some time ago, and the project I’m working on caused me to review it again. The article compares two Mean Time To Data Loss (MTTDL) models: a ‘traditional’ model, which takes Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR) into account, and a second, more interesting one that takes into account the unrecoverable error rate (UER).
It is the UER that I find particularly interesting: this is the rate at which reading a bit results in an unrecoverable error. In a system with single redundancy, such as RAID 1+0 or RAID 5, you need the surviving mirror or parity data to be 100% readable in the event of a disk failure; otherwise you’re going to be looking at data loss, which will inevitably be in the worst location.
Disk sizes continue their march towards ever larger capacities (a possible move to smaller, faster SSD drives excepted). However, the UER has remained fairly flat. The Sun blog linked above has this to say:
Typically, UER will be 1 per 10^14 bits read for consumer class drives and 1 per 10^15 for enterprise class drives. This can be alarming, because you could also say that consumer class drives should see 1 UER per 12.5 TBytes of data read. Today, 500-750 GByte drives are readily available and 1 TByte drives are announced. Most people will be unhappy if they get an unrecoverable read error once every dozen or so times they read the whole disk.
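The arithmetic behind that quote is easy to check. Here is a minimal Python sketch; the function names are my own, and it assumes decimal units (1 TByte = 10^12 bytes), which is how drive vendors usually quote capacity.

```python
def tbytes_per_error(bits_per_error: float) -> float:
    """Average TBytes read per unrecoverable error (decimal units assumed)."""
    bytes_per_error = bits_per_error / 8
    return bytes_per_error / 1e12


def full_reads_per_error(bits_per_error: float, disk_tbytes: float) -> float:
    """Average number of whole-disk reads per unrecoverable error."""
    return tbytes_per_error(bits_per_error) / disk_tbytes


if __name__ == "__main__":
    # Consumer class: 1 error per 10^14 bits read.
    print(tbytes_per_error(1e14))            # 12.5
    # Enterprise class: 1 error per 10^15 bits read.
    print(tbytes_per_error(1e15))            # 125.0
    # A 1 TByte consumer drive: roughly a dozen full reads per error.
    print(full_reads_per_error(1e14, 1.0))   # 12.5
```

So the "once every dozen or so times" figure in the quote corresponds to a consumer class drive of about 1 TByte.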
This NetApp white paper about their implementation of RAID-DP has this to say:
As disk drives have gotten larger, their reliability has not improved, and, more importantly, the bit error likelihood per drive has increased proportionally with the larger media.
As we push drive sizes higher and higher, we increase the likelihood of finding an unreadable bit, and have to consider the real possibility that we cannot recover a volume that was protected with single parity.
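To put a rough number on that risk, here is a back-of-the-envelope Python sketch. It is not taken from either linked article: it assumes that every bit read fails independently at the quoted UER (real errors tend to be correlated), that a single parity rebuild has to read every surviving disk in full, and that capacities are in decimal TBytes. The function names are my own.

```python
import math


def p_read_error(bytes_read: float, bits_per_error: float) -> float:
    """Probability of at least one unrecoverable error while reading
    bytes_read bytes, assuming independent errors at 1 per bits_per_error."""
    bits = bytes_read * 8
    # 1 - (1 - 1/bits_per_error)^bits, computed in a numerically stable way.
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))


def p_rebuild_hits_error(surviving_disks: int, disk_tbytes: float,
                         bits_per_error: float) -> float:
    """Probability that a single parity rebuild meets an unreadable bit,
    assuming all surviving disks must be read in full."""
    return p_read_error(surviving_disks * disk_tbytes * 1e12, bits_per_error)


if __name__ == "__main__":
    # Hypothetical 8 disk group of 1 TByte drives with one disk failed.
    print(p_rebuild_hits_error(7, 1.0, 1e14))  # consumer class, about 0.43
    print(p_rebuild_hits_error(7, 1.0, 1e15))  # enterprise class, about 0.05
```

Under those assumptions, the consumer class example has a better than 40% chance of hitting an unreadable bit during the rebuild, which is exactly the situation where single parity has nothing left to fall back on.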
So what can we do about it? A three-way mirror is an option, but try explaining the cost of that to your customer. A more practical solution is to consider double parity RAID. With this configuration you can lose an entire drive, and if you then hit an unreadable bit you still have one more location to retrieve that data from before it is lost.
“But what about write performance?” I hear you cry! In a traditional RAID 5/6 implementation, writes that do not cover a full stripe can cause performance to tank considerably, increasingly so as you add more spindles. The reason is that for a partial stripe write, the section of the stripe that is not being written to must first be read from disk in order for the parity to be calculated. Only then can the new data and parity be written out to disk.
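The cost of a partial stripe write can be illustrated with a toy model. The Python sketch below is my own illustration, not any vendor's implementation: it uses a single XOR parity block per stripe, holds the "disks" in memory, and simply counts the reads needed before parity can be recomputed. Real arrays may instead read the old data and old parity, but the extra I/O is the point either way.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class Stripe:
    """One stripe of a toy single parity array; disk reads are counted."""

    def __init__(self, data_blocks):
        self.data = list(data_blocks)
        self.parity = xor_blocks(self.data)
        self.reads = 0

    def write_full(self, new_blocks):
        # Every data block is already in hand, so parity needs no reads.
        self.data = list(new_blocks)
        self.parity = xor_blocks(self.data)

    def write_partial(self, index, new_block):
        # The untouched blocks must be read back before parity can be computed.
        untouched = []
        for i, block in enumerate(self.data):
            if i != index:
                self.reads += 1
                untouched.append(block)
        self.parity = xor_blocks(untouched + [new_block])
        self.data[index] = new_block


if __name__ == "__main__":
    stripe = Stripe([b"\x01\x02", b"\x04\x08", b"\x10\x20"])
    stripe.write_partial(1, b"\xff\x00")
    print(stripe.reads)  # 2 extra reads for a one-block update
    stripe.write_full([b"\xaa\xaa", b"\xbb\xbb", b"\xcc\xcc"])
    print(stripe.reads)  # still 2: the full stripe write needed none
```

In this model the number of extra reads grows with the number of data disks in the stripe, which matches the observation that the penalty gets worse as you add spindles.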
ZFS’s raidz2 solves this by making every write a full stripe write. It achieves this by dynamically altering the width of the stripe to match the write being made. This also neatly works around the RAID 5/6 write hole. Have a read of this document if you’d like more info about this.
NetApp’s RAID-DP is actually an extension of RAID 4 rather than RAID 5. This means that there are two dedicated disks for parity, rather than a distributed layout. The aforementioned white paper has more detail on the specific layout used. NetApp also uses a file system called WAFL. I’ve been unable to easily find any documentation on how it affects partial stripe write performance; please comment if you know of any.
The whole issue with partial stripe write is obviated somewhat in an environment where you are supporting a single application, such as Oracle, where the block size is known and configurable. In this instance the Oracle block size can be configured in partnership with the storage to eliminate (or greatly reduce) the need for partial stripe writes.
Read performance from a raidz2 device is not as great as you might imagine. Typically it will be limited to the performance of a single disk, for reasons that are explained in this article.
I expect that the read performance of RAID-DP matches that of a traditional RAID4/5 system much more closely, i.e. that reads can be striped across all the spindles. However I don’t have any stats to back this up.
I would like to have the opportunity to benchmark the systems against each other with a realistic workload, but at the moment I don’t have that option. For me the increased data reliability of a dual parity system, when faced with an increasing likelihood of unrecoverable errors per disk, is enough to convert me, even if it means a loss of some performance. Provided the performance the array can deliver is adequate, I see dual parity RAID as worthy of significant consideration.