A couple of years ago I was lucky enough to buy a second-hand hardware RAID-5 card (an Areca 1220) for a very low price. Since then I have used the card with several sets of drives (originally 300GB, then 750GB and currently 1TB) in a dedicated server PC as a large network file store for the family’s music, photos, videos and back-ups.
Before I obtained the hardware card I tried software RAID, but found the results very disappointing: the server has a low-power, single-core CPU which isn’t really up to the task of acting as a RAID-5 engine. I’ve heard plenty of times that RAID isn’t a back-up, but this is a case where only a cheap solution will do, and RAID-5 protects against a single drive failure, which is good enough for my purposes. The dedicated card offers an enormous performance advantage, but in practice this isn’t very important. The features it adds, however, are! The Areca card offers an OS-independent RAID solution, which counts for a lot. It also offers online capacity expansion and RAID-level migration (so, for example, I could upgrade to RAID-6). Both of these features are much harder to achieve with cheaper solutions.
So, you might think, what’s the problem? The answer: the lack of options from hard disk manufacturers…
Ever since using the Areca card I have suffered from occasional drive “failures”. Upon powering off and on the drive reappears as fully functional. I then have to spend many hours rebuilding the array from degraded back to normal. After much searching I have diagnosed the problem, but am unable to properly solve it.
Hard drive manufacturers provide a range of drives for different purposes. The typical drives most of us buy are consumer-level drives. The manufacturers also offer enterprise-class drives designed for servers which have intensive use patterns and 24/7 uptime. These drives are often physically identical to the consumer models, but have undergone additional testing and are supplied with slightly different firmware, optimised for server workloads.
One of these firmware differences is Error Recovery Control (ERC). This feature is also called CCTL (Command Completion Time Limit) by Samsung and Hitachi, and TLER (Time-Limited Error Recovery) by Western Digital. All drives suffer the occasional error at a physical level, which could be caused by things like stray cosmic rays. These errors are normally handled by redundancy built into the way the drive stores data, but occasionally one is severe enough to cause problems reading data. Normal consumer drives will spend a prolonged period re-reading the damaged sector in an attempt to recover it. They then remap the data to a new part of the drive and everything continues as normal. However, this delay can cause severe problems in enterprise environments, so enterprise drives time out their self-repair attempts after a short period (usually 7 seconds or so) and report the error to the RAID controller. The RAID controller then handles the error by recalculating the data using the other drives in the array. This prevents large delays in sending data, but requires the presence of other drives and a RAID controller.
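To make "recalculating the data using the other drives" concrete, here is a minimal Python sketch of the XOR parity idea behind RAID-5. It is deliberately simplified: a real controller works on fixed-size stripes and rotates the parity block across the drives, and the block contents and the `xor_blocks` helper are made up purely for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per drive (illustrative contents, all 8 bytes long).
data = [b"music---", b"photos--", b"videos--"]

# The parity block is the XOR of all the data blocks.
parity = xor_blocks(data)

# Suppose the drive holding data[1] reports an unreadable sector.
# XOR-ing the surviving blocks with the parity block gives the lost block back.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

This is why a drive that gives up quickly is no disaster inside an array: as long as the other drives respond, the controller can regenerate the missing block itself.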
So, I have a proper hardware RAID card. It expects to hear back from drives within no more than 7–8 seconds, whether or not there has been an error. I also have consumer hard drives, which attempt to repair their own errors for a long period. So when an error occurs the drive tries to fix it, doesn’t respond within 7–8 seconds, and the RAID controller then assumes the drive has failed and kicks it out of the array.
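The interaction can be written down as a toy model. This is only a sketch of the reasoning above, not how any real controller firmware works: the function name and the 30-second recovery time are my own assumptions, while the 8-second controller timeout and 7-second ERC limit are the rough figures quoted above.

```python
from typing import Optional


def controller_outcome(recovery_needed_s: float,
                       controller_timeout_s: float,
                       erc_limit_s: Optional[float]) -> str:
    """Toy model of what happens when a drive hits a bad sector.

    erc_limit_s is None for a consumer drive with no ERC/TLER/CCTL limit.
    """
    # How long the drive stays silent before it answers the controller.
    if erc_limit_s is None:
        silent_for = recovery_needed_s
    else:
        silent_for = min(recovery_needed_s, erc_limit_s)

    if silent_for > controller_timeout_s:
        return "drive dropped from array"
    if erc_limit_s is not None and recovery_needed_s > erc_limit_s:
        return "error reported; controller rebuilds sector from parity"
    return "drive recovered the sector itself"


# Consumer drive grinding away for an assumed 30 seconds on a bad sector.
print(controller_outcome(30, controller_timeout_s=8, erc_limit_s=None))
# Same error on a drive that gives up after 7 seconds.
print(controller_outcome(30, controller_timeout_s=8, erc_limit_s=7))
```

In the first case the perfectly healthy drive is thrown out of the array; in the second the controller gets an error report in time and carries on.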
So, the obvious solutions would be either to tell the RAID controller to wait longer before kicking a drive out, or to tell the drive to give up after 7 seconds like an enterprise drive… Infuriatingly, neither is possible!
I have searched extensively, but I can’t find any proper RAID-5 cards which allow the user to change how long they will wait for a drive. In the past there were some WD drives which could have the TLER feature enabled with a utility released by WD called WD-TLER, but recently WD have disabled this option, presumably to “protect” the huge markup on their enterprise drives (which are double the price for the same hardware).
Some people have found ways to temporarily enable ERC on some drives using HDAT2, smartctl or hdparm. However, these tools do not support my RAID card under Windows, and the change is lost whenever the PC is power cycled.
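For reference, the smartctl route uses its `-l scterc,READTIME,WRITETIME` option, with both times given in tenths of a second. The small Python helper below only builds the command line and does not run anything; the helper name and the `/dev/sda` device path are placeholders of my own, and it assumes a drive that smartctl can reach directly and whose firmware accepts the setting.

```python
from typing import List


def scterc_command(device: str, read_s: float = 7.0, write_s: float = 7.0) -> List[str]:
    """Build a smartctl command that sets the read/write error recovery limits.

    smartctl expects the limits in units of 100 ms, so 7 seconds becomes 70.
    """
    read_ds = round(read_s * 10)
    write_ds = round(write_s * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


# Placeholder device path; substitute whatever your system calls the drive.
print(" ".join(scterc_command("/dev/sda")))
```

Because the drive forgets the setting on a power cycle, anyone relying on this would have to re-issue the command at every boot.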
For users like myself who need large-capacity storage and the features offered by a hardware RAID-5 solution, but who do not need 24/7 uptime, long warranties or drives designed for heavy-duty usage, there is currently NO appropriate solution. It’s about time either a drive manufacturer addressed this market (by releasing a consumer drive with ERC enabled for a small premium, e.g. 15%) or a RAID card manufacturer addressed it by offering a card with the option to increase the time before drives are timed out. Creating either of these solutions should be trivial; a simple firmware tweak would do the job.
Until then, I advise others to avoid using hardware RAID cards with consumer drives, and given the price premium of enterprise drives I recommend avoiding hardware RAID altogether.
Does Hitachi support CCTL via their feature tool? An e‑mail reply posted at HardForum seems to suggest so. Time to get a confirmation…